
IIITH-CSTD Corpus: Crowdsourced Strategies for the Collection of a Large-scale Telugu Speech Corpus

Published: 20 July 2023

Abstract

Due to the lack of large annotated speech corpora, many low-resource Indian languages cannot take advantage of recent advances in deep neural network architectures for Automatic Speech Recognition (ASR). Collecting large-scale databases is expensive and time-consuming, and traditional expert-based acquisition guidelines are tedious and complex to follow, so current collection approaches largely forgo them. In this work, we present the International Institute of Information Technology Hyderabad-Crowd Sourced Telugu Database (IIITH-CSTD), a Telugu corpus collected through crowdsourcing, with the primary objective of mitigating the low-resource problem for Telugu. We describe the data sources, the crowdsourcing pipeline, and the protocols used to collect the corpus. Approximately 2,000 hours of transcribed audio are presented and released with this article, covering the three major regional dialects of Telugu in three speaking styles (read, conversational, and spontaneous) on topics such as politics, sports, arts, and science. We also report experimental results on ASR tasks using the collected corpus. We hope this work will motivate researchers to curate large-scale annotated speech data for other low-resource Indic languages.
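A corpus like this is typically distributed as audio files plus a metadata manifest carrying dialect and speaking-style labels per utterance. As a minimal illustrative sketch only (the CSV manifest, its column names, and the file path below are hypothetical assumptions, not the authors' released layout), the following Python snippet shows how per-dialect, per-style hour counts such as the ~2,000-hour total above could be aggregated from such a manifest:

```python
# Illustrative only: this manifest schema (CSV columns below) is a
# hypothetical layout, not the released IIITH-CSTD format.
import csv
from collections import defaultdict

def hours_by_dialect_and_style(manifest_path: str) -> dict:
    """Sum transcribed-audio hours per (dialect, speaking-style) pair."""
    totals = defaultdict(float)  # (dialect, style) -> seconds
    with open(manifest_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # e.g. dialect "telangana", style "read" / "conversational" / "spontaneous"
            key = (row["dialect"], row["style"])
            totals[key] += float(row["duration_sec"])
    # Report in hours rather than seconds.
    return {key: secs / 3600.0 for key, secs in totals.items()}

if __name__ == "__main__":
    for (dialect, style), hrs in sorted(
        hours_by_dialect_and_style("manifest.csv").items()
    ):
        print(f"{dialect:12s} {style:14s} {hrs:8.1f} h")
```

Keeping per-utterance durations in the manifest, rather than re-decoding audio, makes sanity checks like this cheap when validating crowdsourced submissions.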


        Published In

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 7
        July 2023, 422 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/3610376

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 20 July 2023
        Online AM: 12 June 2023
        Accepted: 10 May 2023
        Revised: 18 March 2023
        Received: 15 June 2022
        Published in TALLIP Volume 22, Issue 7


        Author Tags

        1. Speech recognition
        2. dialects
        3. TDNN
        4. End-to-End
        5. resource creation
        6. low-resource languages

        Qualifiers

        • Research-article

        Funding Sources

        • Technology Development for Indian Languages (TDIL), Ministry of Electronics and Information Technology (MeitY), Government of the Republic of India

