| Student: | Chia-En Wu (伍家恩) |
|---|---|
| Thesis title: | A Corpus Crawler for Taiwanese Mandarin Audio Transcription Using Deep Speech |
| Advisor: | Min-Te Sun (孫敏德) |
| Committee members: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science & Information Engineering |
| Year of publication: | 2021 |
| Academic year of graduation: | 109 |
| Language: | English |
| Number of pages: | 47 |
| Chinese keywords: | speech recognition (語音辨識), Taiwanese accent (台灣口音), dataset processing (資料集處理) |
| English keywords: | Common Voice, Deep Speech, Speech Recognition |
With the advance of technology, speech recognition has gradually been applied in many areas, such as voice input and smart assistants. In recent years, as deep learning techniques have matured, speech recognition models and the corresponding datasets for mainstream languages, such as English and Mainland-accented Mandarin, have gradually been released. As a result, recognition accuracy for these mainstream languages is usually far higher than for less common languages such as Taiwanese-accented Mandarin. Taiwanese Mandarin differs from Mainland Mandarin in many respects; only their sentence structures are relatively similar. Therefore, for a Mandarin speech recognition model developed for the Mainland accent to correctly recognize Taiwanese-accented Mandarin, a large Taiwanese-accented dataset must first be collected to retrain the model.
To this end, this thesis proposes a collection system for Taiwanese-accented Mandarin speech corpora, which automatically collects Taiwanese-accented Mandarin audio files and the corresponding transcripts from YouTube videos. By exploiting the closed-caption (CC) subtitles on YouTube, the collection process is greatly simplified and its speed substantially improved. In addition, we design a series of pre-processing algorithms to resolve pronunciation-related issues in the transcripts, including the removal of unnecessary content (e.g., extra line breaks, spaces, punctuation, and foreign-language text) and the identification of the correct Mandarin pronunciation of Arabic numerals. Using this system, we collected a 30-hour Taiwanese-accented Mandarin speech corpus from YouTube and used it to improve the accuracy of the Deep Speech recognition model. The final experimental results show that as the amount of training data increases, the model's average character and word error rates decrease progressively in a better-than-linear fashion.
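As a concrete illustration of the subtitle-based collection step, the sketch below parses a WebVTT closed-caption file (the subtitle format youtube-dl can save alongside a downloaded video) into timed segments that could later be used to cut the audio into utterance-sized clips. The function names and parsing rules here are illustrative assumptions, not the thesis's actual implementation.

```python
import re

# Match a WebVTT cue timing line, e.g. "00:00:01.000 --> 00:00:03.500".
CUE_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)

def _to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_vtt(vtt_text):
    """Return a list of (start_sec, end_sec, caption_text) tuples."""
    segments = []
    lines = vtt_text.splitlines()
    i = 0
    while i < len(lines):
        m = CUE_RE.search(lines[i])
        if m:
            start = _to_seconds(*m.groups()[:4])
            end = _to_seconds(*m.groups()[4:])
            # The caption text is the following non-empty lines,
            # up to the blank line that ends the cue.
            i += 1
            text_lines = []
            while i < len(lines) and lines[i].strip():
                text_lines.append(lines[i].strip())
                i += 1
            segments.append((start, end, " ".join(text_lines)))
        else:
            i += 1
    return segments
```

Each resulting tuple pairs a stretch of audio with its transcript, which is exactly the alignment a speech-recognition training corpus needs.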
Speech recognition is considered an enabling technology for many services, such as voice input and smart assistants. As deep learning techniques have developed, many speech recognition models and public corpus datasets have been released for common languages, such as English and Mainland Chinese Mandarin. As a consequence, the accuracy of speech recognition for these common languages is usually much higher than that for Taiwanese Mandarin. While Taiwanese Mandarin differs from Chinese Mandarin in several ways, the two share a very similar sentence structure. Hence, the models developed for Chinese Mandarin should also work well for Taiwanese Mandarin, provided the Taiwanese Mandarin corpus is adequately large. In this thesis, we propose a corpus crawler that automatically collects a Taiwanese Mandarin audio and transcript dataset from YouTube videos. By utilizing the closed-caption subtitles of YouTube videos, the design of the crawler is greatly simplified, which also improves its speed. In addition, several pre-processing tasks are performed to resolve the issue of context-dependent pronunciation, including the removal of unnecessary content and the identification of the correct pronunciation of Arabic numerals. The proposed crawler is used to collect 30 hours of Taiwanese Mandarin corpus data, which is then used to retrain Deep Speech, a well-known speech recognition model. The experimental results show that a linear increase in dataset size yields a better-than-linear decrease in the average character and word error rates.
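The two pre-processing steps described in the abstract can be sketched as follows: stripping content that has no spoken counterpart (extra whitespace, punctuation, Latin-script foreign words) and rewriting Arabic numerals with a Mandarin reading, digit by digit for long strings such as phone numbers and positionally for ordinary quantities. This is a minimal sketch under those assumptions, not the thesis's exact algorithm.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]

def number_to_mandarin(num_str):
    """Convert a digit string to a Mandarin reading.

    Long digit strings (phone numbers, IDs) are read digit by digit;
    shorter numbers get the usual positional reading.
    """
    if len(num_str) > 4:
        return "".join(DIGITS[int(d)] for d in num_str)
    if int(num_str) == 0:
        return "零"
    out = []
    for pos, ch in enumerate(num_str):
        place = len(num_str) - pos - 1
        d = int(ch)
        if d == 0:
            # Insert a single 零 for interior zeros only.
            if out and out[-1] != "零" and place > 0:
                out.append("零")
        else:
            out.append(DIGITS[d] + UNITS[place])
    result = "".join(out).rstrip("零")
    if result.startswith("一十"):  # 12 reads 十二, not 一十二
        result = result[1:]
    return result

def clean_transcript(text):
    """Rewrite numerals, then keep only CJK characters, dropping
    punctuation, whitespace, line breaks, and Latin-script words."""
    text = re.sub(r"\d+", lambda m: number_to_mandarin(m.group()), text)
    return re.sub(r"[^\u4e00-\u9fff]", "", text)
```

For example, `clean_transcript("收集了30小時的YouTube影片!")` yields `收集了三十小時的影片`, a string whose characters all correspond to spoken syllables in the audio.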