| Student: | 丁仕杰 Shi-Jie Ding |
|---|---|
| Thesis Title: | 通過語音命令修正實現對話式用戶界面 (Toward Conversational User Interface via Voice Command Correction) |
| Advisor: | 張嘉惠 Chia-Hui Chang |
| Committee Members: | |
| Degree: | Master |
| Department: | Department of Computer Science & Information Engineering, College of Electrical Engineering & Computer Science |
| Year of Publication: | 2025 |
| Graduation Academic Year: | 113 |
| Language: | Chinese |
| Pages: | 58 |
| Chinese Keywords: | speech recognition, error correction, voice commands, automatic correction module, Chinese natural language processing |
| English Keywords: | Automatic Speech Recognition, Error Correction, Voice Command, Automatic Correction System, Spelling Error Correction |
In recent years, artificial intelligence has advanced rapidly, and automatic speech recognition (ASR) has likewise made notable progress, finding wide use in everyday settings such as dialogue systems, smart appliances, and voice assistants. In practice, however, ASR still makes frequent errors, being especially susceptible to pronunciation variation and homophones, so that the recognized text diverges from the speaker's intent; for example, 「這個程式很棒」 ("this program is great") may be misrecognized as 「這個城市很棒」 ("this city is great").
Most prior work has focused on fully automatic error correction. While effective to a degree, such methods still struggle with proper nouns such as personal names. This thesis therefore proposes a voice-command-based ASR error correction system that lets users issue natural-language voice commands such as "add", "delete", and "modify", correcting recognition results precisely while reducing keyboard input.
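The three command types (add, delete, modify) amount to span edits on the current ASR hypothesis. As a minimal sketch, with `apply_command` and its argument names being illustrative assumptions rather than the thesis's actual implementation:

```python
def apply_command(sentence: str, op: str, target: str, new_text: str = "") -> str:
    """Apply one voice command to an ASR hypothesis.

    op is one of:
      "add"    - insert new_text right after target
      "delete" - remove target
      "modify" - replace target with new_text
    """
    idx = sentence.find(target)
    if idx == -1:
        return sentence  # target not found; leave the hypothesis unchanged
    end = idx + len(target)
    if op == "delete":
        return sentence[:idx] + sentence[end:]
    if op == "modify":
        return sentence[:idx] + new_text + sentence[end:]
    if op == "add":
        return sentence[:end] + new_text + sentence[end:]
    raise ValueError(f"unknown op: {op}")
```

For instance, a "modify" command would turn the misrecognized 「這個城市很棒」 back into 「這個程式很棒」.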
The system comprises three core modules: (1) an input classifier that decides whether a spoken input is a statement or a command; (2) a command classifier that identifies the command type; and (3) a command labeler that marks the error position and the corresponding replacement. To train these modules, we use the SIGHAN-15 and zh-tw-wikipedia corpora, simulate recognition errors via a TTS-then-ASR round trip, and generate natural commands with large language models combined with Chinese character components and common words, imitating how corrections are phrased in real use.
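The TTS-then-ASR round trip used to synthesize training errors can be sketched as follows; `synthesize` and `transcribe` are hypothetical stand-ins for whichever TTS and ASR engines were actually used, which the abstract does not name:

```python
def simulate_asr_errors(clean_sentences, synthesize, transcribe):
    """Pair each clean sentence with its (possibly erroneous) ASR output.

    synthesize: text -> audio (TTS)
    transcribe: audio -> text (ASR; may introduce homophone errors)
    Returns (noisy, clean) pairs, keeping only sentences where errors appeared.
    """
    pairs = []
    for text in clean_sentences:
        audio = synthesize(text)
        noisy = transcribe(audio)
        if noisy != text:
            pairs.append((noisy, text))
    return pairs
```

The surviving (noisy, clean) pairs then serve as supervision for the correction modules.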
Experiments show that the two original models each correctly fix more than 80% of the erroneous sentences on their respective datasets, demonstrating good accuracy and robustness. We also mixed the two datasets to train a Model-Mix model, which likewise delivers stable and strong correction performance overall. In addition, we deploy the system as an API that other speech recognition applications can integrate with, and we continuously collect real command data to refine the models. We further incorporate large language models into the system to improve command understanding, broaden its range of applications, and test whether an LLM can interpret modification commands.
In summary, this thesis presents a novel and practical workflow for correcting ASR errors that not only overcomes the limitations of fully automatic correction mechanisms but also markedly reduces users' manual input effort.
Recent advances in AI have improved ASR performance, enabling its widespread use in dialogue systems and smart devices. However, real-world ASR still struggles with errors caused by pronunciation variations and homophones.
To address limitations in prior automatic correction methods, especially with proper nouns and user-specific terms, we propose a speech-command-based ASR correction system. It allows users to issue natural-language voice instructions to refine recognition results and reduce manual input.
The system consists of three modules: an input classifier to detect
commands, a command classifier to determine instruction type, and a
command labeler to locate correction targets. We train these modules
using data from SIGHAN-15 and zh-tw-wikipedia, simulate ASR errors
via TTS/ASR, and generate realistic correction commands using LLMs
and linguistic features.
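The flow through the three modules can be sketched as a simple composition; every callable here is a hypothetical stand-in for one of the trained models, not the thesis's actual code:

```python
def correct(utterance, hypothesis, is_command, classify_command, label_spans, apply_edit):
    """Route one spoken input through the three-module pipeline.

    is_command:       input classifier (statement vs. command)
    classify_command: command classifier (add / delete / modify)
    label_spans:      command labeler (target span + replacement text)
    apply_edit:       applies the resulting edit to the current hypothesis
    """
    if not is_command(utterance):
        return hypothesis + utterance          # plain dictation: append to the transcript
    op = classify_command(utterance)
    target, new_text = label_spans(utterance)
    return apply_edit(hypothesis, op, target, new_text)
```

Keeping the modules as separate stages mirrors the description above: dictation passes through untouched, while commands are typed and then localized before any edit is applied.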
Experiments show that the original models each achieved over 80% correction accuracy, and a combined model maintained strong, stable performance. The system is deployed as an API for integration with ASR applications, with real user data continuously collected for optimization. LLMs are also integrated to enhance instruction understanding and expand the application scope.
In summary, our method provides a practical, flexible ASR correction workflow that reduces user effort and improves correction precision.