

Author: Shi-Jie Ding (丁仕杰)
Title: Toward Conversational User Interface via Voice Command Correction (通過語音命令修正實現對話式用戶界面)
Advisor: Chia-Hui Chang (張嘉惠)
Oral defense committee:
Degree: Master
Department: College of Electrical Engineering & Computer Science, Department of Computer Science & Information Engineering
Year of publication: 2025
Academic year of graduation: 113
Language: Chinese
Pages: 58
Chinese keywords: ASR error correction; voice command; automatic correction module; Chinese natural language processing
English keywords: Automatic Speech Recognition; Error Correction; Voice Command; Automatic Correction System; Spelling Error Correction
In recent years, artificial intelligence has advanced rapidly, and automatic speech recognition (ASR) has likewise made significant progress, finding wide use in everyday scenarios such as dialogue systems, smart appliances, and voice assistants. In practice, however, ASR still produces frequent errors. It is especially vulnerable to pronunciation variation and homophones, so the recognized text can diverge from the intended meaning; for example, 「這個程式很棒」 ("this program is great") may be misrecognized as 「這個城市很棒」 ("this city is great").
Most prior work focuses on fully automatic error correction. While somewhat effective, such methods still struggle to correct proper nouns such as personal names. We therefore propose a voice-command-based ASR error correction system that lets users issue natural-language voice commands such as "add", "delete", and "modify", so that recognition results can be corrected precisely with less keyboard input.
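The add/delete/modify commands amount to small edit operations on the transcript. A minimal sketch of applying one such command, assuming it has already been decomposed into an action, a target span, and replacement text (the function name and arguments here are illustrative, not the thesis's actual interface):

```python
# Illustrative sketch: apply a parsed voice command to an ASR transcript.
# The action names and argument layout are assumptions for illustration.

def apply_command(transcript: str, action: str, target: str, new_text: str = "") -> str:
    """Apply one add/delete/modify edit extracted from a voice command."""
    if action == "modify":
        # Replace the misrecognized span with the intended text.
        return transcript.replace(target, new_text)
    if action == "delete":
        # Remove the unwanted span entirely.
        return transcript.replace(target, "")
    if action == "add":
        # Insert new text immediately after the anchor span.
        return transcript.replace(target, target + new_text)
    return transcript

# The paper's example: 「這個程式很棒」 misrecognized as 「這個城市很棒」.
corrected = apply_command("這個城市很棒", "modify", "城市", "程式")
```

A real command labeler must also disambiguate which occurrence of the target to edit; `str.replace` here edits every occurrence and is only a stand-in.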
The system comprises three core modules: (1) an input classifier that decides whether a voice input is narration or a command; (2) a command classifier that identifies the command type; and (3) a command labeler that marks the error position and the corresponding edit. To train these modules we use the SIGHAN-15 and zh-tw-wikipedia corpora, simulate errors with TTS and ASR, and generate natural commands using large language models together with Chinese character components and common words, emulating how corrections are phrased in real use.
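The three-stage flow can be sketched as a simple dispatch. The keyword heuristics below are placeholder stubs for the trained classifiers described above, not the thesis's actual models:

```python
# Sketch of the three-module pipeline; keyword heuristics stand in for
# the trained input classifier, command classifier, and command labeler.

def is_command(utterance: str) -> bool:
    """Input classifier stub: narration vs. correction command."""
    return any(kw in utterance for kw in ("改成", "刪除", "新增"))

def classify_command(utterance: str) -> str:
    """Command classifier stub: which command type was issued."""
    if "改成" in utterance:
        return "modify"
    if "刪除" in utterance:
        return "delete"
    return "add"

def label_command(utterance: str) -> tuple[str, str]:
    """Command labeler stub for 「把 X 改成 Y」-style modify commands."""
    body = utterance.split("把", 1)[1]
    target, new_text = body.split("改成", 1)
    return target, new_text

def route(utterance: str) -> str:
    """Dispatch an utterance through the three stages."""
    if not is_command(utterance):
        return "narration"
    kind = classify_command(utterance)
    if kind == "modify":
        target, new_text = label_command(utterance)
        return f"modify: {target} -> {new_text}"
    return kind
```

Splitting the problem into detect, classify, and label keeps each model's task small, which is the design motivation the text describes.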
Experimental results show that the two original models each correctly fix more than 80% of the erroneous sentences in their respective datasets, demonstrating good accuracy and fault tolerance. We also mixed the two datasets to train a Model-Mix model, which delivers stable and strong correction performance overall. In addition, we deploy the system as an API so that other speech recognition applications can integrate with it, and we continuously collect real command data to refine the models. We further incorporate large language models to improve command understanding, broaden the system's range of applications, and test whether an LLM can understand modification commands.
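Since the system is exposed as an API, an ASR front end would integrate by posting the transcript and the spoken command. A hedged client sketch follows; the endpoint path and JSON field names are invented for illustration, as the abstract does not specify them:

```python
import json
import urllib.request

def build_request(transcript: str, command: str) -> dict:
    """Assemble the JSON body a client would POST to the correction service.
    The field names are hypothetical."""
    return {"transcript": transcript, "command": command}

def correct(transcript: str, command: str, base_url: str = "http://localhost:8000") -> str:
    """POST to a hypothetical /correct endpoint and return the corrected text."""
    payload = json.dumps(build_request(transcript, command)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/correct",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["corrected"]
```

Keeping the interface to a single transcript-plus-command call is what lets the correction pipeline sit behind any ASR application, as the text notes.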
In summary, this work presents a novel and practical ASR error correction workflow that both overcomes the limitations of automatic correction mechanisms and substantially reduces users' manual input cost.


Recent advances in AI have improved ASR performance, enabling its widespread use in dialogue systems and smart devices. However, real-world ASR still struggles with errors caused by pronunciation variations and homophones.

To address limitations in prior automatic correction methods, especially with proper nouns and user-specific terms, we propose a speech-command-based ASR correction system. It allows users to issue natural language voice instructions to refine recognition results and reduce manual input.

The system consists of three modules: an input classifier to detect commands, a command classifier to determine instruction type, and a command labeler to locate correction targets. We train these modules using data from SIGHAN-15 and zh-tw-wikipedia, simulate ASR errors via TTS/ASR, and generate realistic correction commands using LLMs and linguistic features.

Experiments show that the original models each achieved over 80% correction accuracy, and a combined model maintained strong, stable performance. The system is deployed as an API for integration with ASR applications, with real user data continuously collected for optimization. LLMs are also integrated to enhance instruction understanding and expand application scope.

In summary, our method provides a practical, flexible ASR correction workflow that reduces user effort and improves correction precision.

Table of Contents

Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
1. Introduction
2. Related Work
   2-1 ASR Error Correction
   2-2 Spelling Error Correction
3. Methodology
   3-1 System Structure
       3-1-1 Input Classifier
       3-1-2 Command Classifier
       3-1-3 Command Labeler
   3-2 Preparation of the Dataset
       3-2-1 Command Data Generation
   3-3 Type of Data in Each Module
       3-3-1 Input Classifier Dataset
       3-3-2 Command Classifier Dataset
       3-3-3 Command Labeler Dataset
4. Experiments
   4-1 Experimental Setup
   4-2 Input Classifier
   4-3 Command Classifier
   4-4 Command Labeler
   4-5 Overall Performance of Our System
   4-6 The Effect of ASR Errors in Voice Commands
5. Real-World Usage
   5-1 Building the API
   5-2 Practical Applications
   5-3 Correction Speed Comparison
6. Command Correction with LLMs
   6-1 Command Correction with LLMs
       6-1-1 LLM Input Classifier
       6-1-2 LLM Error Corrector
   6-2 Experiments & Test Cases
       6-2-1 LLM Input Classifier Performance
       6-2-2 Case Study: Operation Commands
       6-2-3 Performance of the LLM Error Corrector
       6-2-4 Inference Time Comparison
       6-2-5 Case Study: Commands in Different Narrative Styles
7. Conclusion and Future Work
Index
References
8. Appendix
   8-1 Command Generation Prompts
   8-2 LLM Input Classifier Prompt
   8-3 LLM Error Corrector Prompt
   8-4 Manually Annotated Data & LLM Prediction Results

