
Graduate Student: 余昌翰 (Chang-Han Yu)
Thesis Title: 基於Transformer及姿態辨識之即時手語翻譯系統
(The Real-Time Sign Language Translation System Based on Transformer and Pose Estimation)
Advisor: 蘇木春 (Mu-Chun Su)
Oral Examination Committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science & Information Engineering
Publication Year: 2022
Graduation Academic Year: 110
Language: Chinese
Pages: 79
Chinese Keywords: 深度學習 (deep learning), 自然語言處理 (natural language processing), 影像處理 (image processing), 電腦視覺 (computer vision)
Foreign Keywords: Deep Learning, Natural Language Processing, Image Processing, Computer Vision
    According to 2021 statistics from Taiwan's Ministry of Health and Welfare,
    about 1,198,000 people in Taiwan hold disability certificates, roughly 5%
    of the total population; among them, 125,764 have hearing impairments.
    Because of hearing loss since childhood, hearing-impaired people often
    face many difficulties with oral pronunciation and learning, and therefore
    use sign language as their main means of communication.

    Today, when sign language users consume media that relies heavily on
    hearing, such as TV news, election debates, and live press conferences,
    they can often only follow the content through subtitles. Government-run
    public broadcasts, such as election debates and epidemic-prevention press
    conferences, are often accompanied by a sign language interpreter, who
    renders the speaker's spoken Chinese into sign language for viewers,
    making the content easier for sign language users to understand. However,
    because interpreters remain few in number, they can be assigned to only a
    small number of occasions. How to give the hearing-impaired the same
    experience as ordinary audiences has thus become a major challenge for
    modern media.

    This study combines techniques from two major areas of deep learning,
    natural language processing and pose estimation, to develop a system that
    translates sign language in real time and performs the signs with a
    virtual avatar. A 3D pose estimation model converts videos of individual
    sign-language words into a gesture dataset; a third-party speech
    recognition service transcribes the user's speech into Chinese sentences;
    a natural language processing model converts each Chinese sentence into a
    sequence of sign-language words; the word sequence is matched against the
    gesture dataset; and the matched gestures are passed to the virtual
    avatar, which performs them. All stages are chained into a complete
    user-facing system for real-time sign language translation.

    In addition, this study experiments with and applies several signal
    smoothing techniques to mitigate the temporal jitter common in pose
    estimation, making the avatar's signing closer to that of a real person.


    According to 2021 statistics from Taiwan's Ministry of Health and Welfare,
    about 1,198,000 people in Taiwan hold disability certificates, roughly 5%
    of the total population, including 125,764 people with hearing
    impairments. Hearing-impaired people often face many difficulties with
    oral pronunciation and learning due to hearing loss since childhood, and
    therefore use sign language as their main means of communication.

    Today, when sign language users consume media that relies heavily on
    hearing, such as TV news, election debates, and live press conferences,
    they can often only follow the content through subtitles. Government-run
    public broadcasts, such as election debates and epidemic-prevention press
    conferences, are often accompanied by a sign language interpreter, who
    converts the speaker's spoken content into sign language, making it easier
    for sign language users to understand. However, because the number of
    interpreters is still limited, they can be deployed on only a few
    occasions. Enabling the hearing-impaired to have the same experience as
    ordinary audiences is therefore a major issue for modern media.

    This research combines techniques from two major fields of deep learning,
    natural language processing and pose estimation, to develop a system that
    performs sign language translation in real time and uses a virtual avatar
    to perform the signs. A 3D pose estimation model converts single-word sign
    language videos into a gesture dataset; a third-party speech recognition
    service transcribes the user's speech into Chinese sentences; a natural
    language processing model converts the Chinese sentences into sign-word
    sequences; the sign-word sequences are matched against the gesture
    dataset; and the matched gestures are passed to the avatar, which performs
    them. All stages are then connected into a complete user-facing system for
    real-time sign language interpretation.
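    The stage-by-stage pipeline described above can be sketched in a few lines
    of Python. Everything here is a hypothetical illustration: the toy dataset
    entries, the function names, and the character-level stand-in for the NLP
    stage are not the thesis's actual implementation.

    ```python
    # Illustrative sketch of the pipeline: (recognized) Chinese sentence
    # -> sign-word sequence -> gesture lookup -> frames for the avatar.

    # Hypothetical gesture dataset: sign word -> recorded 3D joint frames,
    # as produced by the pose estimation stage from single-word sign videos.
    GESTURE_DATASET = {
        "你": [[(0.10, 0.22, 0.31)], [(0.12, 0.24, 0.33)]],
        "好": [[(0.40, 0.51, 0.60)], [(0.42, 0.50, 0.62)]],
    }

    def to_sign_words(sentence: str) -> list[str]:
        """Stand-in for the NLP stage: convert a Chinese sentence into an
        ordered sequence of sign-language words (character-level here)."""
        return list(sentence)

    def match_gestures(sign_words: list[str]) -> list:
        """Match the sign-word sequence against the gesture dataset,
        skipping words with no recorded gesture."""
        return [GESTURE_DATASET[w] for w in sign_words if w in GESTURE_DATASET]

    def run_pipeline(sentence: str) -> list:
        """Sentence (from speech recognition) -> gesture frame sequences
        that would be streamed to the avatar for rendering."""
        return match_gestures(to_sign_words(sentence))
    ```

    A real implementation would replace `to_sign_words` with the trained
    language model and stream the matched frames to the Unity avatar; the
    dictionary lookup stands in for the sequence-to-dataset matching step.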

    In addition, this study experiments with and applies a variety of signal
    smoothing techniques to mitigate the temporal jitter common in pose
    estimation, so that the avatar's signing is closer to that of a real
    person.
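    The thesis's outline names the Kalman filter and the One Euro Filter among
    the smoothing techniques considered. As one illustration, here is a minimal
    Python sketch of the One Euro Filter (Casiez et al., 2012), a
    speed-adaptive low-pass filter: at low speeds it smooths aggressively to
    suppress jitter, and at high speeds it raises the cutoff to reduce lag.
    The parameter defaults shown are generic, not the settings tuned in this
    work.

    ```python
    import math

    class _LowPass:
        """First-order low-pass filter: y = a*x + (1 - a)*y_prev."""
        def __init__(self):
            self.prev = None

        def apply(self, x, alpha):
            self.prev = x if self.prev is None else alpha * x + (1 - alpha) * self.prev
            return self.prev

    class OneEuroFilter:
        """Speed-adaptive low-pass filter (Casiez, Roussel, Vogel, CHI 2012)."""
        def __init__(self, freq, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
            self.freq = freq              # sampling frequency (Hz)
            self.min_cutoff = min_cutoff  # cutoff at zero speed (Hz)
            self.beta = beta              # speed coefficient: higher = less lag
            self.d_cutoff = d_cutoff      # cutoff for the derivative estimate (Hz)
            self._x = _LowPass()
            self._dx = _LowPass()

        def _alpha(self, cutoff):
            # Smoothing factor for a given cutoff at this sampling frequency.
            tau = 1.0 / (2.0 * math.pi * cutoff)
            return 1.0 / (1.0 + tau * self.freq)

        def apply(self, x):
            prev = self._x.prev
            dx = 0.0 if prev is None else (x - prev) * self.freq
            edx = self._dx.apply(dx, self._alpha(self.d_cutoff))
            # Cutoff grows with the (smoothed) speed of the signal.
            cutoff = self.min_cutoff + self.beta * abs(edx)
            return self._x.apply(x, self._alpha(cutoff))
    ```

    In a landmark-smoothing setting, one filter instance would be kept per
    joint coordinate (e.g., the x, y, z of each hand landmark) and fed each
    new frame.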

    Abstract (Chinese)
    Abstract (English)
    Table of Contents
    1. Introduction
       1.1 Research Motivation
       1.2 Research Objectives
       1.3 Thesis Organization
    2. Background and Literature Review
       2.1 Background
           2.1.1 Taiwan Sign Language
           2.1.2 Transformer
           2.1.3 BERT
           2.1.4 Siamese Networks
           2.1.5 Sentence-BERT
           2.1.6 Unity
           2.1.7 Pose Estimation
           2.1.8 MediaPipe Hand
           2.1.9 Kalman Filter
           2.1.10 One Euro Filter
       2.2 Literature Review
           2.2.1 Related Work on Deep Learning for Sign Language Recognition
           2.2.2 Related Work on 3D Hand Pose Estimation
           2.2.3 Related Work on Spoken-Language-to-Sign-Language Translation
    3. System Description and Research Methods
       3.1 System Architecture
       3.2 Data Adjustment Tool
           3.2.1 Internal System Description
           3.2.2 Obtaining Corrected Joint Trajectories
           3.2.3 Real-Time Test Results
       3.3 Text Processing System
           3.3.1 Internal System Description
           3.3.2 Input Processing
       3.4 User System
       3.5 Data Transmission
       3.6 Input Word Preprocessing
           3.6.1 Sentence-Pattern Translation
           3.6.2 Word Segmentation and Part-of-Speech Tagging
           3.6.3 Matching Word Sequences Against the Dataset
           3.6.4 Semantic Recognition Dataset
       3.7 Gesture Dataset Collection and Storage
       3.8 Reducing Temporal Jitter
           3.8.1 Filter Selection
           3.8.2 X and Y Axes
           3.8.3 Z Axis
    4. Experimental Design and Results
       4.1 Sentence-BERT Training Results
       4.2 Temporal Jitter Mitigation Results
       4.3 Accuracy of Output Sign-Word Sequences
       4.4 Survey on Visualization Effectiveness
    5. Conclusion
       5.1 Conclusions
       5.2 Future Work
    References
    Appendix A: Practical Sign Language Teaching Materials - Sample Dialogues

