
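Graduate student: 陳明萱 (Ming-Hsuan Chen)
Thesis title: 改進自注意力機制於神經機器翻譯之研究 (A Study on Improving the Self-Attention Mechanism for Neural Machine Translation)
Advisor: 林熙禎 (Shi-Jen Lin)
Oral defense committee:
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2021
Academic year of graduation: 109 (ROC calendar)
Language: Chinese
Number of pages: 65
Keywords: Neural Machine Translation, Transformer, Self-Attention Mechanism, Gate Mechanism, Clustering Algorithms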
    Neural machine translation (NMT) uses deep learning models to translate a source-language sentence into a target language while preserving the meaning of the source sentence and producing correct syntax. The Transformer has become one of the most widely used models in recent years: its self-attention mechanism captures the global information of a sentence, and it performs well on many natural language processing (NLP) tasks. However, some studies have indicated that self-attention learns redundant information and cannot effectively learn the local information in a text. This study therefore modifies the self-attention mechanism of the Transformer by adding a gate mechanism and the K-means clustering algorithm, yielding Gated Attention and Clustered Attention, respectively. Gated Attention in turn comprises a Top-k% method and a Threshold method. By concentrating the Attention Map, these approaches strengthen the model's ability to capture local information, so that it learns more diverse relationships within a sentence and produces higher-quality translations.

    We apply the Top-k% and Threshold methods of Gated Attention, as well as Clustered Attention, to a Chinese-to-English translation task, with BLEU as the evaluation metric; they reach 25.30, 24.69, and 24.69 BLEU, respectively. The best hybrid model, which uses both attention mechanisms at the same time, reaches 24.88 BLEU, which falls short of the 25.30 BLEU obtained with the Top-k% method alone. The experiments show that the proposed models outperform the vanilla Transformer, and that using a single attention mechanism better helps the Transformer learn textual information while achieving the goal of concentrating the Attention Map.
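
    The abstract does not spell out how the gate is computed, so the following is only a minimal sketch of one plausible reading of the two Gated Attention variants: after the softmax, each row of the Attention Map keeps either its largest k% of weights (Top-k%) or the weights at or above a cutoff (Threshold), zeroes the rest, and is renormalized. This is one way to concentrate (centralize) the map on fewer positions. The function names, the renormalization step, the tie-breaking rule, and the example values are all assumptions for illustration, not the thesis's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(weights):
    """Rescale non-negative weights so they sum to 1 (all-zero rows are returned unchanged)."""
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else list(weights)


def top_k_percent_gate(weights, percent):
    """Hypothetical Top-k% gate: keep the largest `percent` % of a row (at least one weight)."""
    keep = max(1, math.ceil(len(weights) * percent / 100))
    # Indices of the `keep` largest weights; ties are broken by position.
    kept = set(sorted(range(len(weights)), key=lambda i: -weights[i])[:keep])
    return renormalize([w if i in kept else 0.0 for i, w in enumerate(weights)])


def threshold_gate(weights, threshold):
    """Hypothetical Threshold gate: zero weights below `threshold`, always keeping the row maximum."""
    peak = max(weights)
    return renormalize([w if (w >= threshold or w == peak) else 0.0 for w in weights])


# One row of an attention map: a query attending over five key positions.
row = softmax([2.0, 1.0, 0.5, 0.1, -1.0])
top_k_row = top_k_percent_gate(row, percent=40)     # keeps 2 of 5 positions
threshold_row = threshold_gate(row, threshold=0.1)  # keeps the positions with weight >= 0.1
```

    In this reading, both gates leave a sparser row that still sums to 1; the Top-k% gate fixes how many positions survive, whereas the Threshold gate lets that number vary with how peaked the row already is.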

    Abstract (Chinese)
    Abstract (English)
    Acknowledgments
    Table of Contents
    List of Figures
    List of Tables
    1. Introduction
        1-1 Research Background
        1-2 Research Motivation
        1-3 Research Objectives
        1-4 Thesis Organization
    2. Literature Review
        2-1 Neural Machine Translation
        2-2 Encoder-Decoder Architecture
            2-2-1 RNN
            2-2-2 LSTM
            2-2-3 RNN Encoder-Decoder
        2-3 Transformer
            2-3-1 Word Embeddings
            2-3-2 Residual Connections and Layer Normalization
            2-3-3 FFN
            2-3-4 Linear Layer and Softmax
        2-4 Attention Mechanism
            2-4-1 Self-Attention Mechanism
            2-4-2 Multi-Head Attention Mechanism
            2-4-3 Related Work on Self-Attention
        2-5 Clustering Algorithms
            2-5-1 K-means
            2-5-2 Choosing K
    3. Methodology
        3-1 Data Preprocessing
        3-2 Model Training
            3-2-1 Attention Map
            3-2-2 Gated Attention
            3-2-3 Clustered Attention
            3-2-4 Multi-Head Attention Mechanism
        3-3 Evaluation
            3-3-1 Generating Translations
            3-3-2 Computing BLEU
    4. Experiments
        4-1 Experimental Setup
            4-1-1 Environment and Parameter Settings
            4-1-2 Dataset
        4-2 Experimental Design and Results
            4-2-1 Experiment 1: Model Performance under Different Hyperparameter Settings
            4-2-2 Experiment 2: Performance of Gated Attention and Clustered Attention
            4-2-3 Experiment 3: Model Performance under Different Combinations of Attention Heads
        4-3 Discussion and Analysis
            4-3-1 Analysis of Attention Maps
            4-3-2 Analysis of the Optimal K
    5. Conclusion and Future Directions
        5-1 Conclusion
        5-2 Research Limitations
        5-3 Future Research Directions
    References

