| Graduate Student: | 陳明萱 Ming-Hsuan Chen |
|---|---|
| Thesis Title: | A Study on Improving the Self-Attention Mechanism in Neural Machine Translation (改進自注意力機制於神經機器翻譯之研究) |
| Advisor: | 林熙禎 Shi-Jen Lin |
| Oral Defense Committee: | |
| Degree: | Master |
| Department: | Department of Information Management, College of Management |
| Year of Publication: | 2021 |
| Graduating Academic Year: | 109 |
| Language: | Chinese |
| Number of Pages: | 65 |
| Keywords (Chinese): | 神經機器翻譯, Transformer, 自注意力機制, Gate機制, 分群演算法 |
| Keywords (English): | Neural Machine Translation, Transformer, Self-Attention Mechanism, Gate Mechanism, Clustering Algorithms |
The goal of neural machine translation is to use a deep learning model to convert a source-language sentence into the target language while preserving the source sentence's semantics and producing correct syntax. The Transformer has become one of the most widely used models in recent years: its self-attention mechanism captures global information about a sentence and performs well on many natural language processing tasks. However, studies have pointed out that self-attention can learn redundant information and cannot effectively capture local information in text. This thesis therefore improves the self-attention mechanism in the Transformer by adding a gate mechanism and the K-means clustering algorithm, yielding Gated Attention and Clustered Attention respectively, where Gated Attention further comprises a Top-k% method and a Threshold method. By concentrating the attention map, these approaches strengthen the model's ability to capture local information, allowing it to learn more diverse sentence relationships and improve translation quality.
We apply the Top-k% and Threshold methods of Gated Attention, as well as Clustered Attention, to a Chinese-to-English translation task, reaching 25.30, 24.69, and 24.69 BLEU respectively. A hybrid model that combines both attention mechanisms achieves at best 24.88 BLEU, which does not surpass using a single method. The experiments confirm that the proposed improvements outperform the original Transformer, and further show that using only one attention mechanism better helps the Transformer learn textual information while achieving the goal of a concentrated attention map.
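The two gating variants summarized above can be sketched concretely. Below is a minimal NumPy sketch, not the thesis's actual implementation: `gated_attention_topk` keeps only the top-k% largest attention weights per query and renormalizes them, while `gated_attention_threshold` zeroes every weight below a fixed cutoff `tau`. The function names, signatures, and exact gating details are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention_topk(Q, K, V, k_percent=50.0):
    """Top-k% gating (illustrative): per query row, keep only the k%
    largest attention weights, zero the rest, renormalize survivors."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)          # (n_q, n_k)
    n_keep = max(1, int(np.ceil(weights.shape[-1] * k_percent / 100.0)))
    # Per-row cutoff: the n_keep-th largest weight in that row.
    cutoff = np.sort(weights, axis=-1)[:, -n_keep][:, None]
    gated = np.where(weights >= cutoff, weights, 0.0)
    gated /= gated.sum(axis=-1, keepdims=True)
    return gated @ V, gated

def gated_attention_threshold(Q, K, V, tau=0.1):
    """Threshold gating (illustrative): zero every weight below tau,
    keeping at least the largest weight so no row becomes empty."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)
    cutoff = np.minimum(tau, weights.max(axis=-1, keepdims=True))
    gated = np.where(weights >= cutoff, weights, 0.0)
    gated /= gated.sum(axis=-1, keepdims=True)
    return gated @ V, gated
```

Both variants concentrate each row of the attention map on fewer positions, which is the "centralization" effect the abstract describes.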
The purpose of Neural Machine Translation (NMT) is to translate a source sentence into a target sentence with deep learning models while preserving the semantics of the source sentence and producing correct syntax. The Transformer has recently become one of the most commonly used models. It captures the global information of a sentence through the Self-Attention Mechanism and performs well on many Natural Language Processing (NLP) tasks. However, some studies have indicated that the Self-Attention Mechanism learns repetitive information and cannot learn the local information of texts effectively. Therefore, we modify the Self-Attention Mechanism in the Transformer and propose Gated Attention and Clustered Attention by adding a Gate Mechanism and the K-means clustering algorithm, respectively. Gated Attention further includes a Top-k% method and a Threshold method. These approaches centralize the Attention Map, improving the model's ability to capture local information and to learn more diverse relationships within sentences, so that the Transformer can produce higher-quality translations.
In this work, we apply Clustered Attention as well as the Top-k% and Threshold methods of Gated Attention to Chinese-to-English translation tasks, obtaining 24.69, 25.30, and 24.69 BLEU, respectively. A hybrid model that combines both attention mechanisms achieves at best 24.88 BLEU, which is no better than using a single attention mechanism. Our experiments show that the proposed models outperform the vanilla Transformer, and that using only one attention mechanism helps the Transformer learn textual information better while achieving the goal of Attention Map centralization.
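Clustered Attention, as named in the abstract, pairs self-attention with K-means. The abstract does not specify what is clustered, so the NumPy sketch below makes one plausible, clearly hypothetical choice: queries are grouped by K-means, and every query in a cluster shares the attention distribution computed from its cluster centroid.

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Plain Lloyd's K-means: random initial centroids from the data,
    then alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

def clustered_attention(Q, K, V, n_clusters=2):
    """Illustrative Clustered Attention: compute one scaled dot-product
    attention distribution per cluster centroid, then broadcast each
    cluster's output back to all queries assigned to it."""
    d = Q.shape[-1]
    labels, centroids = kmeans(Q, n_clusters)
    scores = centroids @ K.T / np.sqrt(d)                  # (n_clusters, n_k)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    weights = e / e.sum(-1, keepdims=True)
    return (weights @ V)[labels]                           # (n_q, d)
```

Sharing one distribution per cluster makes similar queries attend identically, which is one way of concentrating the attention map as the abstract describes.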