| Graduate Student: | 童張銓 Chang-Chuan Tung |
|---|---|
| Thesis Title: | 利用BERT語言模型辨識社群媒體資源之資安威脅預警系統 (To identify cybersecurity threat of social media and notification solution by BERT) |
| Advisor: | 陳奕明 Yi-Ming Chen |
| Oral Examination Committee: | |
| Degree: | 碩士 Master |
| Department: | Executive Master of Information Management, College of Management (管理學院 - 資訊管理學系在職專班) |
| Publication Year: | 2022 |
| Academic Year of Graduation: | 110 |
| Language: | Chinese |
| Pages: | 63 |
| Chinese Keywords: | BERT、深度學習、資訊安全、自然語言處理、命名實體識別 |
| English Keywords: | BERT, Deep Learning, Cybersecurity, Natural Language Processing, Named Entity Recognition |
Continuously raising awareness of network threats and establishing preventive mechanisms is an important task in ensuring enterprise information security. Network and security experts inside an enterprise must be able to obtain, in real time, the latest security incidents and threat information related to the enterprise's software and hardware, so that countermeasures can be taken before an incident occurs. This process depends on the range of information sources available to the in-house experts and on their working efficiency, and it operates in a passive, receive-only mode. With the growth of social media and the widespread use of open-source intelligence, platforms such as Twitter now also carry the latest news on cybersecurity threat events; their immediacy and the volume of posts they contain are expected to compensate for the scarcity of sources and the limits of manual efficiency.
This study therefore collects the latest cybersecurity threat events from Twitter, gathers a large set of threat-related keywords for natural language processing, uses named entity recognition to label entities related to computer software and hardware, compares those entities against the software and hardware currently used inside the enterprise, and then delivers the relevant threat events to users or administrators so that countermeasures can be taken early.
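As a minimal sketch of the keyword-based collection step described above, assuming an illustrative keyword set and example posts (not the data actually used in the thesis):

```python
# Minimal sketch of keyword-based filtering of collected posts.
# The keyword set and example posts are illustrative assumptions.
SECURITY_KEYWORDS = {"vulnerability", "exploit", "cve", "rce", "patch"}

def is_security_related(text: str) -> bool:
    """Return True if any security keyword appears in the post."""
    tokens = (tok.strip(".,:;!?") for tok in text.lower().split())
    return any(tok in SECURITY_KEYWORDS for tok in tokens)

posts = [
    "New exploit released for a remote code execution vulnerability",
    "Great weather in Taipei today",
]
flagged = [p for p in posts if is_security_related(p)]  # keeps only the first post
```

In practice the collected posts would come from the Twitter platform rather than a hard-coded list, and the keyword list would cover the many threat-related terms the thesis mentions.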
Concretely, this study collects Twitter content using security-related keywords, fine-tunes BERT (Bidirectional Encoder Representations from Transformers), and applies named entity recognition to extract proper nouns such as the vendor, system name, version, and threat of computer software and hardware. The extracted entities are compared against the existing computing environment, and warning messages are sent to users or administrators, achieving real-time detection and alerting.
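The entity-labeling step can be illustrated by decoding BIO tags, the common output format of a fine-tuned BERT token classifier, into entity spans. The tag names below follow the categories listed above (vendor, product, version, threat), but the decoder and the example sentence are hypothetical sketches, not the thesis's implementation:

```python
# Hypothetical sketch: grouping BIO-tagged tokens (as produced by a
# token-classification model) into (entity_type, phrase) pairs.
def decode_bio(tokens, tags):
    """Collect B-/I- tagged runs into (label, phrase) entities; 'O' ends a run."""
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # start of a new entity
            if current:
                entities.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)          # continuation of the same entity
        else:                               # "O" or inconsistent tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(label, " ".join(words)) for label, words in entities]

tokens = ["Apache", "Log4j", "2.14.1", "has", "a", "RCE", "flaw"]
tags   = ["B-vendor", "B-product", "B-version", "O", "O", "B-threat", "O"]
```

Calling `decode_bio(tokens, tags)` on this example yields the four entities (vendor, product, version, threat) that the comparison step consumes.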
The proposed method is also compared with approaches from prior work. The experimental results show that the BERT model used in this study outperforms the CNN+BiLSTM machine learning methods proposed by several researchers: Precision, Recall, and F1 score all exceed 96%, and the model can correctly identify, from context, words that do not appear in the training set, achieving correct labeling and timely warning.
Continuously promoting awareness of cybersecurity threats and establishing preventive mechanisms are important measures for ensuring an enterprise's cybersecurity.
Cybersecurity experts in the enterprise must be able to sense the newest vulnerabilities and threats in its computing environment. This identification and collection process depends on the range of sources the experts have and on the efficiency of the personnel, and the data is received passively and slowly.
With the development of social media and open-source intelligence, platforms such as Twitter bring instant updates on cybersecurity to the public, and their immediacy and post volume are expected to make up for the lack of sources and the limits of manual efficiency.
This research aims to notify users and managers so that they can take response measures early, by collecting cybersecurity information on Twitter, identifying software- and hardware-related entities through machine learning, and comparing them with the current computing environment.
Tweets collected with cybersecurity keywords are processed by BERT (Bidirectional Encoder Representations from Transformers) for named entity recognition, identifying the vendor, software, version, and related threat terms; these are compared with the existing environment, and warning messages are sent to users and managers to achieve real-time detection and warning.
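The comparison-and-warning step can be sketched as matching extracted (vendor, product, version) entities against a software inventory. The inventory contents, function name, and message format below are illustrative assumptions, not the system's actual interface:

```python
# Hypothetical sketch of the comparison-and-alert step: an extracted
# (vendor, product, version) entity triple is matched against an assumed
# inventory of deployed software; all data here is illustrative only.
INVENTORY = {("apache", "log4j"): "2.14.1"}  # assumed enterprise environment

def build_alert(vendor: str, product: str, version: str):
    """Return a warning message if the mentioned product/version is deployed."""
    installed = INVENTORY.get((vendor.lower(), product.lower()))
    if installed == version:
        return f"ALERT: {vendor} {product} {version} is deployed and was mentioned in a threat post"
    return None  # product not deployed, or a different version is installed

msg = build_alert("Apache", "Log4j", "2.14.1")        # matches the inventory
no_msg = build_alert("Apache", "Log4j", "2.17.0")     # different version, no alert
```

A real deployment would key the inventory on normalized product identifiers and deliver the message through whatever notification channel the managers use.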
In the experiments, the F1 score reaches 0.96, superior to CNN+BiLSTM, and BERT can correctly identify words that are not in the training set from their context, achieving accurate labeling and immediate warning.
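Token-level precision, recall, and F1 over the non-"O" tags, the metrics reported above, can be sketched as follows; the gold and predicted tag sequences are illustrative, not the thesis's evaluation data:

```python
# Sketch of token-level precision/recall/F1 for NER tags, ignoring "O".
# The gold and predicted sequences below are illustrative assumptions.
def prf(gold, pred):
    tp = sum(g == p != "O" for g, p in zip(gold, pred))          # correct entity tags
    fp = sum(p != "O" and g != p for g, p in zip(gold, pred))    # spurious predictions
    fn = sum(g != "O" and g != p for g, p in zip(gold, pred))    # missed entity tags
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["B-vendor", "O", "B-threat", "O"]
pred = ["B-vendor", "O", "O", "O"]
p, r, f = prf(gold, pred)  # p = 1.0, r = 0.5
```

NER evaluations often score at the entity-span level rather than per token; this token-level version is the simpler variant, shown only to make the three metrics concrete.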