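Graduate student: 廖盈傑 (Ying-Jie Liao)
Thesis title: 高效率e-mail作者驗證演算法之研究 (An Efficient Algorithm For e-mail Authorship Verification)
Advisor: 許秉瑜 (Ping-Yu Hsu)
Oral examination committee:
Degree: Master
Department: College of Management - Department of Business Administration
Academic year of graduation: 97 (ROC calendar, i.e. 2008-2009)
Language: Chinese
Number of pages: 72
Keywords (Chinese): 電子郵件, 資料探勘, 作者鑑定, n-grams
Keywords (English): e-mail, data mining, Authorship Identification, n-grams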
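  • E-mail is today a primary medium for exchanging messages, but its convenience has also created many security problems: cybercrimes such as account hijacking and data theft occur again and again. A method that can effectively determine whether the source of an e-mail is trustworthy is therefore urgently needed.
    Authorship identification is a technique that proposes the most likely author of a text from the style features of its writing. Applied to e-mail, it can tell whether a suspicious message was written by the original author by examining the message's style features. However, few studies currently address e-mail authorship identification, and the methods proposed so far all suffer from low efficiency; some even work only in specific situations and therefore have poor applicability.
    The UserProtector algorithm proposed in this study aims at both high efficiency and high applicability. Whereas the methods proposed in other studies must scan the entire e-mail, UserProtector needs to scan only the subject line, which makes it efficient. In addition, because it extracts style features with a method derived from character n-grams, it can extract them effectively in a wide range of situations, which gives it high applicability.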


    Nowadays, people use e-mail as their main medium for exchanging messages. However, while e-mail is convenient, it also raises many security problems: Internet crimes such as account hijacking and data theft keep increasing. A method that can effectively verify the source of an e-mail is therefore urgently needed.
    Authorship identification uses the style features of a text to propose its most likely author. Applied to e-mail, it can determine whether a suspicious message came from the original writer by examining its style features. However, few studies focus on identifying e-mail authors, and the existing methods all suffer from low efficiency. Moreover, some of them can be used only in specific circumstances and therefore have poor applicability.
    To achieve both high efficiency and high applicability, this research proposes an algorithm called UserProtector. Whereas other methods need to scan the entire content of an e-mail, UserProtector scans only the e-mail subject, which makes it efficient. Further, because it extracts style features with a method derived from character n-grams, it can extract them effectively in every kind of circumstance, which gives it high applicability.
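    The abstract does not spell out how UserProtector works, so the following is only a minimal sketch of the general idea it names: extracting character n-grams of several lengths from e-mail subject lines and comparing a new subject against an author's profile. The function names, the choice of n = 2, 3, 4, and the overlap score are illustrative assumptions; the thesis's own weighting, filtering of common idioms, and threshold selection are not reproduced here.

```python
from collections import Counter


def char_ngrams(text, n_values=(2, 3, 4)):
    """Count overlapping character n-grams for every length in n_values."""
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def build_profile(subjects, n_values=(2, 3, 4)):
    """Aggregate n-gram counts over known subject lines into relative frequencies.

    Hypothetical helper: a stand-in for a per-author style features set.
    """
    total = Counter()
    for subject in subjects:
        total.update(char_ngrams(subject.lower(), n_values))
    size = sum(total.values())
    return {gram: count / size for gram, count in total.items()} if size else {}


def overlap_score(profile, subject, n_values=(2, 3, 4)):
    """Fraction of the subject's n-gram occurrences that also occur in the profile.

    Assumed similarity measure for illustration only, not the thesis's scoring rule.
    """
    grams = char_ngrams(subject.lower(), n_values)
    size = sum(grams.values())
    if size == 0:
        return 0.0
    hits = sum(count for gram, count in grams.items() if gram in profile)
    return hits / size


if __name__ == "__main__":
    known = ["Weekly report for project A", "Weekly report for project B"]
    profile = build_profile(known)
    print(overlap_score(profile, "Weekly report for project C"))
    print(overlap_score(profile, "URGENT!!! claim your prize"))
```

    Because only the short subject line is processed, the cost per message stays small, which is the efficiency argument the abstract makes. A real verifier would compare the score against a threshold learned from training data.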

    Table of Contents
    Chinese Abstract
    English Abstract
    Table of Contents
    List of Figures
    List of Tables
    1. Introduction
        1.1 Research Motivation
        1.2 Research Objectives
        1.3 Thesis Organization
    2. Literature Review
        2.1 Authorship Identification
        2.2 Character n-grams
        2.3 E-mail Authorship Identification
        2.4 Summary
    3. Algorithm
        3.1 The UserProtector Algorithm
            3.1.1 Training method
            3.1.2 Real time fraud detection method
        3.2 Extracting multiple n-grams (lines 1-6)
        3.3 Filtering common idioms (lines 15-34)
        3.4 Obtaining the style features set SF_i (lines 7-14)
        3.5 Determining the weight of each n-gram (lines 35-44)
        3.6 Obtaining the threshold (lines 45-57)
        3.7 Real time fraud detection method (lines 58-69)
    4. Empirical Analysis
        4.1 Experimental Design
            4.1.1 Experiment 1
            4.1.2 Experiment 2
        4.2 Experimental Results and Analysis
            4.2.1 Experiment 1
                4.2.1.1 Accuracy of identifying the original author (1-α)
                4.2.1.2 Accuracy of identifying suspicious e-mails (1-β)
                4.2.1.3 Relationship between α and β
            4.2.2 Experiment 2
    5. Conclusions and Suggestions for Future Research
        5.1 Conclusions
        5.2 Suggestions for Future Research
    References

