Graduate student: 張順誠
Shun-cheng Chang
Thesis title: 運用改良式BM25演算法於程式碼搜尋之查詢擴展技術
Using New BM25 Algorithm for Query Expansion in Code Search
Advisor: 林熙禎
Shi-jen Lin
Degree: Master
Department: College of Management - Department of Information Management
Academic year of graduation: 98 (ROC calendar)
Language: Chinese
Number of pages: 100
Chinese keywords: 程式碼搜尋, 查詢擴展, BM25
English keywords: Query Expansion, Code Search, BM25
    Abstract

    Past research on code search has mostly used the traditional TFIDF formula to compute term weights. Source code, however, differs from ordinary documents in that it is highly structured. This study builds on Query Expansion, the most widely used technique in the field of Information Retrieval, and on the Blind Feedback technique that accompanies it. It improves the BM25 (Best Match 25) formula from the perspective of the Unified Modeling Language (UML) and combines the result with a Dynamic Expansion Term Algorithm to develop a query expansion technique that can be applied to code search.
    The contributions of this study are the following four points: (1) It applies query expansion to code search in order to address the problem of search bias for conceptual query terms. (2) It defines the relationships between classes in source code from the viewpoint of UML class diagrams and proposes an improved BM25 algorithm; the approach can be applied to any object-oriented programming language. (3) Based on the definition of the F-measure used in retrieval system evaluation, it proposes a Dynamic Expansion Term Algorithm that determines, for each query, how many expansion terms should be added. (4) The code search system developed in this study matches the way novice developers think and effectively improves system development efficiency.
    The experimental results show that the system raises precision from about 30% to about 88%, which confirms that the proposed method makes it easier for novice developers to find relevant code.
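As a rough illustration of the techniques named in the abstract, the following Python sketch scores tokenized documents with the standard Okapi BM25 formula and then applies blind feedback: it assumes the top-ranked documents are relevant and appends their strongest terms to the query. It does not reproduce the thesis's UML-based modification of BM25 or its Dynamic Expansion Term Algorithm, neither of which is specified in the abstract. The function names (`bm25_scores`, `expand_query`), the parameter values (k1 = 1.2, b = 0.75, two feedback documents, two expansion terms), the tf-times-idf weight used to pick expansion terms, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def doc_freq(docs):
    """Count, for every term, the number of documents that contain it."""
    return Counter(term for doc in docs for term in set(doc))


def idf(term, df, n_docs):
    """Smoothed BM25 inverse document frequency (always positive)."""
    return math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with standard BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    df = doc_freq(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        norm = k1 * (1 - b + b * len(doc) / avg_len)  # length normalization
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            score += idf(term, df, n_docs) * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


def expand_query(query, docs, top_docs=2, n_terms=2):
    """Blind feedback: treat the top-ranked documents as relevant and
    append their highest-weighted terms (assumed weight: tf * idf)."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    df = doc_freq(docs)
    weights = Counter()
    for i in ranked[:top_docs]:
        for term, freq in Counter(docs[i]).items():
            if term not in query:
                weights[term] += freq * idf(term, df, len(docs))
    return list(query) + [term for term, _ in weights.most_common(n_terms)]


# Hypothetical toy corpus: each "document" is a list of code identifiers.
docs = [
    ["file", "reader", "read", "line", "buffered", "stream"],
    ["file", "reader", "read", "stream", "close"],
    ["file", "writer", "write", "flush"],
    ["socket", "connect", "send", "receive"],
    ["file", "path", "exists"],
]
print(expand_query(["reader"], docs))
```

On this toy corpus the two feedback documents are the ones that contain `reader`, and the terms appended to the query are `read` and `stream` (in either order), because they occur in both feedback documents but in no others. A fixed `n_terms` is the simplest choice; the thesis instead describes choosing the number of expansion terms per query.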

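    Table of contents

    Chinese Abstract
    Abstract
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    List of Equations
    Chapter 1 Introduction
        1.1 Research Background
        1.2 Research Motivation
        1.3 Research Objectives
        1.4 Research Method
        1.5 Thesis Organization
    Chapter 2 Literature Review
        2.1 Code Search Engines
        2.2 Code Parsing
        2.3 Three Relationships Between UML Class Diagrams and Code
            2.3.1 Generalization
            2.3.2 Association
            2.3.3 Dependency
            2.3.4 Principles for Weighting the Three Relationships
        2.4 Code Weight Calculation
            2.4.1 Index Term Weight Calculation for Code
            2.4.2 TFIDF Weighting Formula
            2.4.3 BM25 Weighting Formula
        2.5 Query Expansion Techniques
        2.6 Number of Expansion Terms
        2.7 Summary
    Chapter 3 System Design and Architecture
        3.1 System Architecture
        3.2 Code Download
        3.3 Code Parsing
        3.4 Expansion Term Weight Calculation
        3.5 Expansion Term Reduction
    Chapter 4 Experimental Results and Discussion
        4.1 System Implementation and Case Description
        4.2 Evaluation Criteria
        4.3 Experimental Design
        4.4 Experimental Results
            4.4.1 Experiment 1: Google Code Search Baseline Query
            4.4.2 Experiment 2: Query Expansion with the Improved TFIDF Algorithm
            4.4.3 Experiment 3: Query Expansion with the Improved BM25 Algorithm
            4.4.4 Experiment 4: Performance Comparison of the Google Code Search Baseline Query, the Improved TFIDF Algorithm, and the Improved BM25 Algorithm
            4.4.6 Summary
        4.5 Comparison with Related Work
    Chapter 5 Conclusions and Future Research Directions
        5.1 Conclusions
        5.2 Future Research Directions
    References
        Chinese Sources
        English Sources
        Web Resources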

