
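Graduate student: 廖振傑 (Jhen-jie Liao)
Thesis title: 藉由資料探勘的排序方式提昇程式碼搜尋品質─以Koders為例
Using Data Mining Technology to Refine Koders Code Search Results
Advisor: 林熙禎 (Shi-jen Lin)
Oral defense committee:
Degree: Master
Department: College of Management - Department of Information Management
Academic year of graduation: 97 (ROC calendar)
Language: Chinese
Number of pages: 54
Keywords (translated from Chinese): Data Mining, Open Source Code, Code Search Engine, Hierarchical Algorithm, Cluster Analysis
Keywords (English): Cluster Analysis, Code Search Engine, Open Source Code, Data Mining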
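  • As open source software becomes more widespread and multiplies, more and more open source code can be obtained from the Internet. This has given rise to a new Internet service: code search. Code search engines give developers a convenient channel for quickly using the Application Programming Interfaces (APIs) provided by existing classes or frameworks, thereby improving software productivity. However, the code search results obtained from the Internet often fail to meet developers' needs, mainly because many similar or unrelated files appear in the results, so developers cannot quickly obtain useful code.
    This study therefore proposes a system architecture for improving a search engine. A custom-written web scraping program stores Koders search results in a database. The data are then cleaned through preprocessing steps defined in this study, which consider not only keyword search but also the structural characteristics of programs. A hierarchical data mining algorithm then clusters and re-ranks the results, and each cluster is given a new label, in the hope that the search results will better match users' needs.
    Finally, this study uses a case to explain whether the proposed system architecture can effectively improve the search results, and compares and analyzes it against related academic research.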


    With the growing popularity of open source software, more and more source code can be downloaded over the Internet. This has given rise to a new Internet service: the code search engine. A code search engine gives developers a convenient way to reuse existing Application Programming Interfaces (APIs) and thereby improve software productivity. However, the results returned by code search engines often fail to satisfy developers' needs, because many similar or unrelated files appear in the results and prevent developers from finding useful code quickly.
    Therefore, we propose a system architecture to improve an existing search engine. First, we develop a web program to extract Koders search results and store them in a local repository. Second, in the data preprocessing stage, we define a rule to filter out unrelated files and parse the remaining files into a database format. Third, we apply data mining algorithms to cluster and re-rank the Koders search results. Fourth, we assign a tag to each cluster to identify it, in the expectation that the search results will better satisfy developers' needs.
    Finally, we use a case study to examine whether the proposed system architecture can effectively help developers find useful source code, and we compare it with related prior research.
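
The cluster, re-rank, and tag steps described in the abstract can be illustrated with a small sketch. The abstract does not state the thesis's feature representation, distance measure, linkage rule, ranking rule, or labeling rule, so everything below is an assumption made for illustration: snippets are reduced to sets of identifier tokens, similarity is Jaccard, merging uses average linkage with a hypothetical `threshold`, clusters are ranked by size, and each cluster is tagged with its most shared token. The function names (`tokens`, `jaccard`, `label`, `regroup`) are likewise hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def tokens(code):
    """Return the set of lowercase identifier-like tokens in a snippet."""
    return set(re.findall(r"[a-z_]\w*", code.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 means identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def label(ids, sets):
    """Assumed tagging rule: the token shared by most snippets in the
    cluster, preferring tokens that are rarer across all snippets."""
    inside = Counter(t for i in ids for t in sets[i])
    overall = Counter(t for s in sets for t in s)
    return min(inside, key=lambda t: (-inside[t], overall[t], t), default="")


def regroup(snippets, threshold=0.5):
    """Agglomerative (bottom-up hierarchical) clustering of snippets.

    Starts with one cluster per snippet and repeatedly merges the pair
    with the highest average-linkage similarity until no pair reaches
    `threshold`. Returns (label, snippet indices) pairs, largest first.
    """
    sets = [tokens(s) for s in snippets]
    clusters = [[i] for i in range(len(sets))]

    def linkage(c1, c2):
        # Average pairwise similarity between members of two clusters.
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(jaccard(sets[i], sets[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        x, y = max(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[x], clusters[y]) < threshold:
            break
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]  # safe because x < y

    # Assumed ranking rule: larger clusters first, ties by earliest result.
    ranked = sorted(clusters, key=lambda c: (-len(c), c[0]))
    return [(label(c, sets), c) for c in ranked]


if __name__ == "__main__":
    results = [
        "FileReader reader = new FileReader(path); reader.read();",
        "FileReader reader = new FileReader(file); reader.read(); reader.close();",
        "Socket socket = new Socket(host, port); socket.connect();",
    ]
    for tag, members in regroup(results):
        print(tag, members)
```

Stopping at a similarity threshold, rather than at a fixed number of clusters, lets near-duplicate results collapse into one group while unrelated results stay separate. A real system would likely use richer structural features than bare tokens.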

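    Abstract (Chinese)
    Abstract
    Acknowledgments
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Research Background
      1.2 Research Motivation
      1.3 Research Objectives
      1.4 Research Method
      1.5 Thesis Organization
    Chapter 2 Literature Review
      2.1 Introduction to Open Source
      2.2 Code Matching
      2.3 Code Ranking
      2.4 Data Label Identification
      2.5 Data Mining and Cluster Analysis
        2.5.1 Introduction to Data Mining
        2.5.2 Cluster Analysis
      2.6 Summary
    Chapter 3 System Design and Architecture
      3.1 System Architecture
      3.2 Code Search Engine
      3.3 Data Preprocessing
      3.4 Code Extraction
      3.5 Data Mining and Ranking
    Chapter 4 Experimental Results and Discussion
      4.1 System Implementation and Case Description
      4.2 Algorithm Performance Evaluation
      4.3 System Performance Evaluation
      4.4 Comparison with Related Search Engines
      4.5 Comparison with Related Research
    Chapter 5 Conclusions and Future Research Directions
      5.1 Conclusions
      5.2 Future Research Directions
    References
      Chinese References
      English References
      Web Resources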

