Extracting protein/gene names from the biological literatures | National Central University Master's and Doctoral Thesis System


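Graduate student:	Yu-Chang Cheng (鄭煜璋)
Thesis title:	Extracting protein/gene names from the biological literatures (從生物文件中萃取出蛋白質或基因之名稱)
Advisor:	Chin-Wen Ho (何錦文)
Oral examination committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Academic year of graduation:	93 (ROC calendar)
Language:	English
Number of pages:	46
Chinese keywords:	自然語言處理 (Natural Language Processing), 文件探勘 (Text Mining)
Foreign-language keywords:	Biomedical Named Entity Extraction, Natural Language Processing, Text Mining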

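Biotechnology has advanced steadily in recent years, and large-scale experiments now produce very large volumes of data and documents. Extracting useful information from these documents, which are written in natural language (such as English), so that it can be analyzed further, is becoming increasingly important.
Whether the goal is to understand the interactions among the components of an organism or to annotate biological substances, the first step is to have a computer recognize the names of the substances of interest in the text. This study identifies all protein names in biological documents. We propose a system for recognizing protein or gene names. The system relies mainly on hand-crafted rules, supplemented by a machine learning mechanism that improves its performance. On Yapex, a well-known corpus in this research field, the system achieves an F-score of 73.8%.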

New high-throughput technologies have accelerated the accumulation of data about genes and proteins. Much of this data, however, is stored as natural language text. Because of the tremendous volume of literature, it is becoming harder for researchers to process and integrate the data into more complete and useful knowledge. Automatic literature mining has therefore become increasingly important in recent years.
The first step in extracting knowledge from natural language text is to extract the named entities; the relations between them can then be constructed. Here we propose a new system that extracts named entities, specifically those referring to proteins or genes, from biological literature such as MEDLINE abstracts. The system is mainly rule-based and is combined with an SVM machine learning module to improve performance. It achieves an F-score of 73.8% on the Yapex corpus.
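
For readers unfamiliar with the metric, the following is a minimal sketch of how an F-score is commonly computed from counts of correctly and incorrectly tagged names. It assumes the balanced form (beta = 1), the harmonic mean of precision and recall; the abstract does not state which matching criterion or beta value the thesis uses, and the function name `f_score` and the counts in the usage lines are purely illustrative, not figures from the thesis.

```python
def f_score(true_positives: int, false_positives: int,
            false_negatives: int, beta: float = 1.0) -> float:
    """Return the F-beta score computed from entity-level counts.

    Hypothetical helper for illustration only; not taken from the thesis.
    """
    if true_positives == 0:
        # With no correct predictions, precision and recall are both 0
        # (or undefined), so the score is reported as 0.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Illustrative counts only: 6 names tagged correctly, 2 spurious, 4 missed.
example = f_score(true_positives=6, false_positives=2, false_negatives=4)
print(f"F-score: {example:.3f}")
```

With these made-up counts, precision is 0.75 and recall is 0.60, so the balanced F-score falls between the two, closer to the lower value.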

List of Figures	II
List of Tables	III
Chapter 1. Introduction	1
1 Motivation	1
2 Research Goal	3
Chapter 2. Related Work	5
1 Dictionary-based methods	5
2 Rule-based methods	6
3 Machine learning methods	7
4 Corpora	9
5 Results of early works	10
Chapter 3. Methods	12
1 System overview	12
2 Tokenization and POS tagging	16
3 Token selector	17
3.1 Selection rules	18
3.2 Filtering rules	22
3.3 SVM module	24
4 Extending module	29
4.1 Left extending	29
4.2 Right extending	30
5 Post filter	32
6 Abbreviation recovery	33
Chapter 4. Results	36
1 Evaluation Criterion	36
2 Results in Yapex Corpus	37
Chapter 5. Conclusion	40
1 Discussion	40
2 Future works	42
Reference	44

                                

