

Graduate Student: 黃珏倪 (Jue-Ni Huang)
Thesis Title: Disease NER of Medical Records and Analysis of Transfer Learning of Medical Records between Different Hospital Departments
Advisor: 蔡宗翰 (Tzong-Han Tsai)
Oral Defense Committee:
Degree: Master
Department: College of Information and Electrical Engineering - Department of Computer Science & Information Engineering
Year of Publication: 2020
Graduation Academic Year: 108
Language: English
Number of Pages: 49
Keywords: Biomedical text mining, Machine learning, Natural language processing, Transfer learning, Disease named entity recognition
    With the rapid development of natural language processing (NLP) technologies, considerable progress has also been made in their cross-domain applications. Biomedical text mining is one of the important goals of biomedical research, and as record keeping has shifted from paper to electronic form, more resources have become available for biomedical text mining research. We take hospital medical records as our research direction, with transfer learning of medical records between different hospital departments as the main objective. To achieve this goal, we apply Named Entity Recognition (NER) as used in the biomedical domain to predict the disease names appearing in medical records, which can substantially assist medical staff in consolidating and recording diagnoses. Previous studies fall roughly into two main approaches: rule-based biomedical text NER and dictionary-based NER. Both share the drawback of textual ambiguity: they cannot properly resolve semantic distinctions.
    To address this problem, we adopt a machine learning method; BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is one of the most important techniques in biomedical natural language processing. In our experiments, we perform text mining on medical records at the level of hospital departments, analyzing both the effect of transferring a model trained on one department's records to other departments and the textual differences between departments.


    With the rapid development of natural language processing (NLP), there has been considerable progress in its cross-domain applications. Biomedical text mining is one of the most important goals of biomedical research, and the move from paper toward electronic records provides more resources for it. We use hospital medical records as the research data source, and the primary objective is to apply transfer learning of medical records between different hospital departments. To achieve this goal, we use Named Entity Recognition (NER), a technique used in the biomedical field to predict the disease names in a patient's record, to help medical experts consolidate diagnoses. Past studies fall roughly into two main approaches: rule-based biomedical text NER and dictionary-based NER. However, their common disadvantage is textual ambiguity: neither can properly resolve semantic distinctions.
    BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is one of the most important technologies in biomedical natural language processing, and we use this machine learning approach to address word ambiguity. In our experiments, we apply text mining to medical records organized by hospital department, in order to analyze the effect of transferring a model trained on one department's records to other departments, as well as the textual differences between departments.
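    Disease NER systems like the one described above are typically evaluated at the entity-span level over BIO-tagged token sequences. The sketch below illustrates that evaluation; the tag scheme is standard BIO, but the example sentence and its tags are hypothetical and not taken from the thesis data:

    ```python
    # Minimal sketch of span-level NER evaluation under the BIO tagging
    # scheme. Gold and predicted tag sequences are compared on exact
    # entity-span matches; the example data below is invented.

    def extract_spans(tags):
        """Collect (start, end, type) entity spans from a BIO tag sequence."""
        spans, start = [], None
        for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
            if start is not None and not tag.startswith("I-"):
                spans.append((start, i, tags[start][2:]))
                start = None
            if tag.startswith("B-"):
                start = i
        return spans

    def f1_score(gold_tags, pred_tags):
        """Micro precision, recall, and F1 over exact entity-span matches."""
        gold = set(extract_spans(gold_tags))
        pred = set(extract_spans(pred_tags))
        tp = len(gold & pred)
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # "Patient denies chest pain but reports type 2 diabetes ." (hypothetical)
    gold = ["O", "O", "B-Disease", "I-Disease", "O", "O",
            "B-Disease", "I-Disease", "I-Disease", "O"]
    pred = ["O", "O", "B-Disease", "I-Disease", "O", "O",
            "B-Disease", "O", "O", "O"]
    print(f1_score(gold, pred))  # → (0.5, 0.5, 0.5)
    ```

    Note that a partially recovered entity ("type" without "2 diabetes") counts as a full miss under exact matching, which is why span-level F1 is a stricter measure than per-token accuracy.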

    Chinese Abstract
    English Abstract
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Motivation
        1.2 Problem Description
    Chapter 2 Related Work
        2.1 Named Entity Recognition
    Chapter 3 Methodology
        3.1 Data Annotation
            3.1.1 Data Annotation for Machine Learning
            3.1.2 Entity Annotation
            3.1.3 Sensitivity and Specificity
        3.2 System Flow
    Chapter 4 Experiment
        4.1 Datasets
        4.2 Experimental Settings and Results
    Chapter 5 Discussion
    Chapter 6 Conclusion
    Reference

