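Graduate student: Tsung-Pai Chen (陳聰百)
Thesis title: Analysis of high-resolution protein mass spectra based on peak feature selection (利用峰點特徵值來分析高解析度蛋白質質譜資料)
Advisors: Kuang-Den Chen (陳廣典), Jorng-Tzong Horng (洪炯宗)
Oral defense committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Executive Master of Computer Science & Information Engineering
Graduation academic year: 94 (ROC calendar)
Language: English
Pages: 48
Keywords (Chinese, translated): spectra alignment, peak detection, mass spectrometer, classification prediction, baseline correction
Keywords (English): feature selection, SELDI-TOF, MALDI-TOF, classification, peak detection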
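  • Surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) and matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry are techniques currently used to identify biomarkers. This thesis uses a SELDI-TOF ovarian cancer dataset from the National Cancer Institute, USA, and a MALDI-TOF oral cancer dataset from Chang Gung University. In both datasets the samples are divided into a control group and a cancer patient group. Our goal is to reduce the high dimensionality of the mass spectra and extract meaningful peak features from them. Feature extraction uses methods such as baseline correction, peak detection, and spectra alignment. Feature selection uses the Kolmogorov-Smirnov test (KS test), logistic regression, and random forest. Once the discriminatory features have been selected, three classification methods are applied to classify the datasets.
    We selected the 50 and the 100 most discriminatory peak features and ran 1000 repetitions of randomized 10-fold cross-validation. Regression tree with bagging, k-nearest neighbor, and support vector machine (SVM) classifiers all achieved good sensitivity, specificity, accuracy, and precision. We also developed a mass-spectra correlation query system to identify peaks that are highly correlated within the cancer and non-cancer groups. The analysis pipeline proposed here yields a relatively small peak-feature dataset that is discriminatory enough for classification and correlation analysis.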
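The evaluation described above (10-fold cross-validation scored by sensitivity, specificity, accuracy, and precision) can be sketched in plain Python. This is a minimal illustration, not the thesis's R implementation: the function names `proportional_folds` and `classification_metrics` are hypothetical, labels are assumed to be coded 1 for cancer and 0 for control, and "proportional" is assumed to mean that each fold keeps roughly the same class ratio as the full dataset.

```python
import random


def proportional_folds(labels, k=10, seed=0):
    """Split sample indices into k folds that preserve the class proportions.

    Assumption: "proportional" cross-validation means stratified folds.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Deal each class round-robin across the folds after shuffling it.
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, idx in enumerate(indices):
            folds[(offset + j) % k].append(idx)
        offset += len(indices)
    return folds


def classification_metrics(y_true, y_pred, positive=1):
    """Compute the four measures from the binary confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),    # true positive rate
        "specificity": ratio(tn, tn + fp),    # true negative rate
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),      # positive predictive value
    }
```

Repeating the split with a different `seed` for each of the 1000 replications and averaging the four measures over the folds would mirror the repeated cross-validation described in the abstract.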


    SELDI-TOF and MALDI-TOF mass spectrometry are techniques currently used to identify biomarkers for cancers. Our work focuses on an ovarian cancer dataset generated with the SELDI-TOF technique by the National Cancer Institute, USA. A second study set is an oral cancer dataset generated with the MALDI-TOF technique by the Proteomics Center of Chang Gung University, Taiwan. The aim of this work is to reduce the high dimensionality of the mass spectra and extract the significant peak features for further study. Baseline subtraction, peak detection, spectra alignment, and normalization are used for feature extraction. The Kolmogorov-Smirnov test, logistic regression, and random forest are used for feature selection. After the discriminatory peak features were selected, three methods were applied to classify the two classes of the ovarian cancer dataset. The 50 and the 100 most discriminatory peak features were each used for classification, with 1000 replications of 10-fold proportional cross-validation run independently. The regression tree with bagging, k-nearest neighbor, and SVM classifiers all yielded good accuracy, precision, sensitivity, and specificity. We also developed a correlation-based query system to identify highly correlated peaks in the cancer and non-cancer groups. The analysis pipeline we propose provides a relatively small peak-feature set that is discriminatory enough for classification and correlation-based studies.
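As an illustration of the feature-selection step, the sketch below ranks peak features by the two-sample Kolmogorov-Smirnov statistic and keeps the top N. It covers only the KS criterion, not logistic regression or random forest, and it is a sketch under assumptions rather than the thesis's R code: the names `ks_statistic` and `rank_peaks` are hypothetical, `spectra` is assumed to be a list of samples with one intensity per detected peak, and labels are assumed to be 1 for cancer and 0 for control.

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare the CDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def rank_peaks(spectra, labels, top_n):
    """Return the indices of the top_n peaks with the largest KS statistic."""
    cancer = [row for row, y in zip(spectra, labels) if y == 1]
    control = [row for row, y in zip(spectra, labels) if y == 0]
    n_peaks = len(spectra[0])
    scores = [
        ks_statistic([row[p] for row in cancer], [row[p] for row in control])
        for p in range(n_peaks)
    ]
    # Larger statistic = intensity distributions differ more between groups.
    order = sorted(range(n_peaks), key=lambda p: scores[p], reverse=True)
    return order[:top_n]
```

Calling `rank_peaks(spectra, labels, 50)` or `rank_peaks(spectra, labels, 100)` would give feature subsets of the sizes mentioned in the abstract. To avoid optimistic bias, the ranking would normally be recomputed on the training folds only.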

    CHAPTER 1 INTRODUCTION
        1.1 BACKGROUND
        1.2 MOTIVATION
        1.3 GOAL
    CHAPTER 2 RELATED WORKS
        2.1 MASS SPECTROMETRY
        2.2 LOGISTIC REGRESSION IN R
        2.3 REGRESSION TREE WITH BAGGING IN R
        2.4 SUPPORT VECTOR MACHINE IN R
        2.5 K-NEAREST-NEIGHBOR CLASSIFICATION IN R
        2.6 RANDOM FOREST IN R
        2.7 LITERATURE REVIEWS
            2.7.1 Data preprocessing and classification
            2.7.2 Correlation study
    CHAPTER 3 MATERIALS AND METHODS
        3.1 MATERIAL
        3.2 METHODS
            3.2.1 Preprocessing for feature extraction
            3.2.2 Feature selection
            3.2.3 Classification of mass spectra
            3.2.4 Correlation associated peak-feature networks
        3.3 SOFTWARE
    CHAPTER 4 RESULT
        4.1 N-FOLD PROPORTIONAL CROSS-VALIDATION
        4.2 RESULTS COMPARISON AND HEAT MAP OF NCI DATA
        4.3 TEN-FOLD PROPORTIONAL CROSS-VALIDATION
        4.4 CORRELATION QUERY SYSTEM
    CHAPTER 5 DISCUSSION AND CONCLUSION
    REFERENCE
    APPENDIX A-1
    APPENDIX A-2
    APPENDIX B
    APPENDIX C
    APPENDIX D
    APPENDIX E

