| Graduate student: | 吳行展 Sing-Jhan Wu |
|---|---|
| Thesis title: | 以支持向量機鑑別原核生物之嗜寒、中溫、嗜熱、及超嗜熱蛋白質 Discrimination of psychrophilic, mesophilic, thermophilic, and hyperthermophilic proteins in prokaryotes using Support Vector Machine |
| Advisors: | 黃雪莉 Shir-Ly Huang, 洪炯宗 Jorng-Tzong Horng |
| Committee members: | |
| Degree: | 碩士 Master |
| Department: | 生醫理工學院 - 系統生物與生物資訊研究所 Graduate Institute of Systems Biology and Bioinformatics |
| Academic year of graduation: | 96 (ROC calendar) |
| Language: | English |
| Number of pages: | 99 |
| Keywords: | Machine learning algorithms, Support vector machine, Protein thermostability, Protein psychrophilicity |
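Protein thermostability is an important topic in both basic science and industrial applications. Many studies have compared the sequences and structures of homologous proteins to identify the factors that matter for thermostability. Previous work has found that many properties, including amino acid composition, hydrophobic interactions, and ionic interactions, are closely related to protein thermostability. Psychrophilic proteins are also important for industrial applications, yet they have been studied far less than thermophilic proteins. This study aims to analyze a range of physicochemical features of proteins, to develop a system that can predict thermophilic and psychrophilic proteins, and to examine how different features relate to the four temperature classes. Using data from the NCBI prokaryotic genome projects, we extracted a large number of proteins together with the associated temperature information, computed their features, applied feature selection algorithms to filter out the factors most relevant to temperature, and then used machine learning methods to build a prediction model with stable performance. We conclude that three forms of amino acid composition (amino acid composition, dipeptide composition, and pseudo amino acid composition) are markedly effective for classifying proteins by temperature.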
The study of protein thermostability plays an important role in both basic and applied research. Most studies on protein thermostability compare the structures or sequences of homologous proteins to identify the factors that affect thermostability. Key properties found to influence protein thermostability include amino acid composition, hydrophobic interactions, and ionic interactions. However, the properties that correlate with protein psychrophilicity are less studied. The purpose of this study is to analyze the properties of selected pools of proteins by developing a method to predict thermostability or psychrophilicity, and to identify the key features involved. We used data from the NCBI prokaryotic genome project to select 86470 proteins together with temperature data (the optimal growth temperatures of the source prokaryotes), calculated protein features, and applied a feature selection algorithm. The factors selected as most strongly related to temperature were amino acid composition, dipeptide composition, and pseudo amino acid composition. A machine learning method was then used to build a robust prediction model of protein thermostability and psychrophilicity. We believe these three types of amino acid composition have a significant effect on protein temperature classification.
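As an illustration of two of the sequence features named above, the following is a minimal sketch of how amino acid composition (20 values) and dipeptide composition (400 values) might be computed from a protein sequence. It is not the thesis code: the function names are hypothetical, non-standard residue letters are simply skipped (an assumption), and pseudo amino acid composition is left out because it depends on physicochemical parameters that the abstract does not specify.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in the same fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Fraction of each standard residue in seq (hypothetical helper).

    Letters outside the 20 standard residues are ignored (assumption).
    """
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each ordered pair of adjacent standard residues in seq."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Keep only pairs where both letters are standard residues (assumption).
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not valid:
        return [0.0] * len(DIPEPTIDES)
    counts = Counter(valid)
    return [counts[dp] / len(valid) for dp in DIPEPTIDES]


def feature_vector(seq):
    """Concatenate both compositions into one 420-dimensional vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)
```

A vector like this, computed per protein and paired with a temperature class label, is the kind of input a support vector machine classifier would take. The feature selection step described in the abstract would sit between the two.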