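| Graduate student: | 魏建豪 Jian-Hao Wei |
|---|---|
| Thesis title: | 基因序列的k字齊普夫子集解析 k-tuple Zipf m-Set analysis on DNA |
| Advisor: | 李弘謙 Hoong-Chien Lee |
| Oral defense committee: | |
| Degree: | Doctor |
| Department: | College of Science - Department of Physics |
| Graduation academic year: | 99 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 123 |
| Keywords (translated from Chinese): | high-frequency words, ranking, frequency of occurrence of words, complete genome sequences, language, Zipf’s law, coding regions, noncoding regions, exons, introns, frequency distribution, power-law distribution |
| Keywords (English): | coding parts, high-frequency words, ranking, k-mers, frequency of occurrence of words, complete genome sequences, noncoding parts, Zipf’s law, natural language, exons, introns, power-law distribution, frequency distribution |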
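Zipf’s law, a widely used statistical method, was applied in 1994 by Mantegna and his collaborators to the relation between the frequency of occurrence of k-letter words (k-mers) in genomic sequences and their ranks (k-tuple Zipf analysis). They emphasized that noncoding regions follow a language-like power law. This conclusion, however, has been widely questioned and debated.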
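We surveyed studies of Zipf distributions in different fields and found that, although what is observed varies from field to field, each individual event has probability 1/N in the random state when the total number of events is N. In a genomic sequence, however, the further p (the fraction of A+T in the sequence) departs from one half, the more the probabilities of individual words differ in the random state. In the non-random state, the unequal probabilities therefore arise from two factors, p and biological features, and this confounds the interpretation of the Zipf distribution.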
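The dependence of random-state word probabilities on p can be illustrated with a minimal Python sketch. It assumes an independent, identically distributed random sequence in which A and T each occur with probability p/2 and C and G each with (1-p)/2; this model and the function names are illustrative assumptions, not taken from the thesis.

```python
from itertools import product


def random_kmer_probability(kmer: str, p: float) -> float:
    """Probability of `kmer` in an i.i.d. random sequence with A+T fraction p.

    Assumption (not from the thesis): A and T each have probability p/2,
    C and G each have probability (1 - p)/2.
    """
    prob = 1.0
    for base in kmer:
        prob *= p / 2 if base in "AT" else (1 - p) / 2
    return prob


def probability_spread(k: int, p: float) -> tuple:
    """Return (smallest, largest) random-state probability over all 4**k k-mers."""
    probs = [
        random_kmer_probability("".join(word), p)
        for word in product("ACGT", repeat=k)
    ]
    return min(probs), max(probs)


if __name__ == "__main__":
    # At p = 0.5 the two values coincide; away from 0.5 they separate.
    for p in (0.5, 0.6, 0.7):
        print(p, probability_spread(2, p))
```

Under this model the probability of a k-mer depends only on how many of its letters are A or T, so the spread between the rarest and the most common word grows as p moves away from 0.5.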
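In this study, using data from genomic sequences with different p and from their corresponding random sequences, we confirm that the k-tuple Zipf m-set (subset) analysis removes the influence of p. This overcomes the difficulty that k-tuple Zipf analysis has in defining the power-law exponent of a random sequence, and establishes the advantage of the subset analysis.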
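As a rough illustration of a subset analysis, the sketch below counts overlapping k-mers and ranks them separately within each subset. It assumes that an "m-set" is the set of k-mers containing exactly m A or T letters; the abstract does not define the term, so this definition and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def count_kmers(seq: str, k: int) -> Counter:
    """Count overlapping k-mers, skipping windows with non-ACGT letters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def at_count(word: str) -> int:
    """Number of A or T letters in a word (the assumed subset index m)."""
    return sum(base in "AT" for base in word)


def ranked_m_sets(seq: str, k: int) -> dict:
    """Group observed k-mers by m and rank each group by descending count.

    Only k-mers that occur at least once are listed; ties are broken
    alphabetically so the ranking is deterministic.
    """
    groups = defaultdict(list)
    for word, n in count_kmers(seq, k).items():
        groups[at_count(word)].append((word, n))
    return {
        m: sorted(items, key=lambda wn: (-wn[1], wn[0]))
        for m, items in groups.items()
    }


if __name__ == "__main__":
    for m, ranking in sorted(ranked_m_sets("AACGAACGTTGCA", 2).items()):
        print(m, ranking)
```

Under the simple random model in which a k-mer's probability depends only on its A+T count, all words within one such subset are equally likely, which is why ranking within a subset would not be biased by p.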
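In addition, we fit four functions (linear, exponential, logarithmic, and power-law) to select the "high-frequency words" (k-mers that occur at high frequency) that suffice to represent the characteristics of a species, and we look for universality in the power-law exponents of the high-frequency words of 865 species. The results show that the exponent of a species is related to its complexity and reflects the evolutionary outcome of gene duplication.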
Zipf’s law characterizes the relation between the frequency of a word in a text and the rank of that word in the frequency table. It states that if the text is in a natural language, the frequency versus rank relation is approximately a power law. For a few years in the mid to late 1990s, Zipf’s law was intensely discussed in the context of genomic sequences, where a word is an oligonucleotide of a given length and a k-nucleotide word is called a k-mer. No clear consensus was reached as to whether, as a general rule, the word frequencies in genomic sequences, or in some specific portion thereof, obey a Zipf’s law. Here we revisit the issue by studying the frequency versus rank relations of a large number of complete genomes, and of parts of genomes having different biological functions. We show that nucleotide composition influences the frequency versus rank relation of a genomic sequence strongly enough to mask whatever Zipf’s-law behavior the sequence may possess. Once this influence is removed, all genomes obey the same broadly defined classes of Zipf’s laws, with the most important class-defining factor being the length of the k-mers, that is, the integer k. For eukaryotes, the Zipf’s laws for the exonic and intronic segments of the genome differ significantly. Based on the observation that the Zipf’s law of a sequence is determined by the subset of k-mers having the highest frequencies of occurrence, we derive a relation between the Zipf’s-law exponent and the high-frequency tail of the frequency distribution, and infer that for genomes in general the high-frequency tail is best represented by an exponential function, as opposed to a linear, logarithmic, or power-law function.
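A frequency versus rank relation of the kind discussed here can be summarized by a single exponent. The sketch below estimates it with an ordinary least-squares line through log(frequency) against log(rank). This is a common simple estimator, used here only for illustration; it is not necessarily the fitting procedure of the thesis, and the function name is hypothetical.

```python
import math


def zipf_exponent(frequencies) -> float:
    """Estimate zeta in f(r) ~ r**(-zeta) from a list of word frequencies.

    Frequencies are sorted in descending order to assign ranks 1, 2, ...,
    zero counts are dropped, and zeta is minus the least-squares slope of
    log(frequency) versus log(rank).
    """
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


if __name__ == "__main__":
    # Synthetic frequencies that follow f = 1000 / r exactly.
    synthetic = [1000 / r for r in range(1, 51)]
    print(zipf_exponent(synthetic))
```

For real k-mer counts the fit range matters: restricting the input to the highest-frequency words, as the abstract suggests, can give a different exponent than fitting all ranks.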
1. Mantegna, R.N., et al., Linguistic Features of Noncoding DNA-Sequences. Physical Review Letters, 1994. 73(23): p. 3169-3172.
2. Mantegna, R.N., et al., Systematic Analysis of Coding and Noncoding DNA-Sequences Using Methods of Statistical Linguistics. Physical Review E, 1995. 52(3): p. 2939-2950.
3. Ramsden, J.J. and J. Vohradsky, Zipf-like behavior in procaryotic protein expression. Physical Review E, 1998. 58(6): p. 7777-7780.
4. Li, W.T., Zipf’s Law in Importance of Genes for Cancer Classification Using Microarray Data. J. Theor. Biol., 2002. 219: p. 539–551.
5. Hernando, A., C. Vesperinas, and A. Plastino, Fisher information and the thermodynamics of scale-invariant systems. Physica A, 2010. 389: p. 490-498.
6. Tan, M.H., et al., Relationship between Zipf dimension and fractal dimension of city-size distribution. Geographical Research, 2004. 23(2): p. 243-248.
7. Gong, X.Q. and Z. Wang, A Note on the Zipf’s Law. Complex Systems and Complexity Science, 2008. 5(3): p. 73-78.
8. Bernat, C.M., et al., Universality of Zipf’s law. Phys. Rev. E, 2010. 82: p. 011102.
9. Yi, L.U., Analysis of forest resource scale using Zipf’s law. Journal of Nanjing Forestry University (Natural Science Edition), 2009. 33(2): p. 73-76.
10. Chen, H.D., The Footprint of Evolution Duplication - Universal Equivalent Length of Genomes, in NCU. 2009.
11. Li, W.T., Zipf’s Law Everywhere. Glottometrics, 2003. 5: p. 14-21.
12. Tsay, M.Y., Information-metrics and Document Properties. 2003, Taipei: Hwa Tai Publishing.
13. Zipf, G.K., Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. 1949, Cambridge, MA: Addison-Wesley.
14. You, R.Y., Zipf’s Law and the Distribution of Chinese Character Frequency. Journal of Chinese Information Processing, 1999. 14(3): p. 60-65.
15. Kosmidis, K., A. Kalampokis, and P. Argyrakis, Language time series analysis. Physica A: Statistical Mechanics and its Applications, 2006. 370(2): p. 808-816.
16. Manning, C.D., et al., Foundations of Statistical Natural Language Processing. 1999, MIT Press.
17. Li, W.T., Random Texts Exhibit Zipf-Law-Like Word-Frequency Distribution. IEEE Transactions on Information Theory, 1992. 38(6): p. 1842-1845.
18. Havlin, S., The Distance between Zipf Plots. Physica A: Statistical Mechanics and its Applications, 1995. 216(1-2): p. 148-150.
19. Cancho, R.F.I. and R.V. Sole, Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences of the United States of America, 2003. 100(3): p. 788-791.
20. Ferrer-i-Cancho, R. and B. Elvevag, Random Texts Do Not Exhibit the Real Zipf’s Law-Like Rank Distribution. PLoS ONE, 2010. 5(3): p. e9411.
21. Bolán, B.C., et al., Statistical properties and linguistic coherence in noncoding DNA sequences. Rev. Mex. Fis. E, 2005. 51(2): p. 118–125.
22. Flam, F., Hints of a Language in Junk DNA. Science, 1994. 266(5189): p. 1320.
23. Konopka, A.K. and C. Martindale, Noncoding DNA, Zipf’s law, and language. Science, 1995. 268(5212): p. 789.
24. Voss, R.F., Linguistic features of noncoding DNA sequences - Comment. Physical Review Letters, 1996. 76(11): p. 1978.
25. Mantegna, R.N., S.V. Buldyrev, and A.L. Goldberger, Mantegna et al. Reply. Phys. Rev. Lett., 1996. 76: p. 1979-1981.
26. Furusawa, C. and K. Kaneko, Zipf’s law in gene expression. Physical Review Letters, 2003. 90(8).
27. Ogasawara, O., S. Kawamoto, and K. Okubo, Zipf’s law and human transcriptomes: an explanation with an evolutionary model. Comptes Rendus Biologies, 2003. 326: p. 1097-1101.
28. Ogasawara, O. and K. Okubo, On Theoretical Models of Gene Expression Evolution with Random Genetic Drift and Natural Selection. PLoS ONE, 2009. 4(11): p. e7943.
29. Powers, M., Applications and Explanations of Zipf’s Law. New Methods in Language Processing and Computational Natural Language Learning, ACL, 1998: p. 151-160.
30. Adamic, L.A. and B.A. Huberman, Zipf’s law and the Internet. Glottometrics, 2002. 3: p. 143-150.
31. Stanley, H.E., et al., Scaling features of noncoding DNA. Physica A: Statistical Mechanics and its Applications, 1999. 273(1-2): p. 1-18.
32. Sellis, D. and Y. Almirantis, Power-laws in the genomic distribution of coding segments in several organisms: An evolutionary trace of segmental duplications, possible paleopolyploidy and gene loss. Gene, 2009. 447(1): p. 18-28.
33. Han, D.D., et al., Nuclear fragmentation may exist in the Zipf law. Chinese Science Bulletin, 2000. 45(9): p. 913-918.
34. Bonhoeffer, S., et al., No signs of hidden language in noncoding DNA. Physical Review Letters, 1996. 76(11): p. 1977.
35. Peng, C.K., et al., Statistical Properties of DNA-Sequences. Physica A: Statistical Mechanics and its Applications, 1995. 221(1-3): p. 180-192.
36. Peng, C.K., et al., Mosaic Organization of DNA Nucleotides. Physical Review E, 1994. 49(2): p. 1685-1689.
37. Peng, C.K., et al., Long-Range Correlations in Nucleotide-Sequences. Nature, 1992. 356(6365): p. 168-170.
38. Peng, C.K., et al., Finite-Size Effects on Long-Range Correlations - Implications for Analyzing DNA-Sequences. Physical Review E, 1993. 47(5): p. 3730-3733.
39. Buldyrev, S.V., et al., Generalized Lévy-walk model for DNA nucleotide sequences. Phys. Rev. E, 1993. 47(6): p. 4514-4523.
40. Azbel’, M.Y., Random Two-Component One-Dimensional Ising Model for Heteropolymer Melting. Phys. Rev. Lett., 1973. 31(9): p. 589-592.
41. Czirok, A., et al., Correlations in Binary Sequences and a Generalized Zipf Analysis. Physical Review E, 1995. 52(1): p. 446-452.
42. Voss, R.F., Evolution of Long-Range Fractal Correlations and 1/F Noise in DNA-Base Sequences. Physical Review Letters, 1992. 68(25): p. 3805-3808.
43. Li, W.T., Expansion-Modification Systems - a Model for Spatial 1/F Spectra. Physical Review A, 1991. 43(10): p. 5240-5260.
44. Li, W.T., Large-Scale Patterns in DNA Texts. Originally prepared for Scientific American, 1999: p. 1-10.
45. Israeloff, N.E., M. Kagalenko, and K. Chan, Can Zipf distinguish language from noise in noncoding DNA? Physical Review Letters, 1996. 76(11): p. 1976.
46. Trotta, E., et al., 1H NMR study of [d(GCGATCGC)]2 and its interaction with minor groove binding 4',6-diamidino-2-phenylindole. Journal of Biological Chemistry, 1993. 268(6): p. 3944-51.
47. National center for biotechnology information genome database.
48. Rice annotation project database.
49. Hedges, S.B., The origin and evolution of model organisms. Nature Reviews Genetics, 2002. 3(11): p. 838-849.
50. Hsieh, L.C., et al., Minimal model for genome evolution and growth. Physical Review Letters, 2003. 90(1).
51. Chen, H.D., et al., Universal Global Imprints of Genome Growth and Evolution – Equivalent Length and Cumulative Mutation Density. PLoS ONE, 2010. 5(4): p. e9844.