
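Graduate student: 魏建豪
Jian-Hao Wei
Thesis title: 基因序列的k字齊普夫子集解析
k-tuple Zipf m-Set analysis on DNA
Advisor: 李弘謙
Hoong-Chien Lee
Oral examination committee:
Degree: Doctor
Department: College of Science - Department of Physics
Academic year of graduation: 99 (ROC calendar)
Language: Chinese
Number of pages: 123
Chinese keywords (translated): high-frequency words, ranking, frequency of occurrence of words, complete genome sequences, language, Zipf’s law, coding regions, noncoding regions, exons, introns, frequency distribution, power-law distribution
English keywords: coding parts, high-frequency words, ranking, k-mers, frequency of occurrence of words, complete genome sequences, noncoding parts, Zipf’s law, natural language, exons, introns, power-law distribution, frequency distribution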
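  • Zipf’s law, a widely used statistical method, was applied in 1994 by Mantegna and his co-workers to analyze the frequency of occurrence of k-tuples in genomic sequences against their rank (k-tuple Zipf analysis). They emphasized that noncoding regions follow a language-like power law. This conclusion, however, has been widely questioned and debated.
    We surveyed the different fields in which Zipf distributions are studied and found that, although the focus of observation varies, when the total number of events is N, each individual event has probability 1/N in the random state. In genomic sequences, however, the further p (the fraction of A+T in the sequence) departs from one half, the more the probabilities of individual k-tuples differ in the random state. In the non-random state, the unequal probabilities therefore arise from two factors, p and biological features, and this confounds the interpretation of the Zipf distribution.
    In this study, we use data from genome sequences with different p and from their corresponding random sequences to show that the k-tuple Zipf m-set analysis removes the influence of p. This overcomes the difficulty that the k-tuple Zipf analysis has in defining the power-law exponent of a random sequence, and establishes the advantage of the m-set analysis.
    In addition, we fit four functions (linear, exponential, logarithmic, and power-law) to select the "high-frequency words" (k-tuples that occur with high frequency) that adequately represent the characteristics of a species, and we attempt to find universality in the high-frequency-word exponents of 865 species. The results show that a species’ exponent is related to its complexity, conveying the evolutionary outcome of gene duplication.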
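The record does not define the m-set, so the following is only a minimal sketch under stated assumptions: an m-set is taken to be the group of k-tuples that contain exactly m bases equal to A or T, and the "relative subset frequency" is taken to be a k-tuple's count divided by the mean count of its m-set. The function names are hypothetical. Under an independent-base random model with A+T fraction p, every k-tuple in the same m-set has the same probability (p/2)^m ((1-p)/2)^(k-m), which is why normalizing within m-sets would remove the dependence on p.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping k-tuples using a sliding window of step 1."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):  # skip windows containing ambiguous bases
            counts[word] += 1
    return counts


def at_count(word):
    """Number of A or T bases in a k-tuple (the assumed m-set index m)."""
    return sum(base in "AT" for base in word)


def random_probability(word, p):
    """Probability of a k-tuple in an i.i.d. random sequence with A+T fraction p."""
    m = at_count(word)
    return (p / 2) ** m * ((1 - p) / 2) ** (len(word) - m)


def relative_subset_frequencies(counts, k):
    """Divide each count by the mean count of its m-set (assumed definition)."""
    groups = {}
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        groups.setdefault(at_count(word), []).append(word)
    rel = {}
    for words in groups.values():
        mean = sum(counts[w] for w in words) / len(words)
        for w in words:
            rel[w] = counts[w] / mean if mean > 0 else 0.0
    return rel


def zipf_ranking(freqs):
    """Return (rank, frequency) pairs sorted from most to least frequent."""
    ordered = sorted(freqs.values(), reverse=True)
    return list(enumerate(ordered, start=1))
```

For k = 3, `random_probability` takes only k + 1 = 4 distinct values across the 64 possible 3-tuples, one per m-set. After the within-set normalization, a random sequence is expected to give values close to 1 for every k-tuple regardless of p.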


    Zipf’s law characterizes the relation between the frequency of a word in a text and the rank of that word in the frequency table. It states that, for a text in a natural language, the frequency versus rank relation is an approximate power law. For a few years in the mid to late 1990s, Zipf’s law was intensely discussed in the context of genomic sequences, where a word is an oligonucleotide of a given length and a k-nucleotide word is called a k-mer. No clear consensus was reached as to whether, as a general rule, the word frequencies in genomic sequences, or in some specific portion thereof, obey a Zipf’s law. Here we revisit the issue by studying the frequency versus rank relations of a large number of complete genomes, and of parts of genomes having different biological functions. We show that the nucleotide composition has an influence on the frequency versus rank relation of a genomic sequence that is strong enough to mask whatever Zipf’s-law behavior the sequence may possess. Once this influence is removed, all genomes obey the same broadly defined classes of Zipf’s laws, with the most important class-defining factor being the length of the k-mers, that is, the integer k. For eukaryotes, the Zipf’s laws for the exonic and intronic segments of the genome differ significantly. Based on the observation that the Zipf’s law of a sequence is determined by the subset of k-mers having the highest frequencies of occurrence, we derive a relation between the Zipf’s-law exponent and the high-frequency tail of the frequency distribution, and infer that for genomes in general the high-frequency tail is best represented by an exponential function, as opposed to linear, logarithmic, or power-law functions.
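The "approximate power law" between frequency and rank can be illustrated with a short sketch that estimates a Zipf exponent by an ordinary least-squares fit of log frequency against log rank. The record does not describe the thesis's actual fitting procedure, so this is an assumption. The function name and the `top_fraction` parameter (which restricts the fit to the highest-frequency words) are hypothetical.

```python
import math


def zipf_exponent(frequencies, top_fraction=1.0):
    """Estimate zeta in f(r) ~ r**(-zeta) from a log-log least-squares fit.

    Only the top `top_fraction` of the ranked, non-zero frequencies is used.
    """
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two non-zero frequencies")
    n = max(2, int(len(ranked) * top_fraction))
    ranked = ranked[:n]

    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx  # slope of log f versus log r
    return -slope      # the Zipf exponent is the negated slope
```

As a usage example, an exact power law such as `[100.0 / r for r in range(1, 51)]` should return an exponent very close to 1. Real k-mer frequencies will follow a power law only approximately, so the estimate depends on how much of the ranking is included in the fit.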

    Abstract (Chinese)
    ABSTRACT
    Preface
    Acknowledgements
    1. Introduction
    1.1 The carrier of biological information
    1.1.1 The origin of life
    1.1.2 The structure of genomic sequences
    1.2 Evolutionary modes of genomic sequences
    1.2.1 Mutation and recombination of genomic sequences
    1.2.2 Natural selection and species classification
    1.3 Properties of random systems
    1.3.1 Definition of randomness
    1.3.2 The central limit theorem
    1.4 Zipf’s law and observed phenomena
    1.4.1 Bibliometrics of textual information
    1.4.2 What is Zipf’s law?
    1.4.3 N-tuple Zipf’s law for genome sequences
    1.4.4 Zipf-like behavior in protein expression
    1.4.5 Zipf’s law is everywhere
    1.5 Properties and applications of Zipf distributions
    1.5.1 The Principle of Least Effort gives rise to the robustness of Zipf distributions
    1.5.1.1 Furusawa’s simple concentration-diffusion model, 2003
    1.5.1.2 Ogasawara’s theoretical evolutionary model of genetic drift and natural selection, 2009
    1.5.1.3 Bernat’s use of algorithmic information theory to simulate urban population change, 2010
    1.5.1.4 Other examples
    1.5.2 Scale invariance and its exponent ζ
    1.5.2.1 Thermal nuclear fragmentation of xenon (Xe): the exponent of the fragment distribution as new evidence for a liquid-gas phase transition
    1.5.2.2 Examining cancer classification with the exponent of the maximum-likelihood distribution of gene expression levels
    1.5.2.3 City population distributions, forest resource scale distributions, and optimization
    1.5.3 Shannon entropy H and redundancy R as measures of information
    1.5.3.1 The noncoding content of genome sequences affects entropy and redundancy
    1.5.3.2 Does the G+C content of genomic sequences affect the results?
    1.5.4 The Zipf exponent ζ and the long-range correlation exponent α in sequence models
    1.5.4.1 Bounds on the long-range correlation exponent and the Zipf exponent of control sequences
    1.5.4.2 Zipf behavior and long- or short-range correlations are not equivalent
    2. Materials and Methods
    2.1 Complete genome sequences
    2.2 k-tuple Zipf m-Set analysis of genomic sequences
    2.2.1 Sliding window and the k-tuple Zipf analysis
    2.2.2 Relative frequency
    2.2.3 Relative subset frequency
    2.3 Rank-Probability density function Histogram (RPDF Histogram)
    2.4 High-frequency and low-frequency words separated at a 2% threshold
    2.4.1 Zipf m-set plots of DNA sequence k-tuples and the high-frequency-word test
    2.4.1.1 Function tests on the Zipf m-set plot
    2.4.1.2 Limitations of the rank-probability distribution (RPDF) and its replacement by the probability distribution (PDF)
    2.4.1.3 Function tests on the probability distribution
    3. Results
    3.1 3-tuple Zipf analysis of genomes with different p and their corresponding random sequences
    3.1.1 Zipf plots and Zipf m-set plots
    3.1.2 Zipf (m-set) exponents of random sequences
    3.1.3 Rank-probability distribution histograms
    3.1.4 Random sequences highlight the advantage of the Zipf m-set analysis
    3.2 A mathematical comparison of relative frequency and relative subset frequency
    3.2.1 Why is the relative frequency of a random sequence staircase-shaped?
    3.2.2 The relative frequency of k-tuples in a random sequence has k+1 steps
    3.2.3 The relative subset frequency of a random sequence has only one step
    3.3 Universality of the Zipf m-set exponent
    3.3.1 Relation among k-tuple length, species classification, and exponent
    3.3.2 Effect of sequence length and p on the fitted exponent
    3.3.3 Five classes defined by ranges of p and length
    3.3.4 Zipf exponents of genome sequences, genic regions, intergenic regions, exons, and introns
    4. Discussion
    4.1 Species’ exponents and evolutionary relationships
    4.2 Relative subset frequency is not affected by the p of the sequence
    4.3 Curves of the Zipf m-set plot
    4.3.1 Randomness of low-frequency words
    4.3.2 No particular benefit for classification by form
    4.4 The Zipf exponent is independent of sequence type and depends on the p and length of the sequence
    4.4.1 The exponent does not differ across sequence classes; nine classes defined with log(L) = 5.4 and 6.2 as new boundaries
    4.4.2 In short sequences the exponent differs significantly with p and shows no particular dependence on length
    4.4.3 Exploring species’ Zipf exponents across different sequence types
    4.4.4 The Zipf exponent is independent of sequence type
    References
    Appendix tables

    1. Mantegna, R.N., et al., Linguistic Features of Noncoding DNA-Sequences. Physical Review Letters, 1994. 73(23): p. 3169-3172.
    2. Mantegna, R.N., et al., Systematic Analysis of Coding and Noncoding DNA-Sequences Using Methods of Statistical Linguistics. Physical Review E, 1995. 52(3): p. 2939-2950.
    3. Ramsden, J.J. and J. Vohradsky, Zipf-like behavior in procaryotic protein expression. Physical Review E, 1998. 58(6): p. 7777-7780.
    4. Li, W.T., Zipf’s Law in Importance of Genes for Cancer Classification Using Microarray Data. J. Theor. Biol., 2002. 219: p. 539–551.
    5. Hernando, A., C. Vesperinas, and A. Plastino, Fisher information and the thermodynamics of scale-invariant systems. Physica A, 2010. 389: p. 490-498.
    6. Tan, M.H., et al., Relationship between Zipf dimension and fractal dimension of city-size distribution. Geographical Research, 2004. 23(2): p. 243-248.
    7. Gong, X.Q. and Z. Wang, A Note on the Zipf’s Law. Complex Systems and Complexity Science, 2008. 5(3): p. 73-78.
    8. Bernat, C.M., et al., Universality of Zipf’s law. Phys. Rev. E, 2010. 82: p. 011102.
    9. Yi, L.U., Analysis of forest resource scale using on Zipf’s law. Journal of Nanjing Forestry University (Natural Science Edition), 2009. 33(2): p. 73-76.
    10. Chen, H.D., The Footprint of Evolution Duplication - Universal Equivalent Length of Genomes, in NCU. 2009.
    11. Li, W.T., Zipf’s Law Everywhere. Glottometrics, 2003. 5: p. 14-21.
    12. Tsay, M.Y., Information-metrics and Document Properties. 2003, Taipei: Hwa Tai Publishing.
    13. Zipf, G.K., Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. 1949, Cambridge, MA: Addison-Wesley.
    14. You, R.Y., Zipf’s Law and the Distribution of Chinese Character Frequency. Journal of Chinese Information Processing, 1999. 14(3): p. 60-65.
    15. Kosmidis, K., A. Kalampokis, and P. Argyrakis, Language time series analysis. Physica A: Statistical Mechanics and its Applications, 2006. 370(2): p. 808-816.
    16. Manning, C.D., et al., Foundations of Statistical Natural Language Processing. 1999, MIT Press.
    17. Li, W.T., Random Texts Exhibit Zipf-Law-Like Word-Frequency Distribution. IEEE Transactions on Information Theory, 1992. 38(6): p. 1842-1845.
    18. Havlin, S., The Distance between Zipf Plots. Physica A: Statistical Mechanics and its Applications, 1995. 216(1-2): p. 148-150.
    19. Ferrer-i-Cancho, R. and R.V. Solé, Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences of the United States of America, 2003. 100(3): p. 788-791.
    20. Ferrer-i-Cancho, R. and B. Elvevag, Random Texts Do Not Exhibit the Real Zipf’s Law-Like Rank Distribution. PLoS ONE, 2010. 5(3): p. e9411.
    21. Bolán, B.C., et al., Statistical properties and linguistic coherence in noncoding DNA sequences. Rev. Mex. Fis. E, 2005. 51(2): p. 118–125.
    22. Flam, F., Hints of a Language in Junk DNA. Science, 1994. 266(5189): p. 1320.
    23. Konopka, A.K. and C. Martindale, Noncoding DNA, Zipf’s law, and language. Science, 1995. 268(5212): p. 789.
    24. Voss, R.F., Linguistic features of noncoding DNA sequences - Comment. Physical Review Letters, 1996. 76(11): p. 1978.
    25. Mantegna, R.N., S.V. Buldyrev, and A.L. Goldberger, Mantegna et al. Reply. Phys. Rev. Lett., 1996. 76: p. 1979-1981.
    26. Furusawa, C. and K. Kaneko, Zipf’s law in gene expression. Physical Review Letters, 2003. 90(8).
    27. Ogasawara, O., S. Kawamoto, and K. Okubo, Zipf’s law and human transcriptomes: an explanation with an evolutionary model. Comptes Rendus Biologies, 2003. 326: p. 1097-1101.
    28. Ogasawara, O. and K. Okubo, On Theoretical Models of Gene Expression Evolution with Random Genetic Drift and Natural Selection. PLoS ONE, 2009. 4(11): p. e7943.
    29. Powers, M., Applications and Explanations of Zipf’s Law. New Methods in Language Processing and Computational Natural Language Learning, ACL, 1998: p. 151-160.
    30. Adamic, L.A., Zipf’s law and the Internet. Glottometrics, 2002. 3: p. 143-150.
    31. Stanley, H.E., et al., Scaling features of noncoding DNA. Physica A: Statistical Mechanics and its Applications, 1999. 273(1-2): p. 1-18.
    32. Sellis, D. and Y. Almirantis, Power-laws in the genomic distribution of coding segments in several organisms: An evolutionary trace of segmental duplications, possible paleopolyploidy and gene loss. Gene, 2009. 447(1): p. 18-28.
    33. Han, D.D., et al., Nuclear fragmentation may exist in the Zipf law. Chinese Science Bulletin, 2000. 45(9): p. 913-918.
    34. Bonhoeffer, S., et al., No signs of hidden language in noncoding DNA. Physical Review Letters, 1996. 76(11): p. 1977.
    35. Peng, C.K., et al., Statistical Properties of DNA-Sequences. Physica A: Statistical Mechanics and its Applications, 1995. 221(1-3): p. 180-192.
    36. Peng, C.K., et al., Mosaic Organization of DNA Nucleotides. Physical Review E, 1994. 49(2): p. 1685-1689.
    37. Peng, C.K., et al., Long-Range Correlations in Nucleotide-Sequences. Nature, 1992. 356(6365): p. 168-170.
    38. Peng, C.K., et al., Finite-Size Effects on Long-Range Correlations - Implications for Analyzing DNA-Sequences. Physical Review E, 1993. 47(5): p. 3730-3733.
    39. Buldyrev, S.V., et al., Generalized Lévy-walk model for DNA nucleotide sequences. Phys. Rev. E, 1993. 47(6): p. 4514-4523.
    40. Azbel’, M.Y., Random Two-Component One-Dimensional Ising Model for Heteropolymer Melting. Phys. Rev. Lett., 1973. 31(9): p. 589-592.
    41. Czirok, A., et al., Correlations in Binary Sequences and a Generalized Zipf Analysis. Physical Review E, 1995. 52(1): p. 446-452.
    42. Voss, R.F., Evolution of Long-Range Fractal Correlations and 1/F Noise in DNA-Base Sequences. Physical Review Letters, 1992. 68(25): p. 3805-3808.
    43. Li, W.T., Expansion-Modification Systems - a Model for Spatial 1/F Spectra. Physical Review A, 1991. 43(10): p. 5240-5260.
    44. Li, W.T., Large-Scale Patterns in DNA Texts. Originally prepared for Scientific American, 1999: p. 1-10.
    45. Israeloff, N.E., M. Kagalenko, and K. Chan, Can Zipf distinguish language from noise in noncoding DNA? Physical Review Letters, 1996. 76(11): p. 1976.
    46. Trotta, E., et al., 1H NMR study of [d(GCGATCGC)]2 and its interaction with minor groove binding 4',6-diamidino-2-phenylindole. Journal of Biological Chemistry, 1993. 268(6): p. 3944-51.
    47. National center for biotechnology information genome database.
    48. Rice annotation project database.
    49. Hedges, S.B., The origin and evolution of model organisms. Nature Reviews Genetics, 2002. 3(11): p. 838-849.
    50. Hsieh, L.C., et al., Minimal model for genome evolution and growth. Physical Review Letters, 2003. 90(1).
    51. Chen, H.D., et al., Universal Global Imprints of Genome Growth and Evolution – Equivalent Length and Cumulative Mutation Density. PLoS ONE, 2010. 5(4): p. e9844.
