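Graduate student: 徐瑋辰 (Wei-Chern Hsu)
Thesis title: A Survival Tree based on Stabilized Univariate Score Tests with High Dimensional Covariates
Advisor: 江村剛志 (Takeshi Emura)
Committee members:
Degree: Master
Department: College of Science - Graduate Institute of Statistics
Year of publication: 2020
Academic year of graduation: 108 (ROC calendar)
Language: English
Number of pages: 78
Keywords (Chinese, translated): Right censoring, High-dimensional variables, Gene sequence
Keywords (English): Right censoring, Tree, High dimensional covariate, Gene selection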
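  • In medical research, prognostic factors and their corresponding prediction
    models have been widely used. Survival trees and survival forests are
    currently very popular nonparametric methods for developing prediction
    models for survival data. They are highly flexible and can reasonably detect
    interactions among certain variables without requiring many model
    assumptions. In addition, through binary splitting and recursive
    partitioning, a single survival tree can produce multiple prognostic factors
    and divide the sample into multiple groups. In this thesis, we identify why
    survival trees are difficult to implement with high-dimensional covariates
    and how the difficulty can be addressed. We also point out that the
    traditional logrank test has fatal drawbacks when used to detect the nodes
    of a tree. To solve these problems, we propose stabilized univariate score
    statistics to find the nodes of a tree. Furthermore, high-dimensional
    variable screening and decision making can be carried out without any
    iterative optimization, and certain special computations improve
    efficiency. We also show that, when the logrank test cannot provide an
    adequate statistical decision, the proposed method resolves the problem and
    yields a survival tree with greater predictive ability.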


    The analysis of prognostic factors and prediction models has been studied extensively in
    medical research. Survival trees and forests are popular non-parametric tools for developing
    prognostic models for survival data. They offer great flexibility and can automatically detect
    certain types of interactions without the need to specify them beforehand. Moreover, a single tree
    can naturally classify subjects into different groups according to their survival prognosis based on
    their covariates. In this thesis, we point out the difficulty of fitting tree-based models with
    high-dimensional covariates. Furthermore, we point out that the traditional logrank tests used for
    detecting the nodes of a tree have fatal drawbacks. To overcome these difficulties, we
    propose stabilized univariate score statistics to find the nodes of a tree. We show that the
    high-dimensional score tests can be performed without any iteration or optimization, leading to a
    computationally efficient test procedure. We also show that the proposed method can resolve the
    drawbacks of the logrank tests, leading to a highly precise tree. Simulation studies are performed
    to compare the performance of the proposed method with that of the existing method. The lung cancer
    dataset is analyzed for illustration.
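
The idea of choosing a tree node from univariate score tests "without any iteration and optimization" can be illustrated with a minimal sketch. It is written in Python rather than with the thesis's R functions, and it is not the author's implementation. It computes the standard Cox partial-likelihood score test of a zero coefficient for each covariate separately (Breslow handling of ties), which needs only one pass over the risk sets. The function names `univariate_score_z` and `best_split_covariate` are hypothetical, and the way the stabilizing constant `d0` enters (added to the variance) is an assumption, since this record does not give the formula.

```python
import math


def univariate_score_z(times, events, x, d0=0.0):
    """Score test of beta = 0 in a one-covariate Cox model.

    Returns Z = U / sqrt(V + d0), where U is the score and V the
    information at beta = 0. `d0` is a hypothetical stabilizing
    constant; adding it to V is an assumption of this sketch.
    """
    n = len(times)
    # Walk from the latest time to the earliest so the risk set only grows.
    order = sorted(range(n), key=lambda i: times[i], reverse=True)
    s0 = s1 = s2 = 0.0          # risk-set count, sum of x, sum of x^2
    score = info = 0.0
    k = 0
    while k < n:
        t = times[order[k]]
        m = k
        # Add every subject observed at time t to the risk set.
        while m < n and times[order[m]] == t:
            i = order[m]
            s0 += 1.0
            s1 += x[i]
            s2 += x[i] ** 2
            m += 1
        mean = s1 / s0
        var = max(s2 / s0 - mean ** 2, 0.0)   # guard against rounding
        # Each event at time t contributes (x - risk-set mean) to the score.
        for i in order[k:m]:
            if events[i]:
                score += x[i] - mean
                info += var
        k = m
    denom = info + d0
    return score / math.sqrt(denom) if denom > 0 else 0.0


def best_split_covariate(times, events, columns, d0=0.0):
    """Return (index of covariate with largest |Z|, list of all Z values)."""
    z = [univariate_score_z(times, events, col, d0) for col in columns]
    best = max(range(len(z)), key=lambda j: abs(z[j]))
    return best, z


if __name__ == "__main__":
    # Toy data: six subjects, all with observed events (event indicator 1).
    t = [1, 2, 3, 4, 5, 6]
    d = [1, 1, 1, 1, 1, 1]
    genes = [[1, 1, 1, 0, 0, 0],   # high values fail early
             [0, 1, 0, 1, 0, 1]]   # no clear relation to survival
    print(best_split_covariate(t, d, genes))
```

For a binary covariate this statistic corresponds to a logrank-type comparison of the two groups. Because each covariate is tested on its own with closed-form sums, the cost grows linearly in the number of covariates, which is what makes this kind of screening feasible in high dimensions.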

    Content
    Abstract (Chinese)
    Abstract
    Acknowledgements
    1. Introduction
    2. Background
        2.1 Problem Setup
        2.2 Classification and Regression Tree
            2.2.1 Introduction of Tree Algorithm
            2.2.2 Splitting criterion
            2.2.3 Stopping criterion
            2.2.4 Logrank test
            2.2.5 Score test
    3. Proposed method
        3.1 Univariate Score Test
        3.2 Matrix-based computation
        3.3 Survival tree algorithm
        3.4 Prognostic Prediction
    4. R package
        4.1 uni.logrank
        4.2 KM.split
        4.3 uni.tree
        4.4 feature.selected
        4.5 risk.classification
    5. Simulations
        5.1 Simulation designs
        5.2 Simulation result
    6. Data analysis
        6.1 The Lung Cancer data
        6.2 Binary splitting
        6.3 Survival tree
            6.3.1 Logrank tree by uni.logrank() and uni.tree()
            6.3.2 Modified score tree by uni.score() and uni.tree()
            6.3.3 Conditional inference tree by ctree()
        6.4 Analytic results
    7. Conclusions
    Reference
    Appendix
        Appendix A: Performance Evaluation
            A1. Tree model and notation settings
            A2. Evaluation index
            A3. c-index
            A4. Likelihood ratio test
        Appendix B: Code for data analysis
        Appendix C: Searching optimal threshold and constant d_0 to build an univariate tree for lung cancer data
        Appendix D: Optimal the adjust P-value for ctree() for lung cancer data

