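| Graduate student: | 吳禹岑 Yu-Tsen Wu |
|---|---|
| Thesis title: | 以集成學習預測MLB投手是否具有獲得賽揚獎的潛力 (Using Ensemble Learning to Predict Whether MLB Pitchers Have the Potential to Win the Cy Young Award) |
| Advisor: | 洪盟凱 |
| Oral defense committee: | |
| Degree: | Master |
| Department: | College of Science - Department of Mathematics |
| Year of publication: | 2023 |
| Academic year of graduation: | 111 |
| Language: | Chinese |
| Number of pages: | 50 |
| Chinese keywords: | 美國職棒大聯盟, 集成學習 |
| English keywords: | Major League Baseball, Ensemble Learning |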
This study aims to predict whether an MLB pitcher has the potential to win the Cy Young Award. The dataset consists of MLB pitching statistics recorded before July 31 of each season from 2008 to 2022 (excluding 2020), together with the Cy Young Award winners of those seasons. The pitching statistics (games G, games started GS, wins W, losses L, and so on) serve as the independent variables, and whether the pitcher won the Cy Young Award (0 for non-winners, 1 for winners) serves as the dependent variable. These data are used to train ensemble learning models and to make predictions, and the study also looks for relationships between winning the award and the pitching features. Data preprocessing consisted of the following four steps:
1. Add BsR, a sabermetric statistic, to the dataset.
2. Remove every sample whose W or ERA is missing (NaN), and fill missing values of L, SV, and SO/W with 0.
3. Convert the IP (innings pitched) values: a fractional part of .1 becomes 1/3 of an inning, .2 becomes 2/3 of an inning, and so on.
4. Add the target column "Cy Young Award" (0 for pitchers who did not win the award, 1 for winners).
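Preprocessing steps 2 to 4 above can be sketched in plain Python as follows. This is only an illustrative sketch, not the thesis's actual code: the function names (`innings_to_float`, `is_missing`, `preprocess`), the row layout with `Name` and `Year` keys, and the sample rows are all assumptions made for the example, and the pitchers shown are made up. Step 1 (adding BsR) is omitted because the abstract does not state how it was computed.

```python
import math


def innings_to_float(ip):
    """Convert box-score innings (6.1 means 6 innings plus 1 out) to true innings."""
    whole = int(ip)
    # The digit after the decimal point counts outs, so it can only be 0, 1, or 2.
    outs = round((ip - whole) * 10)
    if outs not in (0, 1, 2):
        raise ValueError(f"unexpected innings value: {ip}")
    return whole + outs / 3


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def preprocess(rows, winners):
    """Apply steps 2-4 to a list of dict rows.

    `winners` is a set of (name, year) pairs; this layout is hypothetical.
    """
    cleaned = []
    for row in rows:
        # Step 2a: drop samples whose W or ERA is missing.
        if is_missing(row["W"]) or is_missing(row["ERA"]):
            continue
        new_row = dict(row)
        # Step 2b: fill missing L, SV, and SO/W with 0.
        for key in ("L", "SV", "SO/W"):
            if is_missing(new_row[key]):
                new_row[key] = 0
        # Step 3: .1 -> 1/3 inning, .2 -> 2/3 inning.
        new_row["IP"] = innings_to_float(new_row["IP"])
        # Step 4: binary target, 1 for award winners and 0 otherwise.
        is_winner = (new_row["Name"], new_row["Year"]) in winners
        new_row["Cy Young Award"] = 1 if is_winner else 0
        cleaned.append(new_row)
    return cleaned


# Made-up sample rows, for illustration only.
sample_rows = [
    {"Name": "Pitcher A", "Year": 2019, "W": 12, "L": 3, "ERA": 2.50,
     "SV": float("nan"), "SO/W": 5.1, "IP": 130.2},
    {"Name": "Pitcher B", "Year": 2019, "W": float("nan"), "L": 1, "ERA": 4.10,
     "SV": 0, "SO/W": 2.0, "IP": 20.1},
    {"Name": "Pitcher C", "Year": 2019, "W": 4, "L": None, "ERA": 3.75,
     "SV": 10, "SO/W": None, "IP": 45.0},
]
sample_winners = {("Pitcher A", 2019)}
sample_cleaned = preprocess(sample_rows, sample_winners)
```

In this sketch, Pitcher B is dropped because W is missing, while the missing SV, L, and SO/W values of the other two rows are set to 0. Converting IP this way matters because the box-score value 130.2 is not a decimal number; it stands for 130 innings plus two outs.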