
Graduate student: 朱柏瑞 (Po-Jui Chu)
Thesis title: 使用強化學習模擬抑制新冠肺炎疫情
Simulations of Optimal Control of COVID-19 Pandemic Using Reinforcement Learning
Advisor: 陳健章 (Chien-Chang Chen)
Committee members:
Degree: Master
Department: College of Biomedical Science and Engineering - Graduate Institute of Biomedical Engineering
Year of publication: 2022
Academic year of graduation: 110
Language: Chinese
Pages: 113
Chinese keywords: 新冠肺炎, 深度學習, 強化學習, 流行病房室模型
Foreign keywords: COVID-19, Deep Learning, Reinforcement Learning, Compartmental models in epidemiology
  • COVID-19 is the disease caused by a novel coronavirus, SARS-CoV-2. A cluster of pneumonia cases of unknown cause was identified in Wuhan, China, in December 2019, and the outbreak subsequently spread rapidly worldwide; the virus transmits efficiently from human to human. SARS-CoV-2 spreads quickly, infection is prone to severe symptoms, and the disease has had an enormous global impact. Before sufficient vaccines were available, substantial medical resources together with policies reducing human mobility were required to suppress transmission effectively. Policies to reduce the spread of SARS-CoV-2 include border controls, mandatory or voluntary lockdowns, quarantine, social distancing, mask wearing, and vaccination. These policies curb viral spread by restricting human activity and contact, but excessive restrictions harm the economy. The goal of this study is to apply the reinforcement learning techniques A3C (Asynchronous Advantage Actor-Critic) combined with PPO (Proximal Policy Optimization) to explore the optimal balance between policy stringency and the economy, and to analyze how policy timing and differences in population density affect the degree of transmission. We simulate with the SEIR compartmental model (Susceptible-Exposed-Infectious-Recovered model) and tune the transition parameters among its states (susceptible, exposed, infectious, recovered or deceased) so that the model's basic reproduction number matches that of COVID-19. In our experiments, we analyze confirmed-case data from January 2020 to October 2021 for four Japanese prefectures: Hokkaido, Okinawa, Osaka, and Tokyo. The data in this period contain five infection peaks, which are difficult to simulate directly as a whole picture with a closed SEIR compartmental model. We therefore build five matching SEIR environments, one per peak, and let an optimally trained agent interact with these five environments to reach the goal. Training used an i9-10980XE CPU (18 cores, 36 threads) and an RTX 3090 GPU with 24 GB of memory, with the A3C technique running 18 workers on the host's threads. We find that the average reward rises during training and plateaus after 500 episodes. The results show that the trained agent can effectively suppress the rise in confirmed cases. From the action policies produced by the agent, strict policies are implemented at three moments: while the number of infectious cases is rising, after it has risen, and while it remains unchanged; in high-risk regions strict policies are applied more often on average, and policies are relaxed as confirmed cases decline. We also find that population-weighted density represents the crowding of a region better than ordinary population density, so it is the more accurate measure when studying viral transmissibility within a region. Furthermore, we modify the SEIR model by adding a Quarantined (Q) compartment to form an SEIQR model; the experiments show that modifying the SEIR model allows various scenarios, and different infectious diseases, to be simulated. Whether our trained agent can be applied broadly to other infectious diseases, however, depends on whether the states provided by the environment are the same. If the environment states could be generalized to the essential information common to every infectious disease, sufficient for the agent to judge whether to adopt strict policies, then a reward function suitable for epidemiology could be built from this information and an agent applicable across infectious diseases could be trained; this is a topic for future research.
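The SEIR dynamics described above can be sketched with a few lines of code. This is a minimal illustration, not the thesis's simulation environment: the parameter values (beta, sigma, gamma, the population size, and the initial cases) are illustrative assumptions chosen so that the basic reproduction number R0 = beta / gamma is about 2.5, in the range reported for early COVID-19.

```python
# Minimal discrete-time SEIR sketch (illustrative parameters, not the
# thesis's fitted values). Daily Euler steps over S, E, I, R.

def simulate_seir(beta=0.5, sigma=1/5.2, gamma=1/5.0,
                  n=1_000_000, i0=10, days=200):
    """Return the daily trajectory of the infectious compartment I."""
    s, e, i, r = n - i0, 0.0, float(i0), 0.0
    history = []
    for _ in range(days):
        new_e = beta * s * i / n   # S -> E: new exposures
        new_i = sigma * e          # E -> I: end of incubation
        new_r = gamma * i          # I -> R: recovery or death
        s -= new_e
        e += new_e - new_i
        i += new_i - new_r
        r += new_r
        history.append(i)
    return history

infectious = simulate_seir()
r0 = 0.5 / (1 / 5.0)  # beta / gamma = 2.5
print(f"R0 = {r0:.1f}, peak infectious ~ {max(infectious):.0f}")
```

With these assumed rates the epidemic peaks and then declines within the 200-day window, which is the single-wave behavior that makes a closed SEIR model unable to reproduce all five observed peaks at once.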


    Novel coronavirus disease (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus. COVID-19 originated in Wuhan, China, in early December 2019, and the epidemic quickly spread worldwide; the virus transmits efficiently from human to human. SARS-CoV-2 spreads rapidly, infection is prone to severe symptoms, and the disease has had a great impact on the world. In the absence of an adequate vaccine, significant medical resources and policies that limit human movement and contact, such as restrictions on gatherings, are needed to mitigate the epidemic. Policies to reduce the spread of SARS-CoV-2 include border controls, mandatory or voluntary lockdowns, quarantine, social distancing, mask wearing, and vaccination. These measures work by restricting human movement and contact; however, they also seriously impact the economy. We focus on exploring the optimal balance between policy stringency and the economy using reinforcement learning (RL), specifically Asynchronous Advantage Actor-Critic (A3C) combined with Proximal Policy Optimization (PPO). We use the compartmental SEIR model to train the agent and adjust the parameters of its four states (susceptible, exposed, infectious, and removed) so that the basic reproduction number of the SEIR model corresponds with that of COVID-19. In the experiments, we focus on four prefectures in Japan (Hokkaido, Okinawa, Osaka, and Tokyo) and use their tested-positive case data from January 2020 to October 2021. The data contain five infection peaks, and it is difficult for a closed compartmental SEIR model to make whole-picture simulations directly, as in the real situation. Hence, we create five environments to simulate these peaks, then use an optimally trained agent to interact with these environments to reach the goal. We train the agent on an i9-10980XE CPU (18 cores, 36 threads) and an RTX 3090 GPU with 24 GB of memory.
With 18 workers for A3C multi-threading during training, the average reward rises and plateaus after 500 episodes. The results show that the optimized agent can effectively suppress the increase in active cases. We also find that the agent implements strict policies when the number of infectious cases is increasing, has kept increasing for a few days, or remains unchanged; on average, strict policies are implemented more often in high-risk areas. Finally, population-weighted density represents the density of population in an area better than traditional population density, so it is more accurate to use population-weighted density for pandemic infectivity studies. We also modify the SEIR model by adding a Quarantined (Q) compartment to form an SEIQR model; the experiments show that various situations and various epidemic diseases can be simulated by changing the traditional SEIR model. However, whether our trained agent can be applied to different epidemic diseases depends on the states that the environment gives. If we can generalize these states across different epidemiological environments and find the necessary, crucial information that is sufficient for the agent to judge whether to implement strict policies, we can construct a general epidemiological reward function from this information and train an agent that applies to different epidemic diseases.
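The SEIQR extension mentioned above can be sketched by adding one compartment to a plain SEIR step. This is a toy illustration, not the thesis's model: the detection/quarantine rate q and all other parameter values are illustrative assumptions, and only non-quarantined infectious individuals are assumed to transmit.

```python
# Toy SEIQR sketch: infectious people move to a Quarantined compartment
# Q at an assumed rate q; quarantined people do not transmit and recover
# at rate gamma. All parameter values are illustrative assumptions.

def simulate_seiqr(beta=0.5, sigma=1/5.2, gamma=1/5.0, q=0.2,
                   n=1_000_000, i0=10, days=200):
    """Daily Euler steps; returns the peak of the infectious compartment I."""
    s, e, i, quar, r = n - i0, 0.0, float(i0), 0.0, 0.0
    peak = i
    for _ in range(days):
        new_e = beta * s * i / n   # S -> E, driven only by free I
        new_i = sigma * e          # E -> I
        new_q = q * i              # I -> Q (detection and isolation)
        rec_i = gamma * i          # I -> R
        rec_q = gamma * quar       # Q -> R
        s -= new_e
        e += new_e - new_i
        i += new_i - new_q - rec_i
        quar += new_q - rec_q
        r += rec_i + rec_q
        peak = max(peak, i)
    return peak

peak_q = simulate_seiqr(q=0.2)     # with quarantine
peak_no_q = simulate_seiqr(q=0.0)  # q = 0 reduces to plain SEIR
print(f"peak infectious: {peak_no_q:.0f} without Q, {peak_q:.0f} with Q")
```

Setting q = 0 recovers the plain SEIR model, which is what makes this kind of compartment-by-compartment modification convenient for simulating different intervention scenarios.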

    Table of Contents
    Abstract (Chinese)
    Abstract (English)
    Acknowledgments
    Table of Contents
    List of Figures
    List of Tables
    1. Introduction
       1-1 Motivation
       1-2 Objectives
    2. Reinforcement Learning Framework and Compartmental Models
       2-1 Reinforcement learning framework
           2-1-1 Reinforcement learning and Markov decision processes
           2-1-2 Agent
                 1. Actor-Critic methods
                 2. Asynchronous Advantage Actor-Critic (A3C)
                 3. Generalized Advantage Estimation (GAE)
                 4. Proximal Policy Optimization (PPO)
       2-2 Epidemic compartmental models
           2-2-1 SEIR model
       2-3 Neural networks
           2-3-1 Long short-term memory (LSTM)
           2-3-2 Bidirectional long short-term memory (Bidirectional LSTM)
    3. Materials and Methods
       3-1 Environment design
       3-2 Real-world environment simulation
       3-3 Agent design combined with the real-world environment
       3-4 Agent training
    4. Results and Analysis
       4-1 Results
       4-2 Analysis
           4-2-1 Analysis 1: Grading the actions taken by the agent
           4-2-2 Analysis 2: Timing of outlier actions and the concurrent change in infectious cases
           4-2-3 Analysis 3: Difference in infectious cases between PWD and PD without agent intervention
           4-2-4 Analysis 4: Changes in the effective reproduction number and the economy by action grade
    5. Conclusion
    References
