Graduate student: 黃雪玲 (Shiue-Ling Huang)
Thesis title: Anomaly Detection for PM2.5 Sensors via Transfer Learning
Advisor: 孫敏德 (Min-Te Sun)
Oral examination committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Year of publication: 2021
Academic year of graduation: 109 (ROC calendar)
Language: English
Number of pages: 50
Keywords: Air quality, Deep learning, Anomaly detection
    According to the World Health Organization, approximately 7 million people die each year from diseases caused by air pollution. Among air pollutants, PM2.5 is considered the most harmful to humans. To monitor ambient PM2.5 concentrations, organizations in several countries have begun deploying large numbers of low-cost air quality sensors. However, because these sensors are cheaply built and may be installed in unsuitable locations, the readings of some sensors can be erratic. When PM2.5 readings are used for data analysis, these erratic readings should be identified and removed. In this thesis, we propose a deep learning-based anomaly detection system for air quality sensors. The study uses two datasets: PurpleAir from the South Coast Air Quality Management District and Airbox from Academia Sinica. Although PM2.5 data in the Airbox dataset are abundant, they lack ground-truth labels for anomalous sensors. In contrast, sensors in PurpleAir are less densely distributed, but their data come with indoor and outdoor labels. To take advantage of both datasets, the ADF framework is adopted to label the Airbox dataset, which is then used to train a model. The PurpleAir dataset is then used for transfer learning to retrain the model. The PurpleAir test set is used to evaluate four models: an LSTM model and a hybrid model (combining LSTM and XGBoost), both obtained through transfer learning, and an XGBoost model and an LSTM model trained using only the PurpleAir dataset. The experimental results show that transfer learning significantly improves model performance, and that the hybrid model with transfer learning achieves the best performance on all metrics.
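    The pretrain-then-retrain workflow described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the thesis uses an LSTM and an LSTM+XGBoost hybrid, whereas the sketch swaps in a plain logistic regression so that it runs with the Python standard library alone. The data are synthetic, and every name here (`make_data`, `train`, `log_loss`, the two toy features, the `shift` parameter) is a hypothetical stand-in. A large, automatically labelled source set plays the role of the ADF-labelled Airbox data, and a small target set plays the role of PurpleAir.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, x):
    # weights[0] is the bias; the remaining weights pair with the features in x.
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return sigmoid(z)


def log_loss(weights, X, y):
    # Mean binary cross-entropy, clipped to avoid log(0).
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(weights, xi), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)


def train(X, y, weights=None, epochs=200, lr=0.5):
    """Full-batch gradient descent; pass `weights` to warm-start (fine-tune)."""
    w = list(weights) if weights is not None else [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xij in enumerate(xi):
                grad[j + 1] += err * xij
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def make_data(n, shift, seed):
    # Synthetic stand-in: two features in [0, 1]; a reading is "anomalous"
    # when their sum exceeds a threshold that differs between domains.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.random(), rng.random()]
        X.append(x)
        y.append(1 if x[0] + x[1] > 1.0 + shift else 0)
    return X, y


# Source domain: large, machine-labelled set (assumed analogue of Airbox + ADF).
X_src, y_src = make_data(400, shift=0.0, seed=1)
# Target domain: small labelled set (assumed analogue of PurpleAir).
X_tgt, y_tgt = make_data(40, shift=0.2, seed=2)

pretrained = train(X_src, y_src, epochs=200)                      # step 1: pretrain
fine_tuned = train(X_tgt, y_tgt, weights=pretrained, epochs=50)   # step 2: retrain
scratch = train(X_tgt, y_tgt, epochs=50)                          # baseline: target only

for name, w in [("pretrained", pretrained), ("fine-tuned", fine_tuned), ("scratch", scratch)]:
    print(name, "target loss:", round(log_loss(w, X_tgt, y_tgt), 3))
```

    The only difference between the fine-tuned model and the target-only baseline is the starting point of the optimization, which is the essence of the transfer step. How much that helps on real sensor data depends on how closely the two domains are related.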

    1 Introduction
    2 Related Work
      2.1 Statistics-based approaches
      2.2 Machine Learning-based approaches
        2.2.1 Non-parametric models
        2.2.2 Parametric models
        2.2.3 Deep learning-based models
      2.3 Spatial correlation data
    3 Preliminary
      3.1 Airbox
      3.2 Missing Data Imputation
      3.3 Data Balancing
      3.4 XGBoost
      3.5 Recurrent Neural Network (RNN)
      3.6 Statistical Anomaly Detection Framework
      3.7 Transfer Learning
    4 Design
      4.1 Problem Formulation
      4.2 System Design
        4.2.1 Data Collection
        4.2.2 Data Pre-processing
        4.2.3 Spatial Feature
        4.2.4 Anomaly Detection Model Design
        4.2.5 Hybrid model
        4.2.6 Transfer learning
    5 Performance
      5.1 Experiment Data
      5.2 Evaluation Metric
        5.2.1 Confusion matrix
      5.3 Model tuning
        5.3.1 The BigTaipei dataset
        5.3.2 The San Francisco dataset
      5.4 Performance Comparison and Analysis
    6 Conclusions
    Reference

