| Graduate student: | 王志宇 Chih-Yu Wang |
|---|---|
| Thesis title: | 基於深度學習之嵌入式即時行人偵測及追蹤系統 Using Deep Learning Real-Time Embedded System for Pedestrian Detection and Tracking |
| Advisors: | 范國清 Kuo-Chin Fan, 張陽郎 Yang-Lang Chang |
| Oral defense committee: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering |
| Year of publication: | 2018 |
| Academic year of graduation: | 106 |
| Language: | Chinese |
| Number of pages: | 65 |
| Chinese keywords: | 深度學習, 行人偵測, 行人目標跟蹤, 樹莓派 |
| Foreign-language keywords: | Deep Learning, Pedestrian Detection, Pedestrian Tracking, Raspberry Pi |
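As technology advances, robots will play an important role in home care, surveillance, unmanned stores, and other fields. Vision-based pedestrian tracking is a key technology here, because it lets a robot follow a person anywhere without range restrictions. The pedestrian tracking method proposed in this thesis runs reliably in real time on a Raspberry Pi 3 with only a 1.2 GHz ARM CPU and 1 GB of RAM, and the required hardware is inexpensive (US$35), which broadens the range of IoT applications, for example smart luggage and autonomous shopping carts. Such applications will change our daily lives in the future.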
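The accuracy and speed of the pedestrian detector are critical to a reliable tracking system. However, today's leading deep-learning object detectors, such as Faster R-CNN and YOLO, still consume large amounts of computation and storage. They need a high-end CPU or GPU to run in real time (30 fps) and remain difficult to deploy on embedded platforms that have only a low-end CPU or an FPGA. This thesis therefore optimizes the model architecture and applies training techniques to obtain Brisk-YOLO, a lightweight and reliable human detection model. Compared with other object detection methods on the public INRIA and PASCAL VOC datasets, the detector maintains pedestrian detection accuracy while running 55 times faster than Tiny-YOLO, reaching 22 fps on the Raspberry Pi 3.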
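In addition, to further reduce computation, the detector does not run on every frame. It is used for correction only when the tracker's error worsens or the target is lost. We selected fast object tracking and person re-identification methods to ensure that the system operates stably.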
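Experiments on the BoBoT dataset show that, in realistic pedestrian tracking scenarios, the system outperforms other real-time long-term tracking algorithms in both speed and accuracy.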
A vision-based person-following method is important for various applications in human-robot interaction (HRI). The accuracy and speed of the person detector determine the performance of a reliable person-following system. However, state-of-the-art CNN-based object detectors such as YOLO require large amounts of memory and the computational resources of high-end GPUs for real-time applications, and they are unable to run on embedded devices with low-end CPUs or FPGAs. Therefore, this paper presents Brisk-YOLO, a lightweight but reliable human detector developed by optimizing the model architecture and using training techniques. This method greatly reduces the amount of computation while preserving the accuracy of person detection.
In addition, to reduce the computation cost, the detector is not applied to every frame. It is applied only at the beginning, to initialize the localization of the human target, and afterwards to alleviate accumulated tracking error and to handle events in which the target is missing or occluded. We selected fast object tracking and person re-identification methods to ensure that the system runs steadily.
The experimental results indicate that this system achieves real-time and reliable operation on the Raspberry Pi 3, with only a 1.2 GHz ARM CPU and 1 GB of RAM, on videos of real-world person-following scenarios, and that its accuracy is better than that of other long-term tracking methods. The proposed system can re-identify a person after periods of occlusion and distinguish the target from other people, even if they look similar. On the BoBoT benchmark, the system achieved an average IoU of 73.39%, which is higher than that of state-of-the-art algorithms.
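For reference, the average IoU reported above is built on the intersection-over-union of a predicted box and a ground-truth box. The sketch below uses the standard definition with boxes given as `(x, y, width, height)`. The box format and the per-frame averaging in `mean_iou` are assumptions, since the abstract does not describe the exact scoring protocol used for BoBoT.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Overlap along each axis, clamped at zero for disjoint boxes.
    overlap_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    overlap_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = overlap_w * overlap_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(predicted: Sequence[Box], truth: Sequence[Box]) -> float:
    """Average IoU over paired per-frame boxes (assumed averaging scheme)."""
    if len(predicted) != len(truth) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(iou(p, t) for p, t in zip(predicted, truth)) / len(predicted)
```

Two identical boxes score 1.0, disjoint boxes score 0.0, and two equal-sized boxes that overlap by half their width score 1/3.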