| Graduate student: | 黃啟軒 Chi-Hsuan Huang |
|---|---|
| Thesis title: | 利用虛擬資料建構深度學習訓練集以實現凌空書寫應用 Using Synthetic Data to Construct Deep Learning Datasets for Air-Writing Applications |
| Advisor: | 蘇柏齊 |
| Committee members: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science & Information Engineering |
| Year of publication: | 2021 |
| Graduation academic year: | 109 (ROC calendar) |
| Language: | English |
| Pages: | 58 |
| Keywords (Chinese): | 指尖偵測、凌空書寫、合成資料、文字辨識 |
| Keywords (English): | Fingertip detection, air-writing, synthetic datasets, character recognition |
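Air-writing is a novel input method for human-computer interaction: users naturally write in the air the characters they want to enter into a machine or device. Fingertips are detected in real time in the frames captured by a camera, the fingertip coordinates are linked into a trajectory, and the character represented by that trajectory is then recognized. Air-writing can serve as a text input method for devices such as smart glasses, and its contactless nature also suits hygiene-sensitive settings, for example by lowering the risk that hospital users become infected with a virus through touching equipment. This research proposes deep-learning-based air-writing techniques for both first-person and third-person views. Because deep learning depends on large amounts of labeled data, we chose to build the training dataset with Unity3D, compositing the constructed virtual hand model onto random images or single-color backgrounds to generate labeled synthetic data quickly and effectively. We vary the hand model to simulate the rotation and movement that occur during writing, which increases data diversity. For the more complex third-person scenes, we additionally insert randomly varied faces and torsos so that the synthetic data more closely resembles real conditions. An object detection model detects the fingertip positions to form the character trajectory, and the redundant strokes produced during writing are removed so that the processed trajectory is closer to the character itself. We combine handwritten and printed characters into a composite dataset to train the character recognition model, which adopts the ResNeSt architecture to recognize nearly 5,000 Chinese characters. Experimental results show that the large volume of accurately labeled synthetic data we generate can effectively train the models and helps realize real-time air-writing in both first-person and third-person settings.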
Air-writing is the practice of moving a finger through the air to write a character. By detecting the fingertip in real time in the frames of a captured video, the fingertip trajectory can be formed and used for character recognition. Air-writing may thus serve as a new human-computer interface for entering text on devices such as smart glasses or computers that require touchless operation. This research proposes deep-learning techniques for first-person and third-person air-writing. We first use Unity3D to synthesize a hand model, which is superimposed onto randomly chosen images or single-color backgrounds to generate labeled data. An object detection model is trained on this data to detect fingertip positions. The trajectory is then extracted to form a single-stroke character, and post-processing removes redundant connections within the character. A dataset containing handwritten and printed characters is built to train a classification model. The experimental results show that the large volume of high-quality labeled data can effectively train the models, realizing first- and third-person air-writing.
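The step from per-frame fingertip detections to a trajectory that a classifier can consume might look roughly like the following minimal sketch. It is an illustration under stated assumptions, not the thesis's implementation: the `(x, y, score)` detection format, the `min_score` threshold, the 64-pixel canvas, and the function names `build_trajectory` and `normalize_trajectory` are all hypothetical, and the removal of redundant connections described in the abstract is not covered here.

```python
from typing import List, Optional, Tuple

# Hypothetical detection format: (x, y, score), or None when a frame has no detection.
Detection = Optional[Tuple[float, float, float]]
Point = Tuple[float, float]


def build_trajectory(detections: List[Detection], min_score: float = 0.5) -> List[Point]:
    """Collect fingertip positions, in frame order, from sufficiently confident detections."""
    trajectory: List[Point] = []
    for det in detections:
        if det is None:
            continue  # the detector found no fingertip in this frame
        x, y, score = det
        if score >= min_score:
            trajectory.append((x, y))
    return trajectory


def normalize_trajectory(trajectory: List[Point], size: int = 64, margin: int = 4) -> List[Point]:
    """Scale and centre a trajectory inside a size x size canvas, preserving aspect ratio."""
    if not trajectory:
        return []
    xs = [p[0] for p in trajectory]
    ys = [p[1] for p in trajectory]
    min_x, min_y = min(xs), min(ys)
    width, height = max(xs) - min_x, max(ys) - min_y
    extent = max(width, height)
    if extent == 0:
        # All points coincide: place them at the canvas centre.
        centre = size / 2
        return [(centre, centre)] * len(trajectory)
    scale = (size - 2 * margin) / extent
    # Offsets centre the scaled bounding box on the canvas.
    off_x = (size - width * scale) / 2
    off_y = (size - height * scale) / 2
    return [((x - min_x) * scale + off_x, (y - min_y) * scale + off_y) for x, y in trajectory]


if __name__ == "__main__":
    frames = [(100.0, 200.0, 0.9), None, (150.0, 200.0, 0.3), (200.0, 300.0, 0.8)]
    print(normalize_trajectory(build_trajectory(frames)))
```

Normalizing to a fixed canvas is one common way to make the recognizer insensitive to where and how large the character was written in the camera frame; the thesis may handle this differently.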