| Graduate Student: | 王偉齊 Wei-Chi Wang |
|---|---|
| Thesis Title: | 以點雲數據做3D物件偵測、辨識、與方位估計的深度學習系統 (3D Object Detection, Recognition, and Position Estimation Using a Deep Learning System with Point Cloud Data) |
| Advisor: | 曾定章 Din-Chang Tseng |
| Degree: | Master |
| Department: | Department of Computer Science & Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2022 |
| Academic Year of Graduation: | 110 (ROC calendar) |
| Language: | Chinese |
| Number of Pages: | 67 |
| Keywords: | 3D object detection, point cloud, attention, GIoU loss, focal loss |
With the rapid rise of deep learning, its applications in object detection and recognition have matured in recent years, and detection technology has gradually expanded from 2D to 3D applications such as self-driving cars, security monitoring, human-computer interaction, and traffic control. 3D images carry depth information, but current 3D object detection methods are heavily influenced by 2D detectors: to reuse 2D architectures, they often convert 3D data into regular grids (voxel grids or bird's-eye-view images) or rely on detections in 2D images to propose 3D boxes, and few works attempt to detect objects directly in 3D point clouds. Because of the sparse nature of the data, predicting bounding-box parameters directly from scene points faces a major challenge: a 3D object centroid can lie far from any surface point and is therefore hard to regress accurately. In this research, we propose a neural network that directly estimates the position, orientation, and size of 3D objects: given an input point cloud, the network extracts features, predicts each object's class, position, and heading angle, and outputs 3D bounding boxes.
Our model is revised from the 3D detection network VoteNet. We make two improvements. First, we add an attention mechanism to the PointNet++ backbone to enhance its feature-extraction ability; these features are then used for detection and recognition. Second, we revise the loss function: adding a GIoU (generalized intersection over union) loss and a focal loss makes the model easier to optimize.
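A minimal sketch of the kind of attention involved, written as plain scaled dot-product self-attention over per-point features in NumPy; the layer sizes and weight matrices here are illustrative only, not the thesis' actual module:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feats, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a set of point features.

    feats: (N, C) per-point features; w_q, w_k, w_v: (C, D) projections.
    Returns (N, D) refined features, each a weighted mix over all points.
    """
    q, k, v = feats @ w_q, feats @ w_k, feats @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])   # (N, N) pairwise affinities
    return softmax(scores, axis=-1) @ v      # attention-weighted sum

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))             # 8 points, 16-dim features
w = [rng.normal(size=(16, 16)) for _ in range(3)]
out = self_attention(feats, *w)
print(out.shape)  # (8, 16)
```

Because every output feature aggregates information from all input points, such a layer lets sparse surface points borrow context from the rest of the object, which is the motivation for adding it to the backbone.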
In the experiments, we use the modified VoteNet for 3D bounding-box estimation and recognition. The SUN RGB-D dataset contains 7,870 images, of which about 80% are used for training and the rest for testing. We train the model on an NVIDIA GeForce RTX 2080 Ti.
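The focal loss added to the classification branch follows the standard formulation of Lin et al. [34]; a minimal binary NumPy sketch (α = 0.25 and γ = 2 are that paper's defaults, not necessarily the settings used in this thesis):

```python
import numpy as np

def focal_loss(probs, labels, alpha=0.25, gamma=2.0):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    probs: predicted foreground probabilities in (0, 1); labels: 0 or 1.
    With gamma = 0 and alpha = 1 this reduces to plain cross-entropy.
    """
    p_t = np.where(labels == 1, probs, 1.0 - probs)
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# A confident, correct prediction is strongly down-weighted, so training
# focuses on the hard examples:
easy = focal_loss(np.array([0.95]), np.array([1]))
hard = focal_loss(np.array([0.30]), np.array([1]))
```

The (1 − p_t)^γ factor is what shrinks the contribution of well-classified points, which matters in detection because easy background proposals vastly outnumber object proposals.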
The original VoteNet model runs at an average of 11.20 frames per second, with an mAP of 66.13% at an IoU threshold of 0.25 and 43.87% at a threshold of 0.5. After a series of modifications and experimental analyses, our final network architecture runs at an average of 10.94 frames per second. Compared with the original VoteNet, it reaches 67.33% mAP at an IoU threshold of 0.25, a relative improvement of about 1.81%, and 48.19% mAP at a threshold of 0.5, a relative improvement of about 9.84%.
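Both the IoU thresholds in the evaluation above and the GIoU loss [33] rest on the same overlap computation; a minimal sketch for axis-aligned 3D boxes (oriented boxes, as implied by heading-angle prediction, need a more involved intersection, so this is illustrative only):

```python
import numpy as np

def iou_giou_3d(box_a, box_b):
    """IoU and GIoU for two axis-aligned 3D boxes.

    Boxes are (xmin, ymin, zmin, xmax, ymax, zmax).
    """
    a, b = np.asarray(box_a, float), np.asarray(box_b, float)
    # overlap extent per axis, clipped at 0 for disjoint boxes
    inter = np.prod(np.clip(np.minimum(a[3:], b[3:]) - np.maximum(a[:3], b[:3]), 0.0, None))
    union = np.prod(a[3:] - a[:3]) + np.prod(b[3:] - b[:3]) - inter
    iou = inter / union
    # volume of the smallest axis-aligned box enclosing both
    hull = np.prod(np.maximum(a[3:], b[3:]) - np.minimum(a[:3], b[:3]))
    return float(iou), float(iou - (hull - union) / hull)

# Identical boxes give IoU = GIoU = 1; disjoint boxes give IoU = 0 but a
# negative GIoU, so a loss of (1 - GIoU) still yields a useful gradient.
print(iou_giou_3d((0, 0, 0, 1, 1, 1), (0, 0, 0, 1, 1, 1)))  # (1.0, 1.0)
```

This is why GIoU is preferred as a regression loss: plain IoU is zero for all non-overlapping predictions, whereas GIoU still penalizes them in proportion to how far apart they are.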
[1] M. Everingham, L. V. Gool, C. K. Williams, J. Winn, and A. Zisserman, ''The pascal visual object classes (voc) challenge,'' Int. Journal of Computer Vision (IJCV), vol.88, is.2, pp.303-338, 2010.
[2] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, ''Microsoft coco: Common objects in context,'' arXiv:1405.0312.
[3] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F.-F. Li, ''Imagenet large scale visual recognition challenge,'' arXiv:1409.0575.
[4] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. of Neural Information Processing Systems (NIPS), Harrahs and Harveys, Lake Tahoe, NV, Dec.3-8, 2012, pp.1106-1114.
[5] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural networks,” in Proc. of ECCV Conf., Zurich, Switzerland, Sep.6-12, 2014, pp.818-833.
[6] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. of ICLR Conf., San Diego, CA, May.7-9, 2015, pp.1-14.
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, MA, Jun.7-12, 2015, pp.1-9.
[8] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, Jun.27-30, 2016, pp.770-778.
[9] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Proc. of Neural Information Processing Systems (NIPS), Montréal, Canada, Dec.7-12, 2015, pp.2377-2385.
[10] C. R. Qi, O. Litany, K. He, and L. J. Guibas, ''Deep Hough voting for 3D object detection in point clouds,'' in Proc. of IEEE Int. Conf. on Computer Vision (ICCV), Seoul, South Korea, Oct.27-Nov.2, 2019, pp.9276-9285.
[11] S. Song, S. P. Lichtenberg, and J. Xiao, “SUN RGB-D: a RGB-D scene understanding benchmark suite,” in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, MA, Jun.7-12, 2015, pp.567-576.
[12] A. Neubeck and L. Van Gool, "Efficient non-maximum suppression," in Proc. of IEEE Int. Conf. on Pattern Recognition (ICPR), Hong Kong, China, Aug.20-24, 2006, pp.850-855.
[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Columbus, Ohio, Jun.23-28, 2014, pp.580-587.
[14] J. Uijlings, K. Sande, T. Gevers, and A. Smeulders, “Selective search for object recognition,” Int. Journal of Computer Vision (IJCV), vol.104, is.2, pp.154-171, 2013.
[15] L. Andreone, F. Bellotti, A. D. Gloria, and R. Lauletta, ''SVM-based pedestrian recognition on near-infrared images,'' in Proc. 4th IEEE Int. Symp. on Image and Signal Processing and Analysis, Torino, Italy, Sep.15-17, 2005, pp.274-278.
[16] R. Girshick, "Fast R-CNN," in Proc. of IEEE Int. Conf. on Computer Vision (ICCV), Santiago, Chile, Dec.11-18, 2015, pp.1440-1448.
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol.37, no.9, pp.1904-1916, 2015.
[18] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol.39, no.6, pp.1137-1149, 2016.
[19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in Proc. European Conf. on Computer Vision (ECCV), Amsterdam, The Netherlands, Oct.8-16, 2016, pp.21-37.
[20] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp.779-788.
[21] J. Aceituno, R. Arnay, J. Toledo, and L. Acosta, “Using Kinect on an autonomous vehicle for outdoors obstacle detection,” IEEE Sensors Journal, vol.16, no.10, May 15, 2016.
[22] J. Choi, D. Kim, H. Yoo, and K. Sohn, “Rear obstacle detection system based on depth from Kinect,” in Proc. 15th Int. IEEE Conf. Intelligent Transportation Systems (ITSC), Anchorage, AK, Sep.16-19, 2012, pp.98-101.
[23] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in Proc. European Conf. on Computer Vision (ECCV), Florence, Italy, Oct.7-13, 2012, pp.746-760.
[24] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, "Learning rich features from RGB-D images for object detection and segmentation," in Proc. European Conf. on Computer Vision (ECCV), Zurich, Switzerland, Sep.6-12, 2014, pp.345-360.
[25] S. Song and J. Xiao, “Deep sliding shapes for amodal 3D object detection in RGB-D images,” in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, Jun.27-30, 2016, pp.808-816.
[26] Z. Deng and L. J. Latecki, "Amodal detection of 3D objects: inferring 3D bounding boxes from 2D ones in RGB-Depth images," in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, Jul.21-26, 2017, pp.398-406.
[27] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: deep learning on point sets for 3D classification and segmentation," in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, Jul.21-26, 2017, pp.77-85.
[28] C. R. Qi, L. Yi, H. Su, K. Mo, and L. J. Guibas, "PointNet++: deep hierarchical feature learning on point sets in a metric space," arXiv:1706.02413.
[29] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum pointnets for 3D object detection from RGB-D data," in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, Jun.18-23, 2018, pp.918-927.
[30] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, Jun.18-23, 2018, pp.7132-7141.
[31] J. Fu, J. Liu, H. Tian, and Y. Li, "Dual attention network for scene segmentation," in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, Jun.15-20, 2019, pp.3141-3149.
[32] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, "Self-attention generative adversarial networks," arXiv:1805.08318v2.
[33] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: a metric and a loss for bounding box regression,” in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, June.15-20, 2019, pp.658-666.
[34] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” arXiv:1708.02002.
[35] D. P. Kingma, and J. Ba, “Adam: a method for stochastic optimization,” arXiv:1412.6980.