| Author: | 林香岑 Hsiang-Tsen Lin |
|---|---|
| Thesis title: | 基於Transformer架構的多行人追蹤 (Transformer-based Multiple Pedestrian Tracking) |
| Advisor: | 施國琛 |
| Committee: | |
| Degree: | 碩士 Master |
| Department: | 資訊電機學院 - 資訊工程學系 Department of Computer Science & Information Engineering |
| Year of publication: | 2022 |
| Academic year of graduation: | 110 |
| Language: | English |
| Pages: | 66 |
| Keywords (Chinese): | 多物件追蹤 (multiple object tracking) |
Multiple object tracking (MOT) is a popular research topic in machine learning. Its biggest challenge is maintaining stability when pedestrians overlap. Most solutions extract appearance or motion features of objects from the previous frames, correlate them with the objects in the current frame, and then match objects across frames with the Hungarian algorithm as a post-processing step.
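The frame-to-frame matching step described above can be sketched with SciPy's `linear_sum_assignment`, which implements the Hungarian algorithm. This is an illustrative example only: the cost here is a simple IoU-based distance on bounding boxes, not the appearance or motion similarity an actual tracker would use.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Toy data: tracks from the previous frame, detections in the current frame.
tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets   = [(21, 19, 31, 29), (1, 1, 11, 11)]

# Cost = 1 - IoU; the Hungarian algorithm finds the minimum-cost assignment.
cost = np.array([[1 - iou(t, d) for d in dets] for t in tracks])
row, col = linear_sum_assignment(cost)
matches = [(int(i), int(j)) for i, j in zip(row, col)]  # (track, detection)
# → [(0, 1), (1, 0)]: each track is matched to the detection it overlaps.
```

In a real tracker the cost matrix would combine feature similarity with geometric cues, and unmatched pairs above a cost threshold would be rejected.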
At the beginning of 2021, groups at MIT, Facebook, Google, and elsewhere brought the Transformer, a popular architecture from natural language processing, into object tracking; its accuracy surpassed existing models and triggered a wave of follow-up research. Although the Transformer attracted many adopters, the large training datasets and memory footprint it requires remain a headache for researchers and scholars.
In mid-2021, the first end-to-end tracking model built on the Transformer was published. Although its accuracy is not the best, the architecture is simple: the data association that previously had to be hand-designed is folded into the architecture itself, which reduces the error introduced by hand-crafted matching functions and makes the pipeline more intuitive. However, the end-to-end design tightly couples object detection to the subsequent object tracking. In this thesis we propose a new idea: feed the output of a YOLOv5 model into the Transformer as auxiliary input to stabilize the model during training. This not only speeds up convergence but also allows fewer stacked Transformer layers, lowering GPU memory requirements so that users with a single GPU can train the model easily.
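A minimal sketch of the proposed input scheme: detections from a YOLOv5-style detector are projected into embeddings that seed the Transformer, rather than learning object queries from scratch. The detection layout, the embedding width `d_model`, and the linear projection `W` are all hypothetical placeholders, not the thesis's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 256  # hypothetical Transformer embedding width

# Hypothetical YOLOv5-style output: one row per detection,
# (cx, cy, w, h, confidence), coordinates normalized to [0, 1].
detections = np.array([
    [0.21, 0.43, 0.05, 0.12, 0.91],
    [0.67, 0.40, 0.06, 0.13, 0.88],
])

# Project each detection into a query-like embedding for the Transformer,
# so training starts from detector evidence instead of random queries.
W = rng.standard_normal((5, d_model)) * 0.02  # illustrative projection
queries = detections @ W  # shape: (num_detections, d_model)
```

The intuition is that detector-derived queries give the Transformer a strong prior over where objects are, which is what allows a shallower stack to converge faster.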