| Graduate Student: | Pimpa Cheewaprakobkit |
|---|---|
| Thesis Title: | Advancing Single Object Tracking based on Fusion of Attention and Memory Dynamics |
| Advisors: | Timothy K. Shih, Chih-Yang Lin |
| Degree: | Doctor |
| Department: | College of Electrical Engineering & Computer Science - Department of Computer Science & Information Engineering |
| Year of Publication: | 2024 |
| Academic Year: | 112 (ROC) |
| Language: | English |
| Pages: | 59 |
| Keywords: | Temporal Convolutional Network, attention mechanism, spatial-temporal memory, single object tracking |
Deep neural networks have revolutionized the field of computer vision, leading to significant advancements in single object tracking. However, these networks still encounter challenges in dynamic environments where target objects undergo appearance changes and occlusions. Maintaining consistent tracking across extended periods, especially when faced with similar-looking background objects, remains a major challenge. The core difficulty in single object tracking arises from the frequent variations a target's appearance can undergo throughout a video sequence: changes in aspect ratio, scale, and pose can significantly degrade the robustness of trackers. Occlusions by other objects and cluttered backgrounds further complicate the task of maintaining a consistent track.
To address these challenges, this dissertation proposes a novel tracking architecture that leverages the combined strengths of a temporal convolutional network (TCN), an attention mechanism, and a spatial-temporal memory network. The TCN component plays a critical role by capturing temporal dependencies within the video sequence. This enables the model to learn how an object's appearance evolves over time, resulting in greater resilience to short-term appearance changes. Incorporating an attention mechanism offers a twofold benefit. First, it reduces the computational complexity of the model by enabling it to focus on the most relevant regions of the frame based on the current context, which is particularly advantageous in scenarios with cluttered backgrounds or multiple similar objects. Second, the attention mechanism directs the model's focus toward informative features that are critical for tracking the target object. The final component, the spatial-temporal memory network, leverages the power of long-term memory: it stores historical information about the target object, including its appearance and motion patterns. This stored information serves as a reference point for the tracker, allowing it to better adapt to target deformations and occlusions. By effectively combining these three components, our proposed architecture aims to achieve superior tracking performance compared to existing methods.
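The two core operations underlying the components above can be illustrated with a minimal NumPy sketch. This is not the dissertation's implementation: the function names (`dilated_causal_conv`, `attention_read`), single-layer setup, and tensor shapes are illustrative assumptions. A TCN stacks dilated causal convolutions so each output depends only on past frames, and both the attention mechanism and a read from a spatial-temporal memory can be viewed as the same query-key-value operation.

```python
import numpy as np

def dilated_causal_conv(x, w, dilation=1):
    """One dilated causal 1-D convolution, the basic TCN building block.

    x: (T, C_in) per-frame feature sequence; w: (K, C_in, C_out) kernel.
    Output at frame t depends only on x[t], x[t-d], x[t-2d], ...,
    so stacking layers with growing dilation widens the temporal
    receptive field without looking at future frames.
    """
    T, c_in = x.shape
    K, _, c_out = w.shape
    pad = dilation * (K - 1)                      # left-pad to stay causal
    xp = np.concatenate([np.zeros((pad, c_in)), x], axis=0)
    y = np.zeros((T, c_out))
    for t in range(T):
        for k in range(K):                        # tap k reaches k*d frames back
            y[t] += xp[t + pad - k * dilation] @ w[K - 1 - k]
    return y

def attention_read(query, keys, values):
    """Scaled dot-product attention: read `values` weighted by query-key similarity.

    The same operation serves as self-attention over frame features and
    as a lookup into a key-value memory of stored target appearances.
    """
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)          # (Tq, Tk) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over memory slots
    return weights @ values
```

In a tracker of this shape, per-frame features would pass through stacked `dilated_causal_conv` layers, and the current frame's features would act as queries against memory keys/values accumulated from earlier frames.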
The effectiveness of our approach is validated through extensive evaluations on several benchmark datasets, including GOT-10K, OTB2015, UAV123, and VOT2018. Our model achieves a state-of-the-art average overlap (AO) of 67.5% on GOT-10K, a success score (AUC) of 72.1% on OTB2015, a success score (AUC) of 65.8% on UAV123, and an accuracy of 59.0% on VOT2018.
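The average overlap (AO) reported for GOT-10K is the mean intersection-over-union (IoU) between predicted and ground-truth boxes across all frames. A minimal sketch of that metric follows; the helper names `iou` and `average_overlap` are ours, and boxes are assumed to be in `(x, y, w, h)` format:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def average_overlap(pred_boxes, gt_boxes):
    """Average overlap (AO): mean IoU over all frames of a sequence."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(overlaps) / len(overlaps)
```

For example, a prediction identical to the ground truth scores IoU 1.0, while a box shifted by half its width against an equal-sized target scores 1/3.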
The results highlight the superior tracking capabilities of our proposed approach in single object tracking tasks, demonstrating its potential to address the challenges posed by appearance variations and prolonged tracking scenarios. This research contributes to the advancement of tracking systems by offering a robust and adaptive solution that combines attention and memory dynamics to enhance tracking accuracy and robustness in complex real-world scenarios.