| Student: | S P Kasthuri Arachchi (珊芝莉) |
|---|---|
| Thesis title: | Modelling Spatial-Motion Multimodal Deep Learning Approaches to Classify Dynamic Patterns of Videos (以多模態時空域建模的深度學習方法分類影像中的動態模式) |
| Advisor: | Prof. Timothy K. Shih (施國琛) |
| Committee members: | |
| Degree: | Doctoral (博士) |
| Department: | Department of Computer Science & Information Engineering, College of Electrical Engineering and Computer Science (資訊電機學院資訊工程學系) |
| Year of publication: | 2020 |
| Academic year of graduation: | 108 (ROC calendar) |
| Language: | English |
| Pages: | 160 |
| Chinese keywords: | 動態圖形分類、深度學習、時空數據、卷積神經網路、循環神經網路 |
| Keywords: | Dynamic Pattern Classification, Deep Learning, Spatiotemporal Data, Convolutional Neural Network, Recurrent Neural Network |
對於電腦視覺而言,影片分類是分析影片內容語意訊息的重要過程。本論文改良常見之深度學習分類模型,提出適用於影片動態模式分類之多模態深度學習方法。當影片處於不同照明等嚴苛環境下,傳統方法所使用之手工特徵不足且沒有效率,尤其是針對內容複雜的影片。先前的影片分類研究主要專注於各獨立影片流本身之關聯性;本論文則以深度學習為策略,成功提升影片分類之準確率。多數深度學習模型以卷積神經網路與長短期記憶網路為基底,可用於物件與行為之分類,並在影片動態模式分類任務中有很好的表現。
首先,本論文考慮單流網路及底層實驗模型,包含卷積神經網路(CNN)、長短期記憶網路(LSTM)及閘控循環單元(GRU)。在LSTM與GRU模型中,各層參數與Dropout值皆經由最佳化調整而得。本研究比較以下三個模型之準確率:(1) LRCN:將卷積層與長程時間遞歸相結合;(2) seqLSTMs:對序列數據建模最有效之結構之一;(3) seqGRUs:運算量較LSTM更少。
其次,為了考量空間與運動之關係,本論文提出以RGB影像及光流影像之雙流輸入為主的新穎模型,稱為狀態交換長短期記憶(SE-LSTM),此為本論文之主要貢獻。藉由SE-LSTM,可整合短期運動、空間與長期時間資訊,完成影片動態模式之分類任務;其透過交換外觀流與運動流前一單元之細胞狀態資訊來擴展LSTM。此外,本論文提出將SE-LSTM與CNN結合之雙流模型Dual-CNNSELSTM。為驗證SE-LSTM架構之表現,本論文以煙火、手勢與人類動作等各類影片資料集進行驗證。實驗結果證明,所提出之雙流Dual-CNNSELSTM架構性能明顯優於其他單流與雙流基準模型,在手勢、煙火與HMDB51人類動作資料集上分別達到81.62%、79.87%與69.86%之準確率。因此,整體結果顯示所提出之模型最適合靜態背景之動態模式分類,表現超越基準模型及Dual-3DCNNLSTM模型。
Video classification is an essential process for analyzing the pervasive semantic information of video content in computer vision. This thesis presents multimodal deep learning approaches to classifying the dynamic patterns of videos, beyond common types of pattern classification. Traditional handcrafted features are insufficient for classifying complex video information because visually similar content can appear under different illumination conditions. Prior studies of video classification focused on the relationships within the standalone streams themselves. In contrast, this study leverages deep learning methodologies to improve video analysis performance significantly. Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks are widely used to build complex models and have shown great competency in modeling temporal dynamics in video-based pattern classification.
First, single-stream networks and the underlying experimental models, consisting of CNN, LSTM, and Gated Recurrent Unit (GRU) layers, are considered. Their layer parameters are fine-tuned, and different dropout values are used with the sequence LSTM and GRU models. This study compares the accuracy of three basic models: (1) the Long-term Recurrent Convolutional Network (LRCN), which combines convolutional layers with long-range temporal recursion; (2) the seqLSTM model, one of the most effective structures for modeling sequential data; and (3) the seqGRU model, which requires fewer computational steps than LSTM.
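The cost difference between the seqLSTM and seqGRU recurrent cells can be seen in a minimal NumPy sketch (illustrative only, not the thesis's tuned models): an LSTM step computes four gate blocks plus a separate cell state, while a GRU step computes three gate blocks with no cell state, so a GRU does less work per time step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: four gate blocks (input, forget, output, candidate)
    stacked in W (4H x D), U (4H x H), b (4H,)."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c_new = f * c + i * g              # separate long-term cell state
    return o * np.tanh(c_new), c_new

def gru_step(x, h, Wzr, Uzr, Wh, Uh, b):
    """One GRU step: three gate blocks and no separate cell state,
    hence fewer parameters and operations than an LSTM step."""
    H = h.shape[0]
    zr = sigmoid(Wzr @ x + Uzr @ h + b[:2*H])
    z, r = zr[:H], zr[H:]              # update and reset gates
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + b[2*H:])
    return (1 - z) * h + z * h_tilde   # hidden state doubles as memory

D, H = 16, 8                           # toy feature and hidden sizes
lstm_params = 4 * H * (D + H + 1)      # 4 gate blocks -> 800 parameters
gru_params = 3 * H * (D + H + 1)       # 3 gate blocks -> 600 parameters
print(lstm_params, gru_params)         # 800 600
```

The same 4-vs-3 gate-block ratio carries over to the per-step multiply-add count, which is why the seqGRU baseline trains faster at comparable hidden sizes.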
Secondly, to capture spatial-motion relationships, a two-stream network architecture taking both RGB and optical flow data as input is used. As the main contribution of this work, a novel two-stream neural network concept, named State-Exchanging Long Short-Term Memory (SE-LSTM), is introduced. By modeling spatial-motion state exchange, the SE-LSTM can classify dynamic patterns of videos by integrating short-term motion, spatial, and long-term temporal information. The SE-LSTM extends the general-purpose LSTM by exchanging information through the previous cell states of both the appearance and motion streams. Furthermore, a novel two-stream model, Dual-CNNSELSTM, combining the SE-LSTM concept with a CNN, is proposed. Various video datasets, including firework displays, hand gestures, and human actions, are used to validate the proposed SE-LSTM architecture. Experimental results demonstrate that the proposed two-stream Dual-CNNSELSTM architecture significantly outperforms other single- and two-stream baseline models, achieving accuracies of 81.62%, 79.87%, and 69.86% on the hand gestures, firework displays, and HMDB51 human actions datasets, respectively. The overall results therefore indicate that the proposed model is best suited to dynamic pattern classification against static backgrounds, outperforming the baseline and Dual-3DCNNLSTM models.
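The state-exchange idea can be sketched in a few lines of NumPy. This is a hypothetical reading of SE-LSTM, not the thesis's exact equations: here each stream's cell update simply averages its own previous cell state with the other stream's previous cell state, so appearance (RGB) and motion (optical-flow) memories mix at every time step. All names and sizes below are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_params(input_dim, hidden, rng):
    """Random LSTM parameters for one stream (illustrative sizes only)."""
    p = {}
    for g in "ifog":  # input, forget, output, candidate gates
        p["W" + g] = 0.1 * rng.standard_normal((hidden, input_dim))
        p["U" + g] = 0.1 * rng.standard_normal((hidden, hidden))
        p["b" + g] = np.zeros(hidden)
    return p

def se_lstm_step(x, h, c_own, c_other, p):
    """One SE-LSTM step for a single stream: the cell update mixes the
    stream's own previous cell state with the other stream's (a
    hypothetical form of the state-exchange mechanism)."""
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h + p["bi"])
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h + p["bf"])
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h + p["bo"])
    g = np.tanh(p["Wg"] @ x + p["Ug"] @ h + p["bg"])
    c_mix = 0.5 * (c_own + c_other)    # exchange of previous cell states
    c_new = f * c_mix + i * g
    return o * np.tanh(c_new), c_new

# Two streams: appearance (RGB features) and motion (optical-flow features)
rng = np.random.default_rng(0)
D, H = 16, 8                           # toy feature and hidden sizes
p_app, p_mot = make_params(D, H, rng), make_params(D, H, rng)
h_a = c_a = h_m = c_m = np.zeros(H)
for _ in range(5):                     # five toy time steps
    x_a, x_m = rng.standard_normal(D), rng.standard_normal(D)
    h_a_new, c_a_new = se_lstm_step(x_a, h_a, c_a, c_m, p_app)
    h_m_new, c_m_new = se_lstm_step(x_m, h_m, c_m, c_a, p_mot)
    h_a, c_a, h_m, c_m = h_a_new, c_a_new, h_m_new, c_m_new
print(h_a.shape, h_m.shape)            # (8,) (8,)
```

In a full Dual-CNNSELSTM pipeline the inputs `x_a` and `x_m` would be CNN features of an RGB frame and its optical-flow image rather than random vectors; the sketch only shows how the two recurrent memories can be coupled.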