

Student: Yi Tien (田繹)
Thesis Title: Design of a Reconfigurable Deep Neural Network Accelerator
Advisor: Jin-Fu Li (李進福)
Oral Examination Committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of Publication: 2020
Academic Year of Graduation: 108
Language: English
Pages: 80
Keywords: hardware accelerator; deep neural network; reconfigurable
  Deep convolutional neural networks (DCNNs) are widely used in artificial intelligence applications such as object recognition and image classification. Modern DCNNs are computation- and data-intensive; to meet the performance requirements of different applications, accelerators are used to execute DCNN computations. In this thesis, we propose an architecture exploration method for a DCNN inference system in which a DRAM stores the data and an accelerator executes the computation. The method defines the accelerator architecture by minimizing the difference between data transfer time and computation time. The accelerator consists of several clusters of processing elements (PEs), a reconfigurable memory unit, and a controller. A switch connects each PE cluster to the reconfigurable memory unit. The reconfigurable memory is composed of three static random access memories, each of which can be resized to fit the memory requirements of different convolutional layers. The configurations of the PE arrays and the reconfigurable memory are determined by a sublayer-based parameter decision flow. Compared with existing work, the proposed accelerator improves hardware utilization by 4.2% for convolutional layers and by 17.4% for whole DCNN models. Based on the proposed reconfigurable architecture, we implemented an accelerator for MobileNet V1 inference on a Xilinx ZCU-102 evaluation board; it contains 1092 KB of SRAM and four clusters of PE arrays, each with 8 PEs. Experimental results show that at a 150 MHz clock rate the accelerator achieves 144 GOPS and an inference rate of 40.1 images per second.


    Deep convolutional neural networks (DCNNs) are widely used in artificial intelligence applications, e.g., object recognition and image classification. A modern DCNN model usually requires a huge amount of computation and data. To meet the performance requirements of applications, an accelerator is usually designed to execute the computation of a DCNN.
    In this thesis, we consider a DCNN inference system that uses a DRAM to store data and an accelerator to execute the computation. An architecture exploration method based on minimizing the difference between DRAM data access time and computation time is proposed to define the accelerator architecture. The accelerator consists of multiple clusters of processing elements (PEs), a reconfigurable memory unit, and a controller. Each cluster of PEs is connected to the reconfigurable memory unit through a switch box. The reconfigurable memory unit consists of three static random access memories whose sizes can be changed dynamically to fit the requirements of different convolutional layers. The configurations of the PE arrays and the reconfigurable memory are determined by a sublayer-based parameter decision flow, which improves hardware resource utilization by 4.2% for convolutional layers and by 17.4% for the whole DCNN model in comparison with existing works. We implement the MobileNet V1 model on a Xilinx ZCU-102 evaluation board using the proposed reconfigurable accelerator architecture with 1092 KB of SRAM and four PE clusters, each containing 8 PEs. Experimental results show that 144 GOPS and 40.1 FPS can be achieved at a 100 MHz clock rate.
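The exploration criterion described above, sizing the compute array so that DRAM transfer time and computation time are balanced, can be sketched as follows. This is an illustrative model only, not the thesis's actual decision flow; the layer dimensions, DRAM bandwidth, and candidate PE counts below are hypothetical.

```python
# Illustrative sketch of a balance-driven architecture search: pick the PE
# count whose computation time is closest to the DRAM transfer time.

def compute_time_cycles(macs, num_pes, macs_per_pe_per_cycle=1):
    """Cycles needed to finish all multiply-accumulates with the given PEs."""
    return macs / (num_pes * macs_per_pe_per_cycle)

def transfer_time_cycles(bytes_moved, dram_bytes_per_cycle):
    """Cycles needed to move layer data (inputs + weights + outputs) over DRAM."""
    return bytes_moved / dram_bytes_per_cycle

def best_pe_count(macs, bytes_moved, dram_bytes_per_cycle, candidates):
    """Candidate PE count minimizing |compute time - transfer time|."""
    t_mem = transfer_time_cycles(bytes_moved, dram_bytes_per_cycle)
    return min(candidates,
               key=lambda p: abs(compute_time_cycles(macs, p) - t_mem))

# Hypothetical layer: 3x3 convolution, 112x112x32 input -> 112x112x64 output,
# 16-bit data, DRAM delivering 8 bytes per cycle.
macs = 112 * 112 * 64 * 3 * 3 * 32
bytes_moved = (112 * 112 * 32 + 3 * 3 * 32 * 64 + 112 * 112 * 64) * 2
print(best_pe_count(macs, bytes_moved, 8, [8, 16, 32, 64, 128]))
```

A compute-heavy layer like this one pushes the search toward the largest candidate, since even 128 PEs leave computation slower than the DRAM transfer.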

    1 Introduction
      1.1 Deep Neural Network
      1.2 Deep Convolutional Neural Network Accelerator Architecture
      1.3 Previous Work
        1.3.1 Single Instruction Multiple Data Stream DNN Accelerator Architecture
        1.3.2 Systolic Array DNN Accelerator Architecture
      1.4 Motivation
      1.5 Contribution
      1.6 Thesis Organization
    2 Architecture Exploration of Reconfigurable DNN Accelerator
      2.1 DNN Inference System
      2.2 Roofline Model Analysis and Sublayer-Based Pipeline Inference Flow
      2.3 For Loop Analysis of Convolution Operation
      2.4 Sublayer-Based Parameters Decision Flow
        2.4.1 Tiling Factors Analysis
        2.4.2 Loop Unrolling Factors Analysis
        2.4.3 On-Chip Memory Analysis
      2.5 Analysis Results
    3 Proposed Reconfigurable DNN Accelerator Architecture
      3.1 Architecture of Reconfigurable DNN Accelerator
      3.2 Micro Instruction Set
      3.3 On-Chip Interconnections and On-Chip Data Reuse
        3.3.1 Input Feature Map Data Reuse Strategy
        3.3.2 Weight Data Reuse Strategy
        3.3.3 Output Feature Map Data Flow
      3.4 PE Clusters
      3.5 Subsampling and Fully-Connected Layers
      3.6 Analysis Results
    4 A Case Study of MobileNet V1
      4.1 MobileNet V1
      4.2 Validation Platform
        4.2.1 Platform Resources of Xilinx ZCU-102 Evaluation Board
        4.2.2 AXI-4 Interface Protocol
      4.3 Implementation Details
        4.3.1 VIVADO Block Diagram
        4.3.2 PE Array Clusters and Reconfigurable On-Chip Memory
      4.4 Implementation Results
    5 Conclusion and Future Work
      5.1 Conclusion
      5.2 Future Work
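The loop analysis in Chapter 2 reasons about the standard convolution loop nest with tiling over the channel dimensions. A minimal pure-Python sketch, assuming generic tiling factors `Tm`/`Tn` over output and input channels (the thesis's exact notation, loop order, and factor set may differ):

```python
# Illustrative tiled direct convolution (stride 1, no padding). The outer m0/n0
# loops walk channel tiles; the inner loops are what loop unrolling maps onto PEs.

def conv_tiled(ifmap, weights, Tm=2, Tn=2):
    """ifmap[n][h][w]: N input channels of size HxW.
    weights[m][n][kh][kw]: M output channels, KxK kernels.
    Returns out[m][r][c] of size M x (H-K+1) x (W-K+1)."""
    N, H, W = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    M, K = len(weights), len(weights[0][0])
    OH, OW = H - K + 1, W - K + 1
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(M)]
    for m0 in range(0, M, Tm):                  # tile over output channels
        for n0 in range(0, N, Tn):              # tile over input channels
            for m in range(m0, min(m0 + Tm, M)):
                for n in range(n0, min(n0 + Tn, N)):
                    for r in range(OH):         # output rows
                        for c in range(OW):     # output columns
                            acc = 0.0
                            for kh in range(K):
                                for kw in range(K):
                                    acc += (ifmap[n][r + kh][c + kw]
                                            * weights[m][n][kh][kw])
                            out[m][r][c] += acc
    return out
```

The tiling factors bound how much of each feature map and weight tensor must sit in on-chip memory at once, which is the link between Sections 2.4.1 and 2.4.3.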

