

Student: Yi Tien (田繹)
Thesis Title: Design of a Reconfigurable Deep Neural Network Accelerator
Advisor: Jin-Fu Li (李進福)
Oral Examination Committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of Publication: 2020
Academic Year of Graduation: 108
Language: English
Pages: 80
Keywords: hardware accelerator; deep neural network; reconfigurable
  Deep convolutional neural networks (DCNNs) are widely used in artificial intelligence applications such as object recognition and image classification. Modern DCNNs are computation- and data-intensive; to meet the performance requirements of different applications, accelerators are used to execute DCNN computations. In this thesis, we propose an architecture exploration method for a DCNN inference system in which a DRAM stores the data and an accelerator executes the computation. The method defines the accelerator architecture by minimizing the difference between data transfer time and computation time. The accelerator consists of several clusters of processing elements (PEs), a reconfigurable memory unit, and a controller. A switch connects each PE cluster to the reconfigurable memory unit. The reconfigurable memory is composed of three static random access memories, each of which can be resized to fit the memory requirements of different convolutional layers. The configurations of the PE arrays and the reconfigurable memory are determined by a sublayer-based parameter decision flow. Compared with existing work, the proposed accelerator improves hardware utilization by 4.2% for convolutional layers and by 17.4% for whole DCNN models. Based on the proposed reconfigurable architecture, we implemented an accelerator for MobileNet V1 inference on a Xilinx ZCU-102 evaluation board; it contains 1092 KB of SRAM and four clusters of PE arrays, each with 8 PEs. Experimental results show that at a 150 MHz clock rate the accelerator achieves 144 GOPS and an inference rate of 40.1 images per second.


    Deep convolutional neural networks (DCNNs) are widely used in artificial intelligence applications, e.g., object recognition and image classification. A modern DCNN model usually requires a huge amount of computation and data. To meet the performance requirements of applications, an accelerator is usually designed to execute the computation of a DCNN.
    In this thesis, we consider a DCNN inference system that uses a DRAM to store data and an accelerator to execute the computation. An architecture exploration method based on minimizing the difference between DRAM data access time and computation time is proposed to define the accelerator architecture. The accelerator consists of multiple clusters of processing elements (PEs), a reconfigurable memory unit, and a controller. Each cluster of PEs is connected to the reconfigurable memory unit through a switch box. The reconfigurable memory unit consists of three static random access memories whose sizes can be changed dynamically to fit the requirements of different convolutional layers. The configurations of the PE arrays and the reconfigurable memory are determined by a sublayer-based parameter decision flow, which improves hardware resource utilization by 4.2% for convolutional layers and by 17.4% for the whole DCNN model in comparison with existing works. We implement the MobileNet V1 model on a Xilinx ZCU-102 evaluation board using the proposed reconfigurable accelerator architecture with 1092 KB of SRAM and four PE clusters, each containing 8 PEs. Experimental results show that 144 GOPS and 40.1 FPS can be achieved at a 100 MHz clock rate.
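The exploration criterion described above, sizing the compute array so that DRAM transfer time and computation time are balanced, can be sketched as follows. This is an illustrative model only, not the thesis's actual decision flow; the layer dimensions, DRAM bandwidth, and candidate PE counts below are hypothetical.

```python
# Illustrative sketch of a balance-driven architecture search: pick the PE
# count whose computation time is closest to the DRAM transfer time.

def compute_time_cycles(macs, num_pes, macs_per_pe_per_cycle=1):
    """Cycles needed to finish all multiply-accumulates with the given PEs."""
    return macs / (num_pes * macs_per_pe_per_cycle)

def transfer_time_cycles(bytes_moved, dram_bytes_per_cycle):
    """Cycles needed to move layer data (inputs + weights + outputs) over DRAM."""
    return bytes_moved / dram_bytes_per_cycle

def best_pe_count(macs, bytes_moved, dram_bytes_per_cycle, candidates):
    """Candidate PE count minimizing |compute time - transfer time|."""
    t_mem = transfer_time_cycles(bytes_moved, dram_bytes_per_cycle)
    return min(candidates,
               key=lambda p: abs(compute_time_cycles(macs, p) - t_mem))

# Hypothetical layer: 3x3 convolution, 112x112x32 input -> 112x112x64 output,
# 16-bit data, DRAM delivering 8 bytes per cycle.
macs = 112 * 112 * 64 * 3 * 3 * 32
bytes_moved = (112 * 112 * 32 + 3 * 3 * 32 * 64 + 112 * 112 * 64) * 2
print(best_pe_count(macs, bytes_moved, 8, [8, 16, 32, 64, 128]))
```

A compute-heavy layer like this one pushes the search toward the largest candidate, since even 128 PEs leave computation slower than the DRAM transfer.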

    1 Introduction
      1.1 Deep Neural Network
      1.2 Deep Convolutional Neural Network Accelerator Architecture
      1.3 Previous Work
        1.3.1 Single Instruction Multiple Data Stream DNN Accelerator Architecture
        1.3.2 Systolic Array DNN Accelerator Architecture
      1.4 Motivation
      1.5 Contribution
      1.6 Thesis Organization
    2 Architecture Exploration of Reconfigurable DNN Accelerator
      2.1 DNN Inference System
      2.2 Roofline Model Analysis and Sublayer-Based Pipeline Inference Flow
      2.3 For Loop Analysis of Convolution Operation
      2.4 Sublayer-Based Parameters Decision Flow
        2.4.1 Tiling Factors Analysis
        2.4.2 Loop Unrolling Factors Analysis
        2.4.3 On-Chip Memory Analysis
      2.5 Analysis Results
    3 Proposed Reconfigurable DNN Accelerator Architecture
      3.1 Architecture of Reconfigurable DNN Accelerator
      3.2 Micro Instruction Set
      3.3 On-Chip Interconnections and On-Chip Data Reuse
        3.3.1 Input Feature Map Data Reuse Strategy
        3.3.2 Weight Data Reuse Strategy
        3.3.3 Output Feature Map Data Flow
      3.4 PE Clusters
      3.5 Subsampling and Fully-Connected Layers
      3.6 Analysis Results
    4 A Case Study of MobileNet V1
      4.1 MobileNet V1
      4.2 Validation Platform
        4.2.1 Platform Resources of Xilinx ZCU-102 Evaluation Board
        4.2.2 AXI-4 Interface Protocol
      4.3 Implementation Details
        4.3.1 VIVADO Block Diagram
        4.3.2 PE Array Clusters and Reconfigurable On-Chip Memory
      4.4 Implementation Results
    5 Conclusion and Future Work
      5.1 Conclusion
      5.2 Future Work
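The loop analysis in Chapter 2 reasons about the standard convolution loop nest with tiling over the channel dimensions. A minimal pure-Python sketch, assuming generic tiling factors `Tm`/`Tn` over output and input channels (the thesis's exact notation, loop order, and factor set may differ):

```python
# Illustrative tiled direct convolution (stride 1, no padding). The outer m0/n0
# loops walk channel tiles; the inner loops are what loop unrolling maps onto PEs.

def conv_tiled(ifmap, weights, Tm=2, Tn=2):
    """ifmap[n][h][w]: N input channels of size HxW.
    weights[m][n][kh][kw]: M output channels, KxK kernels.
    Returns out[m][r][c] of size M x (H-K+1) x (W-K+1)."""
    N, H, W = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    M, K = len(weights), len(weights[0][0])
    OH, OW = H - K + 1, W - K + 1
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(M)]
    for m0 in range(0, M, Tm):                  # tile over output channels
        for n0 in range(0, N, Tn):              # tile over input channels
            for m in range(m0, min(m0 + Tm, M)):
                for n in range(n0, min(n0 + Tn, N)):
                    for r in range(OH):         # output rows
                        for c in range(OW):     # output columns
                            acc = 0.0
                            for kh in range(K):
                                for kw in range(K):
                                    acc += (ifmap[n][r + kh][c + kw]
                                            * weights[m][n][kh][kw])
                            out[m][r][c] += acc
    return out
```

The tiling factors bound how much of each feature map and weight tensor must sit in on-chip memory at once, which is the link between Sections 2.4.1 and 2.4.3.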

