| Author: | 徐麒惟 Chi-Wei Hsu |
|---|---|
| Title: | A Precision Reconfigurable Processing Element Design for Quantized Deep Neural Networks: A Hierarchical Approach |
| Advisor: | 周景揚 Jing-Yang Jou |
| Committee: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2021 |
| Academic Year: | 109 |
| Language: | English |
| Pages: | 54 |
| Keywords: | Quantized Neural Networks (QNN), Processing Element (PE), Reconfigurable Design |
Convolutional Neural Networks (CNNs) are developing rapidly today and are widely applied to image recognition, self-driving cars, object detection, and other tasks. When deploying a CNN, accuracy and data size are two key metrics for evaluating performance and computational efficiency. Conventional CNNs mostly compute with 32-bit floating-point numbers to maintain a high level of accuracy. However, 32-bit floating-point arithmetic requires 32-bit multiply-and-accumulate (MAC) units, which not only creates a bottleneck in computational efficiency but also raises power consumption substantially, so researchers today are devoted to finding ways to reduce the amount of data and thereby accelerate computation. Quantization is one method that reduces the data size, and with it the computational complexity, to gain speedup without losing too much accuracy. In a CNN, the number of bits each layer needs differs, so to strike a better trade-off between computational efficiency and accuracy, operations of different bit widths are applied to different layers of the network. Under this premise, a processing element (PE) with adjustable precision can support operations of different bit widths, such as 8-bit x 8-bit, 8-bit x 4-bit, 4-bit x 4-bit, and 2-bit x 2-bit. The architecture we propose is hierarchical, which eliminates some redundant hardware during the design process and reduces the overall chip area; to raise the computation speed, our proposed 8-bit x 8-bit PE supports two-stage pipelining. In our experiments we adopt a 90nm process, and the results show that, compared with prior work, our 2-bit x 2-bit PE reduces area by 57.5% - 68%, while in the 8-bit x 8-bit PE the pipelined architecture lets the 8-bit x 8-bit operation run at a speed comparable to that of the 4-bit x 4-bit PE.
In the field of deep learning, Convolutional Neural Networks (CNNs) have achieved significant success in areas such as visual imagery analysis, self-driving cars, and object detection. Data size and accuracy are the two major metrics for evaluating how efficient and effective a system's computation is. Conventional CNN models frequently use 32-bit data to maintain high accuracy. However, performing large numbers of 32-bit multiply-and-accumulate (MAC) operations incurs significant computing effort as well as power consumption. Researchers have therefore developed various methods to reduce data size and speed up computation. Quantization is one such technique: it reduces the number of bits in the data, and with it the computational complexity, at the cost of some accuracy loss. To provide a better trade-off between computation effort and accuracy, different bit widths may be applied to different layers within a CNN model, so a flexible processing element (PE) that supports operations of multiple bit widths is in demand. In this work, we propose a hierarchy-based reconfigurable PE structure that supports 8-bit x 8-bit, 8-bit x 4-bit, 4-bit x 4-bit, and 2-bit x 2-bit operations. The hierarchical structure avoids redundant hardware in the design. To improve calculation speed, our 8-bit x 8-bit PE applies a two-stage pipeline. Experimental results with 90nm technology show that our 2-bit x 2-bit PE saves 57.5% to 60% of the area compared to a precision-scalable accelerator, and that the two-stage pipeline lets the 8-bit x 8-bit PE maintain almost the same calculation speed as the 4-bit x 4-bit PE.
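The decomposition behind the hierarchical approach can be sketched in software: an 8-bit x 8-bit product is assembled from four 4-bit x 4-bit partial products, each of which in turn reuses four 2-bit x 2-bit base multipliers, so the same hardware can serve either one wide multiplication or several narrow ones. This is a minimal Python sketch of that idea, assuming unsigned operands; the function names are illustrative and do not come from the thesis, which implements the structure in hardware.

```python
def split(x, half_bits):
    """Split an unsigned value into (high, low) halves of half_bits each."""
    mask = (1 << half_bits) - 1
    return x >> half_bits, x & mask

def mul_hier(a, b, bits):
    """Multiply two unsigned `bits`-wide operands by composing four
    half-width multiplications, recursing down to 2-bit base multipliers:
    a*b = (aH*bH << bits) + (aH*bL << bits/2) + (aL*bH << bits/2) + aL*bL."""
    if bits == 2:
        return a * b  # base 2-bit x 2-bit multiplier
    h = bits // 2
    ah, al = split(a, h)
    bh, bl = split(b, h)
    # Four half-precision partial products, shifted into place.
    return ((mul_hier(ah, bh, h) << (2 * h)) +
            (mul_hier(ah, bl, h) << h) +
            (mul_hier(al, bh, h) << h) +
            mul_hier(al, bl, h))

# One 8x8 product uses the same sub-multipliers that could instead
# serve four independent 4x4 products in a reconfigurable PE.
assert mul_hier(173, 201, 8) == 173 * 201
assert mul_hier(11, 13, 4) == 11 * 13
```

The recursion makes the hierarchy explicit: removing the shift-and-add glue at one level exposes four narrower multipliers, which is why the reconfigurable design can share hardware across precisions instead of duplicating it.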