Design of a Convolutional Neural Network Hardware Accelerator Based on FPGA
doi: 10.11999/JEIT190058
School of Electronics and Information Engineering, South China University of Technology, Guangzhou 510641, China
Funds: The Science and Technology Project of Guangdong Province (2014B090910002)
Abstract: Considering the large computational complexity and long computation time of Convolutional Neural Networks (CNN), a Field-Programmable Gate Array (FPGA)-based CNN hardware accelerator is proposed. First, by analyzing the forward-computation principle of the convolutional layer and exploring the parallelism of its operations, a hardware architecture with input-channel parallelism, output-channel parallelism, and deep pipelining of the convolution window is designed. Within this architecture, a fully parallel multiply-add tree module is designed to accelerate the convolution operations, and an efficient window buffer module to implement the pipelined processing of convolution windows. Experimental results show that the proposed accelerator achieves an energy-efficiency ratio of 32.73 GOPS/W, 34% higher than existing solutions, while delivering a performance of 317.86 GOPS.

Keywords:
- Convolutional Neural Network (CNN)
- Hardware acceleration
- Field-Programmable Gate Array (FPGA)
- Computational parallelism
- Deep pipelining
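The core datapath named in the abstract is the fully parallel multiply-add tree: all products of a convolution window are computed simultaneously and then reduced through adder stages of logarithmic depth, so one window result is produced per cycle once the pipeline fills. Below is a minimal behavioral sketch in C++, assuming a 3×3 window (conv layer 1 in Table 1) and the 16-bit fixed-point quantization listed in Table 3; the name `mac_tree` and the 32-bit accumulator width are illustrative assumptions, not taken from the paper.

```cpp
#include <array>
#include <cstdint>
#include <iostream>

constexpr int K = 3;  // kernel size of conv layer 1 in Table 1

// Behavioral model of a fully parallel multiply-add tree: one multiplier per
// window element, followed by a pairwise adder-tree reduction. 16-bit
// fixed-point operands with a 32-bit accumulator are assumed.
int32_t mac_tree(const std::array<int16_t, K * K>& window,
                 const std::array<int16_t, K * K>& kernel) {
    // Stage 0: all K*K products are formed in parallel.
    std::array<int32_t, K * K> stage{};
    for (int i = 0; i < K * K; ++i)
        stage[i] = int32_t(window[i]) * int32_t(kernel[i]);

    // Stages 1..ceil(log2(K*K)): pairwise reduction; an odd leftover
    // element is carried forward to the next stage.
    for (int n = K * K; n > 1; ) {
        int half = n / 2;
        for (int i = 0; i < half; ++i)
            stage[i] = stage[2 * i] + stage[2 * i + 1];
        if (n % 2) { stage[half] = stage[n - 1]; ++half; }
        n = half;
    }
    return stage[0];
}

int main() {
    std::array<int16_t, K * K> w{}, k{};
    w.fill(2);
    k.fill(3);
    std::cout << mac_tree(w, k) << '\n';  // expect 9 * (2*3) = 54
}
```

In hardware, each reduction level corresponds to one registered pipeline stage, and replicating the tree across input and output channels gives the channel-level parallelism the architecture describes.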
Table 1  Structural parameters of the convolutional neural network

Layer              | Structure                              | Parameters
Conv layer 1       | 3×3 kernels, 15 kernels, stride 1      | 150
Activation layer 1 | none                                   | 0
Pooling layer 1    | 2×2 pooling, stride 2                  | 0
Conv layer 2       | 6×6 kernels, 20 kernels, stride 1      | 10820
Activation layer 2 | none                                   | 0
Pooling layer 2    | 2×2 pooling, stride 2                  | 0
Fully connected    | 10 output neurons                      | 3210
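The parameter counts in Table 1 follow from kernel_height × kernel_width × input_channels × kernel_count, plus one bias per kernel. A short check, assuming a single-channel input to conv layer 1; the 28×28 input size is an assumption not stated in this excerpt, but it is consistent with the table, since it yields the 320 flattened inputs implied by the 3210 fully connected parameters.

```cpp
#include <iostream>

// Parameters of a conv layer: k*k weights per input channel per kernel,
// plus one bias per kernel.
int conv_params(int k, int cin, int cout) { return k * k * cin * cout + cout; }

int main() {
    std::cout << "conv1: " << conv_params(3, 1, 15)  << '\n';  // 150
    std::cout << "conv2: " << conv_params(6, 15, 20) << '\n';  // 10820
    // Feature map (assumed 28x28 input, no padding):
    // 28 -> conv 3x3 -> 26 -> pool 2x2 -> 13 -> conv 6x6 -> 8 -> pool 2x2 -> 4
    int flat = 4 * 4 * 20;                                     // 320 inputs
    std::cout << "fc:    " << flat * 10 + 10 << '\n';          // 3210
}
```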
Table 3  Comparison with FPGA hardware accelerators in the literature

                           | Ref. [7]     | Ref. [11]    | Ref. [12]       | This work
FPGA                       | Zynq XC7Z045 | Zynq XC7Z045 | Virtex-7 VX690T | Cyclone V 5CGXF
Frequency (MHz)            | 150          | 100          | 150             | 100
DSP resources              | 780 (86.7%)  | 824 (91.6%)  | 1376 (38%)      | 342 (100%)
Quantization               | 16-bit fixed | 16-bit fixed | 16-bit fixed    | 16-bit fixed
Power (W)                  | 9.630        | 9.400        | 25.000          | 9.711
Performance (GOPS)         | 136.97       | 229.50       | 570.00          | 317.86
Energy efficiency (GOPS/W) | 14.22        | 24.42        | 22.80           | 32.73
References

[1] LIU Weibo, WANG Zidong, LIU Xiaohui, et al. A survey of deep neural network architectures and their applications[J]. Neurocomputing, 2017, 234: 11–26. doi: 10.1016/j.neucom.2016.12.038
[2] HAN Song, MAO Huizi, and DALLY W J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding[J]. arXiv preprint arXiv:1510.00149, 2015.
[3] COATES A, HUVAL B, WANG Tao, et al. Deep learning with COTS HPC systems[C]. Proceedings of the 30th International Conference on Machine Learning, Atlanta, USA, 2013: III-1337–III-1345.
[4] JOUPPI N P, YOUNG C, PATIL N, et al. In-datacenter performance analysis of a tensor processing unit[C]. Proceedings of the 44th Annual International Symposium on Computer Architecture, Toronto, Canada, 2017: 1–12. doi: 10.1145/3079856.3080246
[5] MOTAMEDI M, GYSEL P, AKELLA V, et al. Design space exploration of FPGA-based deep convolutional neural networks[C]. Proceedings of the 21st Asia and South Pacific Design Automation Conference, Macau, China, 2016: 575–580. doi: 10.1109/ASPDAC.2016.7428073
[6] ZHANG Jialiang and LI Jing. Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network[C]. Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, USA, 2017: 25–34. doi: 10.1145/3020078.3021698
[7] QIU Jiantao, WANG Jie, YAO Song, et al. Going deeper with embedded FPGA platform for convolutional neural network[C]. Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, USA, 2016: 26–35. doi: 10.1145/2847263.2847265
[8] YU Qi. Deep learning accelerator design and implementation based on FPGA[D]. [Master dissertation], University of Science and Technology of China, 2016: 30–38.
[9] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278–2324. doi: 10.1109/5.726791
[10] ABADI M, BARHAM P, CHEN Jianmin, et al. TensorFlow: A system for large-scale machine learning[C]. Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, Savannah, USA, 2016: 265–283.
[11] XIAO Qingcheng, LIANG Yun, LU Liqiang, et al. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs[C]. Proceedings of the 54th Annual Design Automation Conference, Austin, USA, 2017: 62. doi: 10.1145/3061639.3062244
[12] SHEN Junzhong, HUANG You, WANG Zelong, et al. Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA[C]. Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, USA, 2018: 97–106. doi: 10.1145/3174243.3174257