Design of a Convolutional Neural Network Hardware Accelerator Based on FPGA
doi: 10.11999/JEIT190058
School of Electronics and Information Engineering, South China University of Technology, Guangzhou 510641, China
Funds: The Science and Technology Project of Guangdong Province (2014B090910002)
Abstract: Considering the large computational complexity and long computation time of Convolutional Neural Networks (CNN), a Field-Programmable Gate Array (FPGA)-based CNN hardware accelerator is proposed. First, by analyzing the forward-computation principle of the convolutional layer and exploring the parallelism of its operations, a hardware architecture with input-channel parallelism, output-channel parallelism, and deep pipelining of the convolution window is designed. Within this architecture, a fully parallel multiply-add tree module is designed to accelerate the convolution operations, and an efficient window buffer module to implement the pipelined processing of convolution windows. Experimental results show that the proposed accelerator achieves an energy-efficiency ratio of 32.73 GOPS/W, 34% higher than existing solutions, while delivering a performance of 317.86 GOPS.

Keywords:
- Convolutional Neural Network (CNN)
- Hardware acceleration
- Field-Programmable Gate Array (FPGA)
- Computational parallelism
- Deep pipelining
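The core datapath named in the abstract is the fully parallel multiply-add tree: all products of a convolution window are computed simultaneously and then reduced through adder stages of logarithmic depth, so one window result is produced per cycle once the pipeline fills. Below is a minimal behavioral sketch in C++, assuming a 3×3 window (conv layer 1 in Table 1) and the 16-bit fixed-point quantization listed in Table 3; the name `mac_tree` and the 32-bit accumulator width are illustrative assumptions, not taken from the paper.

```cpp
#include <array>
#include <cstdint>
#include <iostream>

constexpr int K = 3;  // kernel size of conv layer 1 in Table 1

// Behavioral model of a fully parallel multiply-add tree: one multiplier per
// window element, followed by a pairwise adder-tree reduction. 16-bit
// fixed-point operands with a 32-bit accumulator are assumed.
int32_t mac_tree(const std::array<int16_t, K * K>& window,
                 const std::array<int16_t, K * K>& kernel) {
    // Stage 0: all K*K products are formed in parallel.
    std::array<int32_t, K * K> stage{};
    for (int i = 0; i < K * K; ++i)
        stage[i] = int32_t(window[i]) * int32_t(kernel[i]);

    // Stages 1..ceil(log2(K*K)): pairwise reduction; an odd leftover
    // element is carried forward to the next stage.
    for (int n = K * K; n > 1; ) {
        int half = n / 2;
        for (int i = 0; i < half; ++i)
            stage[i] = stage[2 * i] + stage[2 * i + 1];
        if (n % 2) { stage[half] = stage[n - 1]; ++half; }
        n = half;
    }
    return stage[0];
}

int main() {
    std::array<int16_t, K * K> w{}, k{};
    w.fill(2);
    k.fill(3);
    std::cout << mac_tree(w, k) << '\n';  // expect 9 * (2*3) = 54
}
```

In hardware, each reduction level corresponds to one registered pipeline stage, and replicating the tree across input and output channels gives the channel-level parallelism the architecture describes.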
Table 1  Structural parameters of the convolutional neural network

Layer              | Structure                              | Parameters
Conv layer 1       | 3×3 kernels, 15 kernels, stride 1      | 150
Activation layer 1 | none                                   | 0
Pooling layer 1    | 2×2 pooling, stride 2                  | 0
Conv layer 2       | 6×6 kernels, 20 kernels, stride 1      | 10820
Activation layer 2 | none                                   | 0
Pooling layer 2    | 2×2 pooling, stride 2                  | 0
Fully connected    | 10 output neurons                      | 3210
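The parameter counts in Table 1 follow from kernel_height × kernel_width × input_channels × kernel_count, plus one bias per kernel. A short check, assuming a single-channel input to conv layer 1; the 28×28 input size is an assumption not stated in this excerpt, but it is consistent with the table, since it yields the 320 flattened inputs implied by the 3210 fully connected parameters.

```cpp
#include <iostream>

// Parameters of a conv layer: k*k weights per input channel per kernel,
// plus one bias per kernel.
int conv_params(int k, int cin, int cout) { return k * k * cin * cout + cout; }

int main() {
    std::cout << "conv1: " << conv_params(3, 1, 15)  << '\n';  // 150
    std::cout << "conv2: " << conv_params(6, 15, 20) << '\n';  // 10820
    // Feature map (assumed 28x28 input, no padding):
    // 28 -> conv 3x3 -> 26 -> pool 2x2 -> 13 -> conv 6x6 -> 8 -> pool 2x2 -> 4
    int flat = 4 * 4 * 20;                                     // 320 inputs
    std::cout << "fc:    " << flat * 10 + 10 << '\n';          // 3210
}
```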
Table 3  Comparison with FPGA hardware accelerators in the literature

                           | Ref. [7]     | Ref. [11]    | Ref. [12]       | This work
FPGA                       | Zynq XC7Z045 | Zynq XC7Z045 | Virtex-7 VX690T | Cyclone V 5CGXF
Frequency (MHz)            | 150          | 100          | 150             | 100
DSP resources              | 780 (86.7%)  | 824 (91.6%)  | 1376 (38%)      | 342 (100%)
Quantization               | 16-bit fixed | 16-bit fixed | 16-bit fixed    | 16-bit fixed
Power (W)                  | 9.630        | 9.400        | 25.000          | 9.711
Performance (GOPS)         | 136.97       | 229.50       | 570.00          | 317.86
Energy efficiency (GOPS/W) | 14.22        | 24.42        | 22.80           | 32.73
References

[1] LIU Weibo, WANG Zidong, LIU Xiaohui, et al. A survey of deep neural network architectures and their applications[J]. Neurocomputing, 2017, 234: 11–26. doi: 10.1016/j.neucom.2016.12.038
[2] HAN Song, MAO Huizi, and DALLY W J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding[J]. arXiv preprint arXiv:1510.00149, 2015.
[3] COATES A, HUVAL B, WANG Tao, et al. Deep learning with COTS HPC systems[C]. Proceedings of the 30th International Conference on Machine Learning, Atlanta, USA, 2013: III-1337–III-1345.
[4] JOUPPI N P, YOUNG C, PATIL N, et al. In-datacenter performance analysis of a tensor processing unit[C]. Proceedings of the 44th Annual International Symposium on Computer Architecture, Toronto, Canada, 2017: 1–12. doi: 10.1145/3079856.3080246
[5] MOTAMEDI M, GYSEL P, AKELLA V, et al. Design space exploration of FPGA-based deep convolutional neural networks[C]. Proceedings of the 21st Asia and South Pacific Design Automation Conference, Macau, China, 2016: 575–580. doi: 10.1109/ASPDAC.2016.7428073
[6] ZHANG Jialiang and LI Jing. Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network[C]. Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, USA, 2017: 25–34. doi: 10.1145/3020078.3021698
[7] QIU Jiantao, WANG Jie, YAO Song, et al. Going deeper with embedded FPGA platform for convolutional neural network[C]. Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, USA, 2016: 26–35. doi: 10.1145/2847263.2847265
[8] YU Qi. Deep learning accelerator design and implementation based on FPGA[D]. [Master dissertation], University of Science and Technology of China, 2016: 30–38.
[9] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278–2324. doi: 10.1109/5.726791
[10] ABADI M, BARHAM P, CHEN Jianmin, et al. TensorFlow: A system for large-scale machine learning[C]. Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, Savannah, USA, 2016: 265–283.
[11] XIAO Qingcheng, LIANG Yun, LU Liqiang, et al. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs[C]. Proceedings of the 54th Annual Design Automation Conference, Austin, USA, 2017: 62. doi: 10.1145/3061639.3062244
[12] SHEN Junzhong, HUANG You, WANG Zelong, et al. Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA[C]. Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, USA, 2018: 97–106. doi: 10.1145/3174243.3174257