Design of Convolutional Neural Networks Accelerator Based on Fast Filter Algorithm
doi: 10.11999/JEIT190037
-
College of Electronics Engineering / International Semiconductor College, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
-
Abstract: To reduce the computational cost of Convolutional Neural Networks (CNN), this paper introduces the two-dimensional fast filtering algorithm into the CNN and proposes a hardware architecture that accelerates the CNN layer by layer on an FPGA. First, a line-buffer loop control unit is designed with the loop transformation method to manage efficiently the input feature-map data across different convolution windows and between different layers; a flag signal starts the convolution computation acceleration unit so that the layers are accelerated one after another. Second, a convolution computation acceleration unit based on a 4-parallel fast filtering algorithm is designed; it is implemented as a low-complexity parallel filtering structure composed of several small sub-filters. The designed CNN accelerator circuit is tested with the MNIST handwritten-digit set. The results show that on a Xilinx Kintex-7 platform with a 100 MHz input clock, the circuit achieves a computational performance of 20.49 GOPS and a recognition rate of 98.68%, confirming that reducing the amount of CNN computation improves the computational performance of the circuit.
-
Key words:
- Convolutional Neural Network (CNN) /
- Fast filtering algorithm /
- FPGA /
- Parallel structure
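The abstract describes a convolution unit built from a parallel filtering structure of several small sub-filters. As an illustrative sketch of the underlying idea (not the authors' hardware design), the classic 2-parallel fast FIR algorithm computes two polyphase output streams with three sub-filter convolutions instead of four; a 4-parallel unit of the kind the paper uses can be obtained by applying the same decomposition recursively. The function names below are hypothetical:

```python
def conv(a, b):
    """Direct full convolution: len(a) + len(b) - 1 output samples."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def fir_2parallel_fast(x, h):
    """2-parallel fast FIR: 3 sub-filter convolutions instead of 4.

    Polyphase split: x0/h0 hold even-indexed samples, x1/h1 odd-indexed.
        Y0 = H0*X0 + z^-1 * H1*X1
        Y1 = (H0 + H1)*(X0 + X1) - H0*X0 - H1*X1
    """
    n_out = len(x) + len(h) - 1          # true output length, before padding
    if len(x) % 2:                       # pad sequences to even length
        x = x + [0]
    if len(h) % 2:
        h = h + [0]
    x0, x1 = x[0::2], x[1::2]
    h0, h1 = h[0::2], h[1::2]
    p0 = conv(h0, x0)                    # sub-filter 1
    p1 = conv(h1, x1)                    # sub-filter 2
    p2 = conv([a + b for a, b in zip(h0, h1)],
              [a + b for a, b in zip(x0, x1)])    # sub-filter 3 (shared)
    y1 = [c - a - b for a, b, c in zip(p0, p1, p2)]
    y0 = [a + b for a, b in zip(p0 + [0], [0] + p1)]  # p0 plus delayed p1
    # Interleave the two output phases back into one stream.
    y = []
    for e, o in zip(y0, y1 + [0]):
        y.extend([e, o])
    return y[:n_out]

x = [1, 2, 3, 4, 5, 6]
h = [1, -1, 2, 1]
assert fir_2parallel_fast(x, h) == conv(x, h)
```

The saving is in the multipliers: each sub-filter is half the original length, so three half-length filters replace one full-length filter per output phase, which is what makes the structure attractive for a DSP-limited FPGA implementation.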
-
Table 1  Comparison of the MATLAB and FPGA implementations
Type     Time (ms/frame)   Error (per 10000 frames)   Data type
.m file  0.7854            1.19%                      Double precision
.v file  0.01986           1.32%                      16-bit fixed point
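Table 1 compares the double-precision MATLAB model against the 16-bit fixed-point Verilog implementation. As a hedged sketch of how such a conversion works (the paper states only that 16-bit fixed-point numbers are used; the fractional width `frac_bits` below is an assumption), a real-valued weight is scaled, rounded, and saturated to the signed 16-bit range:

```python
def float_to_fixed(value, frac_bits=12, word_bits=16):
    """Quantize a real value to a signed two's-complement fixed-point word.

    frac_bits is an assumed fractional width for illustration; the paper
    does not specify its Q-format.
    """
    scale = 1 << frac_bits
    q = round(value * scale)
    # Saturate to the representable signed 16-bit range.
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, q))

def fixed_to_float(q, frac_bits=12):
    """Recover the real value represented by a fixed-point word."""
    return q / (1 << frac_bits)

w = 0.78                      # hypothetical weight value
q = float_to_fixed(w)
err = abs(fixed_to_float(q) - w)
# Rounding error is bounded by half an LSB: 2**-(frac_bits + 1).
assert err <= 2 ** -(12 + 1)
```

The small accuracy gap in Table 1 (1.19% versus 1.32% error) is consistent with this kind of bounded per-weight quantization error accumulating through the network.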
-