Calculation Optimization for Convolutional Neural Networks and FPGA-based Accelerator Design Exploiting Parameter Sparsity
doi: 10.11999/JEIT170819
Funds:
The National Science and Technology Major Project of the Ministry of Science and Technology of China (2016ZX01012101); the National Natural Science Foundation of China (61572520, 61521003)
Abstract: The application of Convolutional Neural Networks (CNN) on embedded devices is constrained by real-time requirements, while CNN convolution computations exhibit a large degree of sparsity. This paper proposes an FPGA-based CNN accelerator that exploits this sparsity to improve computation speed. First, the sparsity characteristics of CNN convolution computation are identified. Second, to exploit parameter sparsity, the CNN convolutions are converted into matrix multiplications. Finally, an implementation of a parallel matrix multiplier on FPGA is proposed. Simulation results on a Virtex-7 VC707 FPGA show that the design reduces computation time by 19% compared with a traditional CNN accelerator. This sparsity-based simplification of the CNN computation can be implemented not only on FPGAs but also migrated to other embedded platforms.
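The pipeline the abstract describes can be illustrated in software: unroll the convolution windows into a matrix (the standard im2col transformation), then perform the convolution as one matrix product in which rows containing zero weights are skipped. The sketch below is an illustrative NumPy model of that general technique, not the paper's FPGA design; all function names are hypothetical.

```python
import numpy as np

def im2col(x, kh, kw):
    # Unroll each kh-by-kw sliding window of a single-channel input
    # into one column (stride 1, no padding).
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def sparse_gemm(w, cols):
    # Multiply only by the nonzero weights of each row: this is where
    # parameter sparsity saves multiply-accumulate operations.
    out = np.zeros((w.shape[0], cols.shape[1]))
    for r in range(w.shape[0]):
        nz = np.nonzero(w[r])[0]          # indices of nonzero weights
        out[r] = w[r, nz] @ cols[nz, :]   # skip the zero terms entirely
    return out

def conv_as_gemm(x, kernels):
    # kernels: (n_k, kh, kw). The whole convolution layer collapses into
    # a single (n_k, kh*kw) x (kh*kw, out_h*out_w) matrix product.
    n_k, kh, kw = kernels.shape
    cols = im2col(x, kh, kw)
    w = kernels.reshape(n_k, kh * kw)
    out = sparse_gemm(w, cols)
    return out.reshape(n_k, x.shape[0] - kh + 1, x.shape[1] - kw + 1)
```

In hardware, the same reformulation lets a single parallel matrix multiplier serve every convolution layer, and pruned (zero) weights translate directly into skipped multiply-accumulate cycles.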
Key words:
- Convolutional Neural Network (CNN) /
- Sparsity /
- Computational optimization /
- Matrix multiplier /
- FPGA