A Parallelism Strategy Optimization Search Algorithm Based on a Three-Dimensional Deformable CNN Acceleration Architecture
doi: 10.11999/JEIT210059
1. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
2. School of Electronic, Electrical, and Communication Engineering, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China
Abstract: Field Programmable Gate Arrays (FPGAs) are widely used for hardware acceleration of Convolutional Neural Networks (CNNs). To improve accelerator performance, Qu et al. (2021) proposed a three-dimensional deformable CNN acceleration architecture, but this architecture causes the parallelism design space to grow explosively, so the time needed to search for the optimal parallelism strategy surges, severely reducing the feasibility of implementing the accelerator. To address this problem, this paper proposes a fine-grained, iteratively optimized parallelism search algorithm. Through multiple rounds of iterative data filtering, the algorithm efficiently eliminates redundant parallelism schemes, compressing the search space by more than 99%. Pruning is further applied to discard invalid computation branches, successfully reducing the search time from the order of 10^6 hours to under 10 s. The algorithm is applicable to FPGA chips of different sizes, and the optimal parallelism strategies it finds deliver outstanding performance, achieving an average computing resource utilization (R1, R2) of (0.957, 0.962) across different chips.
Keywords:
- Field Programmable Gate Array (FPGA)
- Convolutional Neural Network (CNN)
- Hardware acceleration
Table 1 AlexNet network structure parameters

| Layer | Nin | Nout | SIZEin | SIZEout | SIZEker | Stride | Npad |
|-------|-----|------|--------|---------|---------|--------|------|
| CONV1 | 3 | 96 | 227 | 55 | 11 | 4 | 0 |
| POOL1 | 96 | 96 | 55 | 27 | 3 | 2 | 0 |
| CONV2 | 48 | 256 | 27 | 27 | 5 | 1 | 2 |
| POOL2 | 256 | 256 | 27 | 13 | 3 | 2 | 0 |
| CONV3 | 256 | 384 | 13 | 13 | 3 | 1 | 1 |
| CONV4 | 192 | 384 | 13 | 13 | 3 | 1 | 1 |
| CONV5 | 192 | 256 | 13 | 13 | 3 | 1 | 1 |
| POOL5 | 256 | 256 | 13 | 6 | 3 | 2 | 0 |
| FC1 | 9216 | 4096 | 1 | 1 | – | – | – |
| FC2 | 4096 | 4096 | 1 | 1 | – | – | – |
| FC3 | 4096 | 1000 | 1 | 1 | – | – | – |
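As a concrete reading of Table 1, the short Python sketch below derives per-layer operation counts #OPi and the ratios γi that step (1) of the algorithm in Table 3 starts from. This is a minimal sketch, not the authors' code: the 2·Nin·Nout·SIZEout²·SIZEker² count (one multiply plus one add per MAC) is a common convention and an assumption here, and the paper's exact counting may differ.

```python
# Minimal sketch (assumption, not the authors' code): per-layer operation
# counts #OP_i for the CONV layers of Table 1 and the ratios gamma_i used
# by step (1) of the algorithm in Table 3.

CONV_LAYERS = {
    # name: (Nin, Nout, SIZEout, SIZEker), taken from Table 1
    "CONV1": (3, 96, 55, 11),
    "CONV2": (48, 256, 27, 5),
    "CONV3": (256, 384, 13, 3),
    "CONV4": (192, 384, 13, 3),
    "CONV5": (192, 256, 13, 3),
}

def conv_ops(n_in: int, n_out: int, size_out: int, size_ker: int) -> int:
    """Operations of one convolution layer, counting multiply + add per MAC."""
    return 2 * n_in * n_out * size_out ** 2 * size_ker ** 2

ops = {name: conv_ops(*params) for name, params in CONV_LAYERS.items()}
total_ops = sum(ops.values())                              # #OP_total
gamma = {name: n / total_ops for name, n in ops.items()}   # ratios gamma_i

for name in CONV_LAYERS:
    print(f"{name}: #OP = {ops[name]:.3e}, gamma = {gamma[name]:.3f}")
```

With these ratios, step (2) of Table 3 would hand each layer roughly γi times the available DSPs, e.g. int(gamma["CONV3"] * 2760) on the KU060 of Table 4.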
Table 3 Fine-grained parallelism iteration algorithm

Input: number of available on-chip DSPs #DSPlimit, number of available BRAMs #BRAMlimit, the CNN network structure parameters, and α, β
Output: Parain, Paraout, and Paraseg

(1) Compute γi, the ratio of each layer's computation #OPi to the total network computation #OPtotal.
(2) Allocate the available on-chip DSPs to the layers in proportion to their computation: #DSPialloc ← γi · #DSPtotal.
(3) From the total computation and the total number of computing resources, compute the theoretical minimum cycle count #cyclebaseline.
(4) For layer i, traverse all discrete feasible values of Parain, Paraout, and ROWout (i.e., the Cartesian product of their three domains) to generate the full-combination parallelism configuration set S0i, and compute the corresponding #cyclei, #BRAMi, and #DSPi.
(5) Filter out the data set Si that satisfies the α, β constraints:
    Si ← select ele from S0i where (#cyclei/#cyclebaseline in [1−α, 1+α] and #DSPi/#DSPialloc in [1−β, 1+β])
(6) Coarse filtering, set S′i: no "KO" partial order holds between any two adjacent elements.
    for i in range(5):
        orders ← [(cycle, dsp, bram), (dsp, cycle, bram), (bram, cycle, dsp)]
        for k in range(3):
            Si.sort_ascend_by(orders[k])
            p ← 0
            for j in range(1, size(Si)):
                if σj KO σp then Si.drop(σp), p ← j
                else Si.drop(σj)
        S′i ← Si
(7) Fine filtering, set Ti: no "KO" partial order holds between any two elements.
    for i in range(5):
        S′i.sort_ascend_by((cycle, dsp, bram))
        for j in range(1, size(S′i)):
            for k in range(j):
                if σk KO σj then S′i.drop(σj), break
        Ti ← S′i
(8) Search with pruning.
    maxCycle ← INT_MAX, dspUsed ← 0, bramUsed ← 0
    def calc(i):
        if i == 5 then update(maxCycle), return
        for j in range(size(Ti)):
            tmpDsp ← dspUsed + dsp_j^i, tmpBram ← bramUsed + bram_j^i
            if not (tmpDsp > dspTotal or tmpBram > bramTotal or cycle_j^i ≥ maxCycle) then
                dspUsed ← tmpDsp, bramUsed ← tmpBram
                calc(i+1)
                dspUsed ← tmpDsp − dsp_j^i, bramUsed ← tmpBram − bram_j^i
            else continue
    calc(0)
(9) Select the parallelism element corresponding to maxCycle (i.e., min{max{#cyclei}}) and output the parameter information of the optimal parallelism under the constraints.
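For readers who want to experiment with steps (6)–(8), the following self-contained Python sketch re-implements the two core mechanisms: Pareto-style filtering under an assumed "KO" dominance relation, and the depth-first search with resource and bound pruning. All identifiers (Config, ko_dominates, pareto_filter, prune_search) are illustrative assumptions, and the KO relation is interpreted here as "no worse in cycle, dsp, and bram, and strictly better in at least one", which may differ from the paper's exact definition.

```python
# Minimal sketch (assumption, not the authors' implementation) of the
# filtering and pruned search in steps (6)-(8) of Table 3.
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Config:
    cycle: int  # computation cycles of the layer under this configuration
    dsp: int    # DSPs consumed
    bram: int   # BRAMs consumed

def ko_dominates(a: Config, b: Config) -> bool:
    """Assumed KO partial order: a dominates b."""
    return (a.cycle <= b.cycle and a.dsp <= b.dsp and a.bram <= b.bram
            and (a.cycle, a.dsp, a.bram) != (b.cycle, b.dsp, b.bram))

def pareto_filter(configs: List[Config]) -> List[Config]:
    """Step (7): keep only configurations not dominated by any other."""
    ordered = sorted(configs, key=lambda c: (c.cycle, c.dsp, c.bram))
    kept: List[Config] = []
    for c in ordered:
        if not any(ko_dominates(k, c) for k in kept):
            kept.append(c)
    return kept

def prune_search(layers: List[List[Config]], dsp_total: int,
                 bram_total: int) -> Optional[Dict]:
    """Step (8): pick one configuration per layer so all layers fit in the
    chip's resources and max_i(cycle_i) is minimized."""
    best = {"max_cycle": float("inf"), "choice": None}

    def dfs(i: int, dsp_used: int, bram_used: int,
            cur_max: int, choice: List[Config]) -> None:
        if cur_max >= best["max_cycle"]:
            return  # bound pruning: this branch cannot beat the incumbent
        if i == len(layers):
            best["max_cycle"], best["choice"] = cur_max, list(choice)
            return
        for cfg in layers[i]:
            if (dsp_used + cfg.dsp > dsp_total
                    or bram_used + cfg.bram > bram_total):
                continue  # resource pruning, as in step (8)
            choice.append(cfg)
            dfs(i + 1, dsp_used + cfg.dsp, bram_used + cfg.bram,
                max(cur_max, cfg.cycle), choice)
            choice.pop()

    dfs(0, 0, 0, 0, [])
    return best if best["choice"] is not None else None
```

In use, layers would hold one pareto_filter-ed candidate list Ti per CONV layer; the returned best["max_cycle"] then corresponds to the min{max{#cyclei}} objective selected in step (9).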
Table 4 Resource utilization, computation count, and search time of the AlexNet accelerator on FPGAs of different sizes

| FPGA model | DSPs | R1 | R2 | Original computation count | Compression ratio (%) | Execution time (s) |
|---|---|---|---|---|---|---|
| Arria10 GT1150 | 1518 | 0.987 | 0.989 | 5.683×10^7 | 99.892 | 1.544 |
| KU060 | 2760 | 0.947 | 0.951 | 3.026×10^8 | 99.979 | 6.444 |
| Virtex7 VX485T | 2800 | 0.936 | 0.941 | 9.903×10^8 | 99.994 | 5.841 |
| Virtex7 VX690T | 3600 | 0.960 | 0.967 | 2.082×10^8 | 99.998 | 2.775 |
| KCU1500 | 5520 | 0.955 | 0.962 | 5.772×10^9 | 99.999 | 8.115 |
Table 5 Performance comparison of AlexNet accelerators

| Metric | Ref. [4] | Ref. [11] | Ref. [12] | Ref. [8] | This work |
|---|---|---|---|---|---|
| Quantization width | 16-bit fixed point | 16-bit fixed point | 16-bit fixed point | 16-bit fixed point | 16-bit fixed point |
| Frequency (MHz) | 250/500 | 200 | 150 | 220 | 230 |
| FPGA model | KCU1500 | Arria10 GX1150 | Zynq XC7Z045 | KCU1500 | KCU1500 |
| Throughput (GOP/s) | 2335.4 | 584.8 | 137.0 | 1633.0 | 2425.5 |
| Power efficiency (GOP/s/W) | 37.31 | N/A | 14.21 | 72.31 | 62.35 |
| Resource utilization (R1, R2) | (0.42, 0.55) | (0.48, 0.48) | (0.51, 0.59) | (0.67, 0.76) | (0.96, 0.96) |
References
[1] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278–2324. doi: 10.1109/5.726791
[2] QU Xinyuan, HUANG Zhihong, XU Yu, et al. Cheetah: An accurate assessment mechanism and a high-throughput acceleration architecture oriented toward resource efficiency[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2021, 40(5): 878–891. doi: 10.1109/TCAD.2020.3011650
[3] REGGIANI E, RABOZZI M, NESTOROV A M, et al. Pareto optimal design space exploration for accelerated CNN on FPGA[C]. 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Rio de Janeiro, Brazil, 2019: 107–114. doi: 10.1109/IPDPSW.2019.00028
[4] YU Xiaoyu, WANG Yuwei, MIAO Jie, et al. A data-center FPGA acceleration platform for convolutional neural networks[C]. 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 2019: 151–158. doi: 10.1109/FPL.2019.00032
[5] LIU Zhiqiang, CHOW P, XU Jinwei, et al. A uniform architecture design for accelerating 2D and 3D CNNs on FPGAs[J]. Electronics, 2019, 8(1): 65. doi: 10.3390/electronics8010065
[6] LI Huimin, FAN Xitian, JIAO Li, et al. A high performance FPGA-based accelerator for large-scale convolutional neural networks[C]. 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Lausanne, Switzerland, 2016: 1–9. doi: 10.1109/FPL.2016.7577308
[7] QIU Jiantao, WANG Jie, YAO Song, et al. Going deeper with embedded FPGA platform for convolutional neural network[C]. The 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, California, USA, 2016: 26–35.
[8] ZHANG Xiaofan, WANG Junsong, ZHU Chao, et al. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs[C]. 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Diego, USA, 2018: 1–8. doi: 10.1145/3240765.3240801
[9] LIU Zhiqiang, DOU Yong, JIANG Jingfei, et al. Automatic code generation of convolutional neural networks in FPGA implementation[C]. 2016 International Conference on Field-Programmable Technology (FPT), Xi'an, China, 2016: 61–68. doi: 10.1109/FPT.2016.7929190
[10] KRIZHEVSKY A, SUTSKEVER I, and HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84–90. doi: 10.1145/3065386
[11] MA Yufei, CAO Yu, VRUDHULA S, et al. Optimizing the convolution operation to accelerate deep neural networks on FPGA[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018, 26(7): 1354–1367. doi: 10.1109/TVLSI.2018.2815603
[12] GUO Kaiyuan, SUI Lingzhi, QIU Jiantao, et al. Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018, 37(1): 35–47. doi: 10.1109/TCAD.2017.2705069
[13] ZHANG Chen, SUN Guangyu, FANG Zhenman, et al. Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019, 38(11): 2072–2085. doi: 10.1109/TCAD.2017.2785257
[14] ZHANG Jialiang and LI Jing. Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network[C]. The 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, California, USA, 2017: 25–34. doi: 10.1145/3020078.3021698
[15] LIU Zhiqiang, DOU Yong, JIANG Jingfei, et al. Throughput-optimized FPGA accelerator for deep convolutional neural networks[J]. ACM Transactions on Reconfigurable Technology and Systems, 2017, 10(3): 17. doi: 10.1145/3079758