Efficient Hardware Optimization Strategies for Deep Neural Network Acceleration Chips
doi: 10.11999/JEIT210002
1. National ASIC Engineering Center, School of Electronic Science and Engineering, Southeast University, Nanjing 210096, China
2. State Key Laboratory of ASIC and System, Fudan University, Shanghai 200433, China
Abstract: Lightweight neural networks deployed on low-power platforms are an effective solution in Artificial Intelligence (AI) and Internet of Things (IoT) applications such as Unmanned Aerial Vehicle (UAV) detection and autonomous driving. With limited resources, however, it is very challenging to build a Deep Neural Network (DNN) accelerator that delivers both high accuracy and low latency. This paper proposes a series of efficient hardware optimization strategies for this problem: a stackable, shared Processing Engine (PE) that balances the inconsistent data-reuse and memory-access patterns of different convolution types; adjustable loop parallelism and channel augmentation, which effectively widen the access bandwidth between the accelerator and external memory and raise the computational efficiency of the shallow DNN layers; and a pre-loading workflow that increases the overall parallelism of the heterogeneous system. Verified on a Xilinx Ultra96 V2 board, these hardware optimization strategies effectively improve DNN acceleration chip designs such as iSmart3-SkyNet and SkrSkr-SkyNet. The optimized accelerator processes 78.576 frames per second at an energy cost of 0.068 J per image.

Keywords:
- Deep neural network
- Object detection
- Neural network accelerator
- Low power
- Hardware optimization
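Of the three strategies above, channel augmentation is the easiest to illustrate concretely. The sketch below shows one plausible realization (my reading of the technique, not the paper's exact implementation): the first SkyNet layer has only 3 input channels, so a PE array sized for the deeper layers sits mostly idle and DDR transactions stay narrow; folding spatial blocks into the channel dimension (a space-to-depth rearrangement) widens every memory access and raises PE utilization without changing the data volume. The function name `channel_augment` and the factor of 2 are illustrative assumptions.

```python
import numpy as np

def channel_augment(x: np.ndarray, factor: int = 2) -> np.ndarray:
    """Fold factor x factor spatial blocks into channels:
    (C, H, W) -> (C * factor**2, H // factor, W // factor)."""
    c, h, w = x.shape
    assert h % factor == 0 and w % factor == 0
    # Split each spatial axis into (coarse, fine) parts ...
    x = x.reshape(c, h // factor, factor, w // factor, factor)
    # ... then move the fine parts next to the channel axis and merge them.
    return x.transpose(0, 2, 4, 1, 3).reshape(c * factor * factor,
                                              h // factor, w // factor)

x = np.random.rand(3, 160, 320).astype(np.float32)  # SkyNet input shape
y = channel_augment(x)
print(y.shape)  # (12, 80, 160): 4x the channels, same total data
```

Every element of the input survives the rearrangement, so the convolution weights can be re-laid-out offline to match the folded tensor.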
Table 1 SkyNet architecture and inference speed of each bundle

| Bundle | Layer | Input size | Operation | Ops, share (%) | Latency share (%) |
|--------|-------|------------|-----------|----------------|-------------------|
| #1 | 1 | 3×160×320 | DW-Conv3 | 119.61M, 20.60 | 33.90 |
| | 2 | 3×160×320 | PW-Conv1 | | |
| | 3 | 48×160×320 | POOLING | | |
| #2 | 4 | 48×80×160 | DW-Conv3 | 86.02M, 14.42 | 16.54 |
| | 5 | 48×80×160 | PW-Conv1 | | |
| | 6 | 96×80×160 | POOLING | | |
| #3 | 7 | 96×40×80 | DW-Conv3 | 61.75M, 10.36 | 6.23 |
| | 8 | 96×40×80 | PW-Conv1 | | |
| | 9 | 192×40×80 | POOLING | | |
| #4 | 10 | 192×20×40 | DW-Conv3 | 60.36M, 10.13 | 4.92 |
| | 11 | 192×20×40 | PW-Conv1 | | |
| #5 | 12 | 384×20×40 | DW-Conv3 | 160.05M, 26.85 | 12.43 |
| | 13 | 384×20×40 | PW-Conv1 | | |
| #6 | – | merge output of layer 9 | | 107.52M, 18.04 | 20.08 |
| | 14 | 1280×20×40 | [bypass] DW-Conv3 | | |
| | 15 | 1280×20×40 | PW-Conv1 | | |
| #7 | 16 | 96×20×40 | PW-Conv1 | 0.77M, 0.14 | 0.10 |
| – | 17 | 10×20×40 | bounding-box regression | | 0.16 |
| CPU | – | – | – | – | 5.64 |
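The per-bundle operation counts in Table 1 can be sanity-checked for bundle #4, the one bundle whose output channel count is fully determined by the table (layer 12's input has 384 channels, so layer 11's PW-Conv1 must expand 192 → 384 channels). The sketch below assumes the "ops" column counts multiply-accumulates (MACs); it also makes concrete why depth-wise and point-wise convolutions have the inconsistent data-reuse patterns that the shared PE is designed to balance: a DW-Conv3 does only 9 MACs per input pixel, while a PW-Conv1 does `Cout` MACs per input pixel.

```python
def dw_conv3_macs(c, h, w):
    # 3x3 depth-wise conv: one 3x3 filter per channel, no cross-channel mixing
    return c * h * w * 9

def pw_conv1_macs(cin, cout, h, w):
    # 1x1 point-wise conv: full cross-channel mixing at every pixel
    return cin * cout * h * w

# Bundle #4: DW-Conv3 on 192x20x40, then PW-Conv1 expanding 192 -> 384 channels
bundle4 = dw_conv3_macs(192, 20, 40) + pw_conv1_macs(192, 384, 20, 40)
print(bundle4 / 1e6)  # 60.3648, matching the 60.36M reported for bundle #4
```

Note how lopsided the split is: the point-wise layer accounts for roughly 98% of the bundle's MACs, while the depth-wise layer touches the same amount of feature-map data with far fewer operations.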
Table 2 Comparison of the effects of the optimization strategies

| Accelerator | iSmart3 [9] | SEUer A | Skrskr [10] | SEUer B |
|---|---|---|---|---|
| Network model | SkyNet | SkyNet | SkyNet | SkyNet |
| Quantization | A9/W11 | A9/W11 | A8/W6 | A8/W6 |
| Hardware platform | Ultra96 V2 | Ultra96 V2 | Ultra96 V2 | Ultra96 V2 |
| Accuracy (DJI) | 0.716 | 0.724 | 0.731 | 0.731 |
| Clock frequency (MHz) | 215 | 215 | 300 | 300 |
| DSPs | 329 | 287 | 360 | 360 |
| LUTs (k) | 54 | 54 | 56 | 46 |
| FFs (k) | 60 | 70 | 68 | 51 |
| Frame rate (fps) | 25.05 | 37.393 | 52.429 | 78.576 |
| GOPS/W | 3.21 | 5.95 | 7.22 | 11.19 |
| Energy/picture (J) | 0.289 | 0.135 | 0.129 | 0.068 |
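The last two rows of Table 2 are linked by the average board power, since energy per picture is power divided by frame rate (E = P / fps). The snippet below is illustrative arithmetic on the table's published numbers, not an extra measurement; it recovers the implied average power of each design from the fps and energy rows.

```python
# (fps, energy per picture in J) taken from Table 2
results = {
    "iSmart3": (25.05, 0.289),
    "SEUer A": (37.393, 0.135),
    "Skrskr":  (52.429, 0.129),
    "SEUer B": (78.576, 0.068),
}

for name, (fps, energy_j) in results.items():
    power_w = fps * energy_j  # implied average power: P = fps * E
    print(f"{name}: ~{power_w:.2f} W")
```

All four designs land in a similar single-digit-watt envelope, which is why SEUer B's 3.1× frame-rate gain over iSmart3 translates almost directly into a 4.3× reduction in energy per picture.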
[1] WANG Wei, ZHOU Kaili, WANG Yichang, et al. Design of convolutional neural networks accelerator based on fast filter algorithm[J]. Journal of Electronics & Information Technology, 2019, 41(11): 2578–2584. doi: 10.11999/JEIT190037
[2] ZHANG Xiaofan, WANG Junsong, ZHU Chao, et al. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs[C]. 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Diego, USA, 2018: 1–8.
[3] LI Huimin, FAN Xitian, JIAO Li, et al. A high performance FPGA-based accelerator for large-scale convolutional neural networks[C]. The 26th International Conference on Field Programmable Logic and Applications (FPL), Lausanne, Switzerland, 2016: 1–9.
[4] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: Unified, real-time object detection[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 779–788.
[5] REN Shaoqing, HE Kaiming, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149. doi: 10.1109/TPAMI.2016.2577031
[6] TAN Mingxing, PANG Ruoming, and LE Q V. EfficientDet: Scalable and efficient object detection[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 10781–10790.
[7] YU Yunxuan, WU Chen, ZHAO Tiandong, et al. OPU: An FPGA-based overlay processor for convolutional neural networks[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2020, 28(1): 35–47. doi: 10.1109/TVLSI.2019.2939726
[8] YU Yunxuan, ZHAO Tiandong, WANG Kun, et al. Light-OPU: An FPGA-based overlay processor for lightweight convolutional neural networks[C]. 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, USA, 2020: 122–132.
[9] ZHANG Xiaofan, LU Haoming, HAO Cong, et al. SkyNet: A hardware-efficient method for object detection and tracking on embedded systems[J]. arXiv: 1909.09709, 2019.
[10] JIANG W, LIU X, SUN H, et al. SkrSkr: DAC-SDC 2020 2nd place winner in FPGA track[EB/OL]. https://github.com/jiangwx/SkrSkr/, 2020.
[11] ZHANG Chen, LI Peng, SUN Guangyu, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks[C]. 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, USA, 2015: 161–170.
[12] HAO Cong, ZHANG Xiaofan, LI Yuhong, et al. FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge[C]. The 56th ACM/IEEE Design Automation Conference (DAC), Las Vegas, USA, 2019: 1–6.
[13] MOTAMEDI M, GYSEL P, AKELLA V, et al. Design space exploration of FPGA-based deep convolutional neural networks[C]. The 21st Asia and South Pacific Design Automation Conference (ASP-DAC), Macao, China, 2016: 575–580.
[14] FAN Hongxiang, LIU Shuanglong, FERIANC M, et al. A real-time object detection accelerator with compressed SSDLite on FPGA[C]. 2018 International Conference on Field-Programmable Technology (FPT), Naha, Japan, 2018: 14–21.
[15] LI Fanrong, MO Zitao, WANG Peisong, et al. A system-level solution for low-power object detection[C]. 2019 IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Korea (South), 2019: 2461–2468.
[16] DONG Zhen, WANG Dequan, HUANG Qijing, et al. CoDeNet: Efficient deployment of input-adaptive object detection on embedded FPGAs[J]. arXiv: 2006.08357, 2020.
[17] WU Di, ZHANG Yu, JIA Xijie, et al. A high-performance CNN processor based on FPGA for MobileNets[C]. The 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 2019: 136–143.