

A Parallelism Strategy Optimization Search Algorithm Based on Three-dimensional Deformable CNN Acceleration Architecture

QU Xinyuan, XU Yu, HUANG Zhihong, CAI Gang, FANG Zhen

Citation: QU Xinyuan, XU Yu, HUANG Zhihong, CAI Gang, FANG Zhen. A Parallelism Strategy Optimization Search Algorithm Based on Three-dimensional Deformable CNN Acceleration Architecture[J]. Journal of Electronics & Information Technology, 2022, 44(4): 1503-1512. doi: 10.11999/JEIT210059


doi: 10.11999/JEIT210059
Funds: The National Natural Science Foundation of China (61704173, 61974146); The Major Program of Beijing Science and Technology (Z171100000117019)
About the authors:

    QU Xinyuan: female, born in 1994, Ph.D. candidate; research interest: FPGA-based CNN accelerator architecture design

    XU Yu: male, born in 1990, Ph.D.; research interest: design automation for large-scale integrated circuits

    HUANG Zhihong: male, born in 1984, senior engineer; research interests: programmable chip design and FPGA hardware acceleration

    CAI Gang: male, born in 1980, professor-level senior engineer, master's supervisor; research interests: integrated circuit design, radiation-hardened design, and artificial intelligence system design

    FANG Zhen: male, born in 1976, professor, Ph.D. supervisor; research interests: novel medical electronics and medical artificial intelligence technology

    Corresponding author:

    HUANG Zhihong  huangzhihong@mail.ie.ac.cn

  • 1) The data reported in this section are experimental results for the AlexNet accelerator on the KCU1500. 2) (Para_in = 1, Para_seg = 2) is equivalent to Para_in = 1/2; (Para_in = 3, Para_seg = 5) is equivalent to Para_in = 3/5; and so on.
  • CLC number: TN47


  • Abstract: Field Programmable Gate Arrays (FPGAs) are widely used for the hardware acceleration of Convolutional Neural Networks (CNNs). To optimize accelerator performance, Qu et al. (2021) proposed a three-dimensional deformable CNN acceleration architecture. That architecture, however, makes the parallelism design space grow explosively, and the time needed to search for the optimal parallelism strategy rises sharply, severely reducing the feasibility of implementing the accelerator. To address this problem, this paper proposes a fine-grained, iteratively optimizing parallelism search algorithm. Through multiple rounds of iterative data filtering, the algorithm efficiently eliminates redundant parallelism schemes, compressing more than 99% of the search space. It also applies pruning to remove invalid computation branches, successfully reducing the required computation time from the order of 10⁶ hours to under 10 s. The algorithm is applicable to FPGA chips of different specifications, and the optimal parallelism strategies it finds deliver outstanding performance, achieving average computing-resource utilization (R1, R2) of (0.957, 0.962) across different chips.
  • Fig. 1  Schematic of the single-layer structure of the CNN accelerator

    Fig. 2  Schematic of segmented matrix-convolution computation

    Fig. 3  Computation volume of the algorithm's search as a function of α when β = 0.20

    Fig. 4  Heat maps of AlexNet accelerator performance versus (α, β) on FPGAs of different specifications
    Table 1  AlexNet network structure parameters

    | Layer | N_in | N_out | SIZE_in | SIZE_out | SIZE_ker | Stride | N_pad |
    | CONV1 | 3    | 96    | 227     | 55       | 11       | 4      | 0     |
    | POOL1 | 96   | 96    | 55      | 27       | 3        | 2      | 0     |
    | CONV2 | 48   | 256   | 27      | 27       | 5        | 1      | 2     |
    | POOL2 | 256  | 256   | 27      | 13       | 3        | 2      | 0     |
    | CONV3 | 256  | 384   | 13      | 13       | 3        | 1      | 1     |
    | CONV4 | 192  | 384   | 13      | 13       | 3        | 1      | 1     |
    | CONV5 | 192  | 256   | 13      | 13       | 3        | 1      | 1     |
    | POOL5 | 256  | 256   | 13      | 6        | 3        | 2      | 0     |
    | FC1   | 9216 | 4096  | 1       | 1        | –        | –      | –     |
    | FC2   | 4096 | 4096  | 1       | 1        | –        | –      | –     |
    | FC3   | 4096 | 1000  | 1       | 1        | –        | –      | –     |
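
    The per-layer operation counts that drive step (1) of the search algorithm (Table 3) follow directly from these parameters. Below is a minimal sketch, assuming operations are counted as 2 × MACs (one multiply plus one accumulate) and covering only the convolution layers of Table 1; the dictionary layout and function names are illustrative, not the paper's code:

    ```python
    # Hypothetical sketch: per-layer operation counts for the AlexNet
    # convolution layers of Table 1, and each layer's share gamma_i of the
    # total (step (1) of the search algorithm). Ops counted as 2 * MACs.
    CONV_LAYERS = {  # name: (N_in, N_out, SIZE_out, SIZE_ker)
        "CONV1": (3, 96, 55, 11),
        "CONV2": (48, 256, 27, 5),
        "CONV3": (256, 384, 13, 3),
        "CONV4": (192, 384, 13, 3),
        "CONV5": (192, 256, 13, 3),
    }

    def conv_ops(n_in, n_out, size_out, size_ker):
        """Operation count of one convolution layer: 2 ops per MAC,
        one MAC per (input channel x output pixel x kernel element)."""
        return 2 * n_in * n_out * size_out**2 * size_ker**2

    ops = {name: conv_ops(*p) for name, p in CONV_LAYERS.items()}
    total = sum(ops.values())
    gamma = {name: op / total for name, op in ops.items()}
    ```

    By this count, CONV2 carries the largest share of the convolution workload, so step (2) of the algorithm would allocate it the most DSPs.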

    Table 2  Resource utilization of different FPGA CNN accelerators

    | Ref. (VGG) | R1   | R2   | Ref. (AlexNet) | R1   | R2   |
    | [5]        | 0.80 | 0.80 | [3]            | 0.32 | 0.38 |
    | [11]       | 0.71 | 0.71 | [4]            | 0.42 | 0.55 |
    | [14]       | 0.77 | 0.84 | [6]            | 0.50 | 0.85 |
    | [8]        | 0.78 | 0.99 | [8]            | 0.67 | 0.76 |
    | [15]       | 0.66 | 0.80 | [14]           | 0.62 | 0.78 |

    Table 3  Fine-grained parallelism iteration algorithm

     Input: number of available on-chip DSPs #DSP_limit, number of available BRAMs #BRAM_limit, the CNN network structure parameters, and α, β
     Output: Para_in, Para_out, and Para_seg
     (1) Compute γ_i, the ratio of each layer's operation count #OP_i to the network's total operation count #OP_total.
     (2) Allocate the available on-chip DSPs to the layers in proportion to their computation shares; layer i receives #DSP_i^alloc ← γ_i·#DSP_total.
     (3) From the total operation count and the total number of computing resources, derive the theoretical minimum cycle count #cycle_baseline.
     (4) For layer i, traverse all discrete feasible values of Para_in, Para_out, and ROW_out (i.e., the Cartesian product of their three domains) to generate the full-combination parallelism configuration set S0_i, and compute the corresponding #cycle_i, #BRAM_i, and #DSP_i.
     (5) Filter out the data set S_i that satisfies the α, β constraints:
     S_i ← select ele from S0_i where (#cycle_i/#cycle_baseline in [1−α, 1+α] and #DSP_i/#DSP_i^alloc in [1−β, 1+β])
     (6) Coarse filtering, yielding set S'_i: no "KO" partial-order relation holds between any two adjacent elements.
       for i in range(5):
         orders ← [(cycle, dsp, bram), (dsp, cycle, bram), (bram, cycle, dsp)]
         for k in range(3):
           S_i.sort_ascend_by(orders[k])
           p ← 0
           for j in range(1, size(S_i)):
             if σ_j KO σ_p then S_i.drop(σ_p), p ← j
             else S_i.drop(σ_j)
         S'_i ← S_i
     (7) Fine filtering, yielding set T_i: no "KO" partial-order relation holds between any two elements.
      for i in range(5):
       S'_i.sort_ascend_by((cycle, dsp, bram))
       for j in range(1, size(S'_i)):
          for k in range(j):
           if σ_k KO σ_j then S'_i.drop(σ_j), break
       T_i ← S'_i
     (8) Search pruning.
      maxCycle ← INT_MAX, dspUsed ← 0, bramUsed ← 0
      def calc(i):
       if i == 5 then
         update(maxCycle)
         return
       for j in range(size(T_i)):
         tmpDsp ← dspUsed + dsp_ji, tmpBram ← bramUsed + bram_ji
         if not (tmpDsp > dspTotal or tmpBram > bramTotal or
           cycle_ji ≥ maxCycle) then
           dspUsed ← tmpDsp, bramUsed ← tmpBram
           calc(i + 1)
           dspUsed ← tmpDsp − dsp_ji, bramUsed ← tmpBram − bram_ji
         else
           continue
      calc(0)
     (9) Select the parallelism element corresponding to maxCycle (i.e., min{max{#cycle_i}}) and output the parameter information of the optimal parallelism strategy under the constraints.
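
    To make steps (7) and (8) concrete, here is a minimal Python sketch of a KO (Pareto-dominance) filter and a pruned depth-first search. The tuple layout (cycle, dsp, bram), the function names, and the per-layer candidate lists are illustrative assumptions, not the authors' implementation:

    ```python
    # Hypothetical sketch of steps (7)-(8). Each candidate configuration of
    # one layer is a tuple (cycle, dsp, bram). Candidate a "KO"s candidate b
    # when a is no worse in every field and differs from b (Pareto dominance).
    def ko(a, b):
        return all(x <= y for x, y in zip(a, b)) and a != b

    def fine_filter(cands):
        """Step (7): keep only candidates not KO'd by any other (Pareto front)."""
        cands = sorted(cands)                    # ascending by (cycle, dsp, bram)
        front = []
        for c in cands:
            # Checking against the surviving front suffices: KO is transitive.
            if not any(ko(p, c) for p in front):
                front.append(c)
        return front

    def search(layers, dsp_total, bram_total):
        """Step (8): depth-first search over per-layer candidate lists,
        pruning branches that exceed the resource budget or cannot
        improve the best max-cycle found so far."""
        best = {"max_cycle": float("inf"), "choice": None}

        def calc(i, dsp_used, bram_used, cur_max, picked):
            if i == len(layers):                 # complete assignment reached
                if cur_max < best["max_cycle"]:
                    best["max_cycle"], best["choice"] = cur_max, list(picked)
                return
            for cycle, dsp, bram in layers[i]:
                if (dsp_used + dsp > dsp_total or bram_used + bram > bram_total
                        or cycle >= best["max_cycle"]):
                    continue                     # prune this branch
                picked.append((cycle, dsp, bram))
                calc(i + 1, dsp_used + dsp, bram_used + bram,
                     max(cur_max, cycle), picked)
                picked.pop()

        calc(0, 0, 0, 0, [])
        return best
    ```

    The `cycle >= best["max_cycle"]` test mirrors the pruning condition in step (8): once some complete assignment achieves a given maximum cycle count, any branch whose current layer already needs at least that many cycles cannot improve the optimum.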

    Table 4  Resource utilization, computation volume, and execution time of the AlexNet accelerator on FPGAs of different specifications

    | FPGA model      | DSP count | R1    | R2    | Original computation volume | Compression ratio (%) | Execution time (s) |
    | Arria10 GT 1150 | 1518      | 0.987 | 0.989 | 5.683×10⁷                   | 99.892                | 1.544              |
    | KU060           | 2760      | 0.947 | 0.951 | 3.026×10⁸                   | 99.979                | 6.444              |
    | Virtex7 VX485T  | 2800      | 0.936 | 0.941 | 9.903×10⁸                   | 99.994                | 5.841              |
    | Virtex7 VX690T  | 3600      | 0.960 | 0.967 | 2.082×10⁸                   | 99.998                | 2.775              |
    | KCU1500         | 5520      | 0.955 | 0.962 | 5.772×10⁹                   | 99.999                | 8.115              |

    Table 5  Performance comparison of AlexNet accelerators

    | Item                           | Ref. [4]     | Ref. [11]      | Ref. [12]    | Ref. [8]     | This work    |
    | Quantization width             | 16-bit fixed | 16-bit fixed   | 16-bit fixed | 16-bit fixed | 16-bit fixed |
    | Frequency (MHz)                | 250/500      | 200            | 150          | 220          | 230          |
    | FPGA model                     | KCU1500      | Arria10 GX1150 | Zynq XC7Z045 | KCU1500      | KCU1500      |
    | Throughput (GOP/s)             | 2335.4       | 584.8          | 137.0        | 1633.0       | 2425.5       |
    | Power efficiency (GOP/s/W)     | 37.31        | N/A            | 14.21        | 72.31        | 62.35        |
    | Resource utilization (R1, R2)  | (0.42, 0.55) | (0.48, 0.48)   | (0.51, 0.59) | (0.67, 0.76) | (0.96, 0.96) |
  • [1] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278–2324. doi: 10.1109/5.726791
    [2] QU Xinyuan, HUANG Zhihong, XU Yu, et al. Cheetah: An accurate assessment mechanism and a high-throughput acceleration architecture oriented toward resource efficiency[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2021, 40(5): 878–891. doi: 10.1109/TCAD.2020.3011650
    [3] REGGIANI E, RABOZZI M, NESTOROV A M, et al. Pareto optimal design space exploration for accelerated CNN on FPGA[C]. 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Rio de Janeiro, Brazil, 2019: 107–114. doi: 10.1109/IPDPSW.2019.00028.
    [4] YU Xiaoyu, WANG Yuwei, MIAO Jie, et al. A data-center FPGA acceleration platform for convolutional neural networks[C]. 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 2019: 151–158. doi: 10.1109/FPL.2019.00032.
    [5] LIU Zhiqiang, CHOW P, XU Jinwei, et al. A uniform architecture design for accelerating 2D and 3D CNNs on FPGAs[J]. Electronics, 2019, 8(1): 65. doi: 10.3390/electronics8010065
    [6] LI Huimin, FAN Xitian, JIAO Li, et al. A high performance FPGA-based accelerator for large-scale convolutional neural networks[C]. 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Lausanne, Switzerland, 2016: 1–9. doi: 10.1109/FPL.2016.7577308.
    [7] QIU Jiantao, WANG Jie, YAO Song, et al. Going deeper with embedded FPGA platform for convolutional neural network[C]. The 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, California, USA, 2016: 26–35.
    [8] ZHANG Xiaofan, WANG Junsong, ZHU Chao, et al. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs[C]. 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Diego, USA, 2018: 1–8. doi: 10.1145/3240765.3240801.
    [9] LIU Zhiqiang, DOU Yong, JIANG Jingfei, et al. Automatic code generation of convolutional neural networks in FPGA implementation[C]. 2016 International Conference on Field-Programmable Technology (FPT), Xi’an, China, 2016: 61–68. doi: 10.1109/FPT.2016.7929190.
    [10] KRIZHEVSKY A, SUTSKEVER I, and HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84–90. doi: 10.1145/3065386
    [11] MA Yufei, CAO Yu, VRUDHULA S, et al. Optimizing the convolution operation to accelerate deep neural networks on FPGA[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018, 26(7): 1354–1367. doi: 10.1109/TVLSI.2018.2815603
    [12] GUO Kaiyuan, SUI Lingzhi, QIU Jiantao, et al. Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018, 37(1): 35–47. doi: 10.1109/TCAD.2017.2705069
    [13] ZHANG Chen, SUN Guangyu, FANG Zhenman, et al. Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019, 38(11): 2072–2085. doi: 10.1109/TCAD.2017.2785257
    [14] ZHANG Jialiang and LI Jing. Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network[C]. The 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, California, USA, 2017: 25–34. doi: 10.1145/3020078.3021698.
    [15] LIU Zhiqiang, DOU Yong, JIANG Jingfei, et al. Throughput-optimized FPGA accelerator for deep convolutional neural networks[J]. ACM Transactions on Reconfigurable Technology and Systems, 2017, 10(3): 17. doi: 10.1145/3079758
Figures (4) / Tables (5)
Publication history
  • Received: 2021-01-08
  • Revised: 2021-08-04
  • Available online: 2021-09-09
  • Published: 2022-04-18
