
SMCA: A Framework for Scaling Chiplet-Based Computing-in-Memory Accelerators

LI Wen, WANG Ying, HE Yintao, ZOU Kaiwei, LI Huawei, LI Xiaowei

Citation: LI Wen, WANG Ying, HE Yintao, ZOU Kaiwei, LI Huawei, LI Xiaowei. SMCA: A Framework for Scaling Chiplet-Based Computing-in-Memory Accelerators[J]. Journal of Electronics & Information Technology, 2024, 46(11): 4081-4091. doi: 10.11999/JEIT240284


doi: 10.11999/JEIT240284
Funds: The National Natural Science Foundation of China (62302283), The Basic Research Program of Shanxi Province (Exploration Research) (202303021212015)

    Author biographies:

    LI Wen: female, lecturer; research interests include fault-tolerant computing and integrated circuit design

    WANG Ying: male, professor; research interests include new EDA techniques, and processor and memory system architecture

    HE Yintao: female, Ph.D. candidate; research interests include computing-in-memory chips and domain-specific processor design

    ZOU Kaiwei: female, postdoctoral researcher; research interest is intelligent chip design

    LI Huawei: female, professor; research interests include VLSI testing and fault-tolerant computing

    LI Xiaowei: male, professor; research interests include hardware security and integrated circuit design automation

    Corresponding author:

    WANG Ying, wangying2009@ict.ac.cn

  • CLC number: TN40; TP389.1

  • Abstract: Computing-in-memory chips based on Resistive Random Access Memory (ReRAM) have become an efficient solution for accelerating deep learning applications. As intelligent applications keep evolving, ever-larger deep learning models place growing demands on the computing and storage resources of processing platforms. However, owing to the non-idealities of ReRAM devices, large-scale ReRAM-based computing chips face severe challenges of low yield and low reliability. Multi-chiplet integration, which packages several small chiplets into a single chip, improves yield and reduces manufacturing cost, and has become a dominant trend in chip design. Yet, compared with on-chip data transfer in monolithic chips, the costly communication between chiplets becomes the performance bottleneck of chiplet-integrated chips and limits the scaling of their compute capability. This paper therefore proposes SMCA, a framework for scaling chiplet-based computing-in-memory accelerators. Through adaptive partitioning of deep learning computation tasks and automated task deployment based on Satisfiability Modulo Theories (SMT), SMCA generates energy-efficient, low-communication-overhead workload schedules on chiplet-integrated deep learning accelerators, effectively improving system performance and energy efficiency. Experimental results show that, compared with existing strategies, the optimized schedules that SMCA automatically generates for deep learning tasks on integrated chips reduce inter-chiplet communication energy by 35%.
  • Figure 1  Illustration of performing convolution on a ReRAM crossbar array

    Figure 2  SMCA workflow

    Figure 3  Deep learning chip architecture integrating homogeneous computing-in-memory chiplets

    Figure 4  Even partitioning strategy for deep learning computation tasks

    Figure 5  Comparison of the CAP strategy and the CMP strategy

    Figure 6  Normalized NoP energy consumption

    Figure 7  Normalized NoP latency

    Figure 8  Comparison of NoP energy consumption on integrated chips with different chiplet sizes and system scales

    Algorithm 1  Adaptive layer-wise network partitioning strategy

     1: Input: fixed compute capacity $M$ of a single chiplet; compute demand
     $w({w_0},{w_1}, \cdots ,{w_{L - 1}})$ of the network $l({l_0},{l_1}, \cdots,{l_{L - 1}})$.
     2: Output: network partitioning strategy bestP.
     3: ${C_{{\text{idle}}}} = M$; /* initialize ${C_{{\text{idle}}}}$ */
     4: for $i = 0,1, \cdots ,L - 1$
     5:  if ${C_{{\text{idle}}}} \ge {w_i}$ then
     6:   ${\text{bestP}} \leftarrow {\text{NoPartition}}(i,{w_i})$;
     7:  else if $\left\lceil {\dfrac{{w_i}}{M}} \right\rceil == \left\lceil {\dfrac{{w_i} - {C_{\text{idle}}}}{M}} \right\rceil$ then
     8:   ${\text{bestP}} \leftarrow {\text{CMP}}(i,{w_i})$;
     9:  else
     10:   ${\text{bestP}} \leftarrow {\text{CAP}}(i,{w_i})$;
     11: Update(${C_{{\text{idle}}}}$)
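    For readers who prefer running code, the following Python sketch mirrors Algorithm 1. The branch conditions follow the pseudocode above; the listing does not spell out the bookkeeping inside NoPartition/CMP/CAP or the Update of $C_{\text{idle}}$ on line 11, so the versions below are illustrative assumptions rather than the paper's exact implementation.

    # Illustrative Python rendering of Algorithm 1 (assumptions noted in comments).
    from math import ceil

    def partition_network(M, w):
        """M: fixed compute capacity of one chiplet; w: per-layer compute demands."""
        bestP = []   # partition decision recorded for each layer
        C_idle = M   # idle capacity left on the chiplet currently being filled
        for i, wi in enumerate(w):
            if C_idle >= wi:
                # Layer fits into the leftover capacity: no partitioning needed.
                bestP.append(("NoPartition", i))
                C_idle -= wi
            elif ceil(wi / M) == ceil((wi - C_idle) / M):
                # Reusing the leftover capacity would not reduce the number of
                # chiplets this layer needs: take the CMP branch.
                bestP.append(("CMP", i))
                C_idle = (M - wi % M) % M                  # Update(C_idle): assumption
            else:
                # Reusing the leftover capacity saves at least one chiplet:
                # take the CAP branch.
                bestP.append(("CAP", i))
                C_idle = (M - (wi - C_idle) % M) % M       # Update(C_idle): assumption
        return bestP

    # Toy example: chiplet capacity 100; layer demands 60, 30, 150, 120.
    print(partition_network(100, [60, 30, 150, 120]))

    The toy example exercises all three branches: the first two layers fit into the current chiplet, the third gains nothing from the leftover capacity (CMP), and the fourth saves a chiplet by reusing it (CAP).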

    Table 1  Notation used in the SMT constraints

    Symbol  Meaning
    $ {T},{E},{C} $  set of computation tasks, set of edges in the computation graph, and set of chiplets packaged in the chip
    $ t,c $  computation task $ t $, chiplet $ c $
    $ {e}_{i,j} $  directed edge from task $ i $ to task $ j $ in the computation graph
    $ {x}^{c},\;{y}^{c} $  $ \left(x,y\right) $ coordinates of chiplet $ c $ on the chip
    $ {w}^{t} $  compute demand of task $ t $
    $ {o}^{t} $  amount of intermediate data produced by task $ t $
    $ {s}^{t} $  start time of task $ t $
    $ {d}^{t} $  minimum inter-chiplet data transfer cost required to finish all predecessor tasks of task $ t $
    $ {\tau }^{t} $  execution time of task $ t $
    $ \mathrm{s}{\mathrm{w}}^{c} $  wavefront index of chiplet $ c $
    $ \mathrm{d}\mathrm{i}\mathrm{s}({c}_{i},{c}_{j}) $  distance from chiplet $ i $ to chiplet $ j $
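    To make the notation above concrete, the sketch below shows one plausible way such constraints could be written with the Z3 SMT solver's Python bindings: each task is given chiplet coordinates $(x, y)$, every edge $ {e}_{i,j} $ forces task $ j $ to start after task $ i $ finishes and its data arrives, and the objective sums $ {o}^{i}\cdot \mathrm{dis}({c}_{i},{c}_{j}) $ over edges. The toy constants, the Manhattan-distance model, the distinct-placement constraint, and the transfer-delay term are illustrative assumptions; the paper's full constraint set (e.g., the wavefront variables $ \mathrm{s}{\mathrm{w}}^{c} $ and the capacity terms $ {w}^{t} $) is not reproduced here.

    # Illustrative SMT encoding with the z3-solver Python bindings (assumptions noted).
    from z3 import Int, Optimize, If, Or, Sum, sat

    def manhattan(x1, y1, x2, y2):
        # dis(c_i, c_j): hop distance between two chiplet positions on a 2D mesh.
        return If(x1 >= x2, x1 - x2, x2 - x1) + If(y1 >= y2, y1 - y2, y2 - y1)

    tasks = [0, 1, 2]                 # T: computation tasks (toy example)
    edges = [(0, 1), (1, 2)]          # E: edges of the computation graph
    o = {0: 4, 1: 2, 2: 1}            # o^t: intermediate data produced by task t
    tau = {0: 3, 1: 2, 2: 2}          # tau^t: execution time of task t
    MESH = 2                          # chiplets arranged as a 2x2 mesh (set C)

    opt = Optimize()
    x = {t: Int(f"x_{t}") for t in tasks}   # x coordinate of the chiplet running t
    y = {t: Int(f"y_{t}") for t in tasks}   # y coordinate of the chiplet running t
    s = {t: Int(f"s_{t}") for t in tasks}   # s^t: start time of task t

    for t in tasks:
        opt.add(x[t] >= 0, x[t] < MESH, y[t] >= 0, y[t] < MESH, s[t] >= 0)

    # Assumption: with weight-stationary CIM, each task's weights occupy their own
    # chiplet, so tasks are forced onto distinct positions.
    for a in range(len(tasks)):
        for b in range(a + 1, len(tasks)):
            i, j = tasks[a], tasks[b]
            opt.add(Or(x[i] != x[j], y[i] != y[j]))

    for i, j in edges:
        # A successor starts only after its predecessor finishes and the data
        # arrives; transfer delay is modelled simply as distance * o^i.
        opt.add(s[j] >= s[i] + tau[i] + manhattan(x[i], y[i], x[j], y[j]) * o[i])

    # Objective: total inter-chiplet traffic, sum over edges of o^i * dis(c_i, c_j).
    opt.minimize(Sum([o[i] * manhattan(x[i], y[i], x[j], y[j]) for i, j in edges]))

    if opt.check() == sat:
        model = opt.model()
        for t in tasks:
            print(t, model[x[t]], model[y[t]], model[s[t]])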

    Table 2  System configuration

    Architecture level  Attribute                            Value
    Package             Frequency                            1.8 GHz
                        Inter-chiplet network bandwidth      100 Gb/s
                        Inter-chiplet communication energy   1.75 pJ/bit
    Chiplet             Process node                         16 nm
                        Compute cores per chiplet            16
                        ReRAM crossbars per compute core     16
    Compute core        ReRAM crossbar size                  128 $ \times $ 128
                        ADC resolution                       1 bit
                        DAC resolution                       8 bit
                        Bits stored per ReRAM cell           2
                        Weight precision                     8 bit
                        Dataflow                             Weight stationary
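    As a worked example of what the Table 2 parameters imply, the snippet below (variable names are our own, and the arithmetic ignores redundancy, replication, and peripheral overhead) estimates the weight capacity of one chiplet: at 2 bits per cell an 8-bit weight occupies 4 cells, so a 128 x 128 crossbar holds 4,096 weights, and a chiplet with 16 cores of 16 crossbars each holds roughly 1M 8-bit weights.

    # Values copied from Table 2; the capacity estimate is an illustrative assumption.
    CONFIG = {
        "package_frequency_GHz": 1.8,
        "nop_bandwidth_Gbps": 100,        # inter-chiplet network bandwidth
        "nop_energy_pJ_per_bit": 1.75,    # inter-chiplet communication energy
        "process_nm": 16,
        "cores_per_chiplet": 16,
        "crossbars_per_core": 16,
        "crossbar_rows": 128,
        "crossbar_cols": 128,
        "bits_per_cell": 2,
        "weight_bits": 8,
        "dataflow": "weight stationary",
    }

    def weights_per_chiplet(cfg):
        # One 8-bit weight needs 8 / 2 = 4 ReRAM cells at 2 bits per cell.
        cells_per_weight = cfg["weight_bits"] // cfg["bits_per_cell"]
        weights_per_crossbar = cfg["crossbar_rows"] * cfg["crossbar_cols"] // cells_per_weight
        return weights_per_crossbar * cfg["crossbars_per_core"] * cfg["cores_per_chiplet"]

    print(weights_per_chiplet(CONFIG))    # 1,048,576 8-bit weights per chiplet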
Publication history
  • Received: 2024-04-16
  • Revised: 2024-09-13
  • Published online: 2024-09-30
  • Issue published: 2024-11-01
