面向眾核密碼處理器的高效負載均衡技術(shù)

戴紫彬; 尹安琪; 曲彤洲; 南龍梅

doi:10.11999/JEIT180623

面向眾核密碼處理器的高效負載均衡技術(shù)

doi: 10.11999/JEIT180623

戴紫彬¹,
尹安琪^1, ,,
曲彤洲¹,
南龍梅^{1, 2}

1.
解放軍信息工程大學(xué) ??鄭州 ??450001
2.
復(fù)旦大學(xué)專用集成電路與系統(tǒng)國家重點實驗室 ??上海 ??201203

詳細信息

作者簡介:
戴紫彬：男，1966年生，教授，博士生導(dǎo)師，研究方向為可重構(gòu)計算與安全專用芯片設(shè)計

尹安琪：女，1995年生，碩士生，研究方向為可重構(gòu)計算與信息安全

曲彤洲：男，1994年生，碩士生，研究方向為可重構(gòu)計算與信息安全

南龍梅：女，1981年生，講師，博士，研究方向為可重構(gòu)安全芯片設(shè)計

通訊作者:
尹安琪　yinaq0222@foxmail.com

中圖分類號: TP338.6
計量
- 文章訪問數(shù): 1568
- HTML全文瀏覽量: 525
- PDF下載量: 60
- 被引次數(shù): 0
出版歷程
- 收稿日期: 2018-06-26
- 修回日期: 2018-11-27
- 網(wǎng)絡(luò)出版日期: 2018-12-03
- 刊出日期: 2019-02-01

Efficient Workload Balance Technology on Many-core Crypto Processor

Zibin DAI¹,
Anqi YIN^{1
, ,},
Tongzhou QU¹,
Longmei NAN^{1, 2}

1.
The PLA Information Engineering University, Zhengzhou 450001, China
2.
State Key Laboratory of ASIC and System, Fudan University, Shanghai 201203, China

摘要

摘要:
工作負載分配不均是制約眾核密碼平臺資源利用率提高的重要因素，動態(tài)負載分配可提高平臺資源利用率，但具有一定開銷；所以更高的負載均衡頻率并不一定帶來更高的負載均衡增益。因此，該文建立了關(guān)于負載均衡增益率與負載均衡頻率的數(shù)學(xué)模型?；谀Ｐ?，提出一種面向眾核密碼平臺的無沖突負載均衡策略和一種基于硬件作業(yè)隊列的“可擴展-可移植”負載均衡引擎——“簇間微網(wǎng)絡(luò)-簇內(nèi)環(huán)陣列”。實驗證明：在性能、延時功耗積、資源利用率和負載均衡度方面，該文設(shè)計的負載均衡引擎與基于“作業(yè)竊取”的軟件技術(shù)相比平均優(yōu)化約4.06倍、7.17倍、23.01%和2.15倍；與基于“作業(yè)竊取”的硬件技術(shù)相比約優(yōu)化1.75倍、2.45倍、10.2%、和1.41倍；與理想硬件技術(shù)相比，密碼算法吞吐率平均只降低了約5.67%(最低3%)。實驗結(jié)果表明該文技術(shù)具有良好的可擴展性和可移植性。
- 眾核密碼處理器 /
- 負載均衡策略 /
- 負載均衡引擎 /
- 無沖突
Abstract:
Imbalanced workload distribution results in low resource utilization of many-core crypto-platform. Dynamic workload allocation can improve the resource utilization with some overhead. Therefore, a higher frequency of workload balancing is not equivalent to higher gains. This paper establishes a mathematical model for gain rate and frequency of workload balancing. Based on this model, a collision-free workload balancing policy is proposed for many-core crypto systems, and a hierarchical "expandable-portable" engine is put forward, which consists of "Inter-cluster micro-network and intra-cluster ring-array" adopting hardware job queue technology. Experiment results show that the proposed workload-balancing engine is 4.06, 7.17, 23.01% and 2.15 times higher than the software technology based on " job stealing” in terms of performance, delay power consumption, resource utilization and workload balance; 1.75, 2.45, 10.2%, and 1.41 times better compared with the hardware technology based on "job stealing". By contrast with the ideal hardware technology, the average throughput of encryption algorithms is only decreased by 5.67% (the lowest 3%). The experiment also proves the scalability and portability of the proposed technique.
- Many-core crypto processor /
- Workload balance strategy /
- Workload balance engine /
- Collision-free

HTML全文

圖 1 眾核密碼處理器架構(gòu)模型

下載: 全尺寸圖片幻燈片

圖 2 簇內(nèi)作業(yè)隊列架構(gòu)

下載: 全尺寸圖片幻燈片

圖 3 全局及局部均衡操作執(zhí)行流程

下載: 全尺寸圖片幻燈片

圖 4 簇間負載均衡微網(wǎng)絡(luò)

下載: 全尺寸圖片幻燈片

圖 5 性能對比圖

下載: 全尺寸圖片幻燈片

圖 6 延時功耗積及資源利用率、負載均衡度對比圖

下載: 全尺寸圖片幻燈片

表 1 無沖突負載均衡算法

Require:${L_{{\rm{td}}}},{N_{{\rm{cl}}}},{L_{{\rm{dt}}}},{T_{{\rm{cp}}}},{\rm{Thpu}}{{\rm{t}}_{{\rm{SM}}}},{\rm{Thpu}}{{\rm{t}}_{{\rm{Nt}}}},{S_{\rm{c}}}$//簇間均衡操作集合

　1 Assign ${\rm{Balance\_c}} \leftarrow {\rm{True,}}{{{N}}_2} \leftarrow {\rm{Num}}[{S_{\rm{c}}}],{n_2} \leftarrow 0,{t_2} \leftarrow {T_{{\rm{op}}}}$
　　 ${t_3} \leftarrow ({L_{{\rm{td}}}} + {L_{{\rm{dt}}}})/{\rm{Thpu}}{{\rm{t}}_{{\rm{SM}}}},{t_4} \leftarrow ({L_{{\rm{td}}}} + {L_{{\rm{dt}}}})/{\rm{Thpu}}{{\rm{t}}_{Nt}}$

　2 while new ${\rm{job[}}k{\rm{][}}i{\rm{] = = True}}$ do//更新負載情況

　3 ${t_{\rm{1}}} \leftarrow {L_{{\rm{tdki}}}} \cdot {N_{{\rm{clki}}}} \cdot {L_{{\rm{dtki}}}};{\rm{Addc}}[k] \leftarrow {\rm{Addc}}[k] + {t_1};$

　4 end while

　5 while ${\rm{Balance\_c}} = = {\rm{True}}$ do//簇間負載均衡

　6 for $k \leftarrow 0$ to ${N_2}$ do

　7 for $w \leftarrow 0$ to ${N_2}$ do

　8 if $w \ne k$ then

　9 if ${\rm{Thpu}}{{\rm{t}}_{{\rm{SM}}}} \ge {\rm{Thpu}}{{\rm{t}}_{Nt}}$ then

　10 if ${\rm{Addp[}}k{\rm{] - Addp[}}w{\rm{] - }}{t_{\rm{1}}}{\rm{ > }}{t_{2}}{\rm{ + }}{t_{\rm{4}}}$ then//式(13)

　11 ${\rm{Balancep\_Request[}}w{\rm{][}}k{]} \leftarrow {\rm{True;}}$

　12 ${\rm{Balancec\_Request[}}w{]} = {\rm{Balancec\_Request[}}w{]} + 1;$

　13 end if

　14 else then

　15 if ${\rm{Addp[}}k{\rm{] - Addp[}}w{\rm{] - }}{t_{\rm{1}}} > {t_2} + {t_3}$ then//式(13)

　16 ${\rm{Balancep\_Request[}}w{\rm{][}}k{]} \leftarrow {\rm{True;}}$

　17 ${\rm{Balancec\_Request[}}w{]} = {\rm{Balancec\_Request[}}w{]} + 1;$

　18 end if

　19 end else

　20 end for

　21 if ${\rm{Balancec\_Request[}}w{]} > 1$ then

　22 ${\rm{choose }}\ p{\rm{[}}{{\rm{n}}_{2}}{\rm{];}}{n_2} \leftarrow {(}{n_{2}}{\rm{ + 1)mol}}{N_{2}};$

　23 end if

　24 end for

　25 end while

下載: 導(dǎo)出CSV

表 2 仿真參數(shù)

參數(shù)	值
核心數(shù)目	4～64
本地緩存大小	8 kB
共享緩存大小	16 MB
簇內(nèi)互連方式	二向環(huán)
簇間互連方式	2D-mesh
共享訪問延時/本地訪問延時	3

下載: 導(dǎo)出CSV

表 3 測試基準(zhǔn)說明

測試基準(zhǔn)	算法分類	算法特征 (bit)	作業(yè)數(shù)目	平均作業(yè) 大小(kB)	測試基準(zhǔn)	算法分類	算法特征 (bit)	作業(yè)數(shù)目	平均作業(yè) 大小(kB)	備注
DES	分組	64	1024	54.4	A5	序列	1	1024	41.1	算法特征表示分組/雜湊算法的處理位寬或者序列算法的輸出位寬
AES	分組	128	1024	46.3	ZUC	序列	32	1024	50.4
SM4	分組	512	1024	32.2	RC4	序列	8	1024	47.3
SHA256	雜湊	512	1024	44.6	RSA	公鑰	—	1024	40.3
SM3	雜湊	512	1024	38.8	SM2	公鑰	—	1024	22.8
SHA-1	雜湊	512	1024	33.2

下載: 導(dǎo)出CSV

參考文獻(15)

KIM C and HUH J. Exploring the design space of fair scheduling supports for asymmetric multicore systems[J]. IEEE Transactions on Computers, 2018, 67(8): 1136–1152. doi: 10.1109/TC.2018.2796077

KIM K W, CHO Y, EO J, et al. System-wide time versus density tradeoff in real-time multicore fluid scheduling[J]. IEEE Transactions on Computers, 2018, 67(7): 1007–1022. doi: 10.1109/TC.2018.2793919

LEE J, NICOPOULOS C, LEE H G, et al. IsoNet: Hardware-based job queue management for many core architectures[J]. IEEE Transactions on Very Large Scale Integration Systems, 2013, 21(6): 1080–1093. doi: 10.1109/TVLSI.2012.2202699

KUMAR S, HUGHES C J and NGUYEN A. Carbon: Architectural support for fine-grained parallelism on chip multiprocessors[C]. International Symposium on Computer Architecture, California, USA, 2007: 162–173.

CHEN J, JUANG P, KO K, et al. Hardware-modulated parallelism in chip multiprocessors[J]. ACM Sigarch Computer Architecture News, 2005, 33(4): 54–63. doi: 10.1145/1105734.1105742

LEE J, NICOPOULOS C, LEE Y, et al. Hardware-based job queue management for manycore architectures and OpenMP environments[J]. IEEE International Parallel & Distributed Processing Symposium, 2011, 21(6): 407–418. doi: 10.1109/IPDPS.2011.47

劉宗斌, 馬原, 荊繼武, 等. SM3哈希算法的硬件實現(xiàn)與研究[J]. 信息網(wǎng)絡(luò)安全, 2011, 59(9): 191–193. doi: 10.3969/j.issn.1671-1122.2011.09.059

LIU Zongbin, MA Yuan, JING Jiwu, et al. Implementation of SM3 hash function on FPGA[J]. Information Network Security, 2011, 59(9): 191–193. doi: 10.3969/j.issn.1671-1122.2011.09.059

徐金甫, 楊宇航. SM4算法在粗粒度陣列平臺的并行化映射[J]. 電子技術(shù)應(yīng)用, 2017, 43(4): 39–42.

XU Jinfu and YANG Yuhang. Parallel mapping of SM4 algorithm on coarse-grained array platform[J]. Electronic Technology Application, 2017, 43(4): 39–42.

DUBEY P. Recognition, mining and synthesis moves computers to the era of tera[J]. Technology@Intel Magazine, 2005, 9(2): 1–10.

RATTNER J. Cool codes for hot chips: A quantitative basis for multi-core design[C]. Hot Chips 18 Symposium IEEE, Cupertino, USA, 2016: 1–28.

AN H, TAURA K, and SPOTTER D. A tool for spotting scheduler-caused delays in task parallel runtime systems[C]. IEEE International Conference on CLUSTER Computing, Hawaii, USA, 2017: 114–125.

KWOK Y K and AHMAD I. Static scheduling algorithms for allocating directed task graphs to multiprocessors[J]. ACM Computing Surveys, 1999, 31(4): 406–471. doi: 10.1145/344588.344618

TITHI J J, MATANI D, MENGHANI G, et al. Avoiding locks and atomic instructions in shared-memory parallel BFS using optimistic parallelization[C]. Parallel and Distributed Processing Symposium Workshops & Phd Forum IEEE, Cambridge, UK, 2013: 1628–1637.

MOON S W, REXFORD J, and SHIN K G. Scalable hardware priority queue architectures for high-speed packet switches[J]. IEEE Transactions on Computers, 2000, 49(11): 1215–1227. doi: 10.1109/RTTAS.1997.601359

CHEN Quan, ZHENG Long, and GUO Minyi. Adaptive Demand-aware Work-stealing in Multi-programmed Multi-core Architectures[J]. Concurrency and Computation: practice & Experience, 2016, 28(2): 455–471. doi: 10.1002.cpe.3619

相關(guān)文章

施引文獻

資源附件(0)

訪問統(tǒng)計