面向眾核密碼處理器的高效負載均衡技術(shù)
doi: 10.11999/JEIT180623
-
1.
解放軍信息工程大學(xué) ??鄭州 ??450001
-
2.
復(fù)旦大學(xué)專用集成電路與系統(tǒng)國家重點實驗室 ??上海 ??201203
Efficient Workload Balance Technology on Many-core Crypto Processor
-
1.
The PLA Information Engineering University, Zhengzhou 450001, China
-
2.
State Key Laboratory of ASIC and System, Fudan University, Shanghai 201203, China
-
摘要:
工作負載分配不均是制約眾核密碼平臺資源利用率提高的重要因素,動態(tài)負載分配可提高平臺資源利用率,但具有一定開銷;所以更高的負載均衡頻率并不一定帶來更高的負載均衡增益。因此,該文建立了關(guān)于負載均衡增益率與負載均衡頻率的數(shù)學(xué)模型?;谀P?,提出一種面向眾核密碼平臺的無沖突負載均衡策略和一種基于硬件作業(yè)隊列的“可擴展-可移植”負載均衡引擎——“簇間微網(wǎng)絡(luò)-簇內(nèi)環(huán)陣列”。實驗證明:在性能、延時功耗積、資源利用率和負載均衡度方面,該文設(shè)計的負載均衡引擎與基于“作業(yè)竊取”的軟件技術(shù)相比平均優(yōu)化約4.06倍、7.17倍、23.01%和2.15倍;與基于“作業(yè)竊取”的硬件技術(shù)相比約優(yōu)化1.75倍、2.45倍、10.2%、和1.41倍;與理想硬件技術(shù)相比,密碼算法吞吐率平均只降低了約5.67%(最低3%)。實驗結(jié)果表明該文技術(shù)具有良好的可擴展性和可移植性。
Abstract:Imbalanced workload distribution results in low resource utilization of many-core crypto-platform. Dynamic workload allocation can improve the resource utilization with some overhead. Therefore, a higher frequency of workload balancing is not equivalent to higher gains. This paper establishes a mathematical model for gain rate and frequency of workload balancing. Based on this model, a collision-free workload balancing policy is proposed for many-core crypto systems, and a hierarchical "expandable-portable" engine is put forward, which consists of "Inter-cluster micro-network and intra-cluster ring-array" adopting hardware job queue technology. Experiment results show that the proposed workload-balancing engine is 4.06, 7.17, 23.01% and 2.15 times higher than the software technology based on " job stealing” in terms of performance, delay power consumption, resource utilization and workload balance; 1.75, 2.45, 10.2%, and 1.41 times better compared with the hardware technology based on "job stealing". By contrast with the ideal hardware technology, the average throughput of encryption algorithms is only decreased by 5.67% (the lowest 3%). The experiment also proves the scalability and portability of the proposed technique.
-
表 1 無沖突負載均衡算法
Require:${L_{{\rm{td}}}},{N_{{\rm{cl}}}},{L_{{\rm{dt}}}},{T_{{\rm{cp}}}},{\rm{Thpu}}{{\rm{t}}_{{\rm{SM}}}},{\rm{Thpu}}{{\rm{t}}_{{\rm{Nt}}}},{S_{\rm{c}}}$//簇間均衡操作集合 1 Assign ${\rm{Balance\_c}} \leftarrow {\rm{True,}}{{{N}}_2} \leftarrow {\rm{Num}}[{S_{\rm{c}}}],{n_2} \leftarrow 0,{t_2} \leftarrow {T_{{\rm{op}}}}$
${t_3} \leftarrow ({L_{{\rm{td}}}} + {L_{{\rm{dt}}}})/{\rm{Thpu}}{{\rm{t}}_{{\rm{SM}}}},{t_4} \leftarrow ({L_{{\rm{td}}}} + {L_{{\rm{dt}}}})/{\rm{Thpu}}{{\rm{t}}_{Nt}}$2 while new ${\rm{job[}}k{\rm{][}}i{\rm{] = = True}}$ do//更新負載情況 3 ${t_{\rm{1}}} \leftarrow {L_{{\rm{tdki}}}} \cdot {N_{{\rm{clki}}}} \cdot {L_{{\rm{dtki}}}};{\rm{Addc}}[k] \leftarrow {\rm{Addc}}[k] + {t_1};$ 4 end while 5 while ${\rm{Balance\_c}} = = {\rm{True}}$ do//簇間負載均衡 6 for $k \leftarrow 0$ to ${N_2}$ do 7 for $w \leftarrow 0$ to ${N_2}$ do 8 if $w \ne k$ then 9 if ${\rm{Thpu}}{{\rm{t}}_{{\rm{SM}}}} \ge {\rm{Thpu}}{{\rm{t}}_{Nt}}$ then 10 if ${\rm{Addp[}}k{\rm{] - Addp[}}w{\rm{] - }}{t_{\rm{1}}}{\rm{ > }}{t_{2}}{\rm{ + }}{t_{\rm{4}}}$ then//式(13) 11 ${\rm{Balancep\_Request[}}w{\rm{][}}k{]} \leftarrow {\rm{True;}}$ 12 ${\rm{Balancec\_Request[}}w{]} = {\rm{Balancec\_Request[}}w{]} + 1;$ 13 end if 14 else then 15 if ${\rm{Addp[}}k{\rm{] - Addp[}}w{\rm{] - }}{t_{\rm{1}}} > {t_2} + {t_3}$ then//式(13) 16 ${\rm{Balancep\_Request[}}w{\rm{][}}k{]} \leftarrow {\rm{True;}}$ 17 ${\rm{Balancec\_Request[}}w{]} = {\rm{Balancec\_Request[}}w{]} + 1;$ 18 end if 19 end else 20 end for 21 if ${\rm{Balancec\_Request[}}w{]} > 1$ then 22 ${\rm{choose }}\ p{\rm{[}}{{\rm{n}}_{2}}{\rm{];}}{n_2} \leftarrow {(}{n_{2}}{\rm{ + 1)mol}}{N_{2}};$ 23 end if 24 end for 25 end while 下載: 導(dǎo)出CSV
表 2 仿真參數(shù)
參數(shù) 值 核心數(shù)目 4~64 本地緩存大小 8 kB 共享緩存大小 16 MB 簇內(nèi)互連方式 二向環(huán) 簇間互連方式 2D-mesh 共享訪問延時/本地訪問延時 3 下載: 導(dǎo)出CSV
表 3 測試基準(zhǔn)說明
測試基準(zhǔn) 算法分類 算法特征
(bit)作業(yè)數(shù)目 平均作業(yè)
大小(kB)測試基準(zhǔn) 算法分類 算法特征
(bit)作業(yè)數(shù)目 平均作業(yè)
大小(kB)備注 DES 分組 64 1024 54.4 A5 序列 1 1024 41.1 算法特征表示分組/雜湊算法的處理位寬或者序列算法的輸出位寬 AES 分組 128 1024 46.3 ZUC 序列 32 1024 50.4 SM4 分組 512 1024 32.2 RC4 序列 8 1024 47.3 SHA256 雜湊 512 1024 44.6 RSA 公鑰 — 1024 40.3 SM3 雜湊 512 1024 38.8 SM2 公鑰 — 1024 22.8 SHA-1 雜湊 512 1024 33.2 下載: 導(dǎo)出CSV
-
KIM C and HUH J. Exploring the design space of fair scheduling supports for asymmetric multicore systems[J]. IEEE Transactions on Computers, 2018, 67(8): 1136–1152. doi: 10.1109/TC.2018.2796077 KIM K W, CHO Y, EO J, et al. System-wide time versus density tradeoff in real-time multicore fluid scheduling[J]. IEEE Transactions on Computers, 2018, 67(7): 1007–1022. doi: 10.1109/TC.2018.2793919 LEE J, NICOPOULOS C, LEE H G, et al. IsoNet: Hardware-based job queue management for many core architectures[J]. IEEE Transactions on Very Large Scale Integration Systems, 2013, 21(6): 1080–1093. doi: 10.1109/TVLSI.2012.2202699 KUMAR S, HUGHES C J and NGUYEN A. Carbon: Architectural support for fine-grained parallelism on chip multiprocessors[C]. International Symposium on Computer Architecture, California, USA, 2007: 162–173. CHEN J, JUANG P, KO K, et al. Hardware-modulated parallelism in chip multiprocessors[J]. ACM Sigarch Computer Architecture News, 2005, 33(4): 54–63. doi: 10.1145/1105734.1105742 LEE J, NICOPOULOS C, LEE Y, et al. Hardware-based job queue management for manycore architectures and OpenMP environments[J]. IEEE International Parallel & Distributed Processing Symposium, 2011, 21(6): 407–418. doi: 10.1109/IPDPS.2011.47 劉宗斌, 馬原, 荊繼武, 等. SM3哈希算法的硬件實現(xiàn)與研究[J]. 信息網(wǎng)絡(luò)安全, 2011, 59(9): 191–193. doi: 10.3969/j.issn.1671-1122.2011.09.059LIU Zongbin, MA Yuan, JING Jiwu, et al. Implementation of SM3 hash function on FPGA[J]. Information Network Security, 2011, 59(9): 191–193. doi: 10.3969/j.issn.1671-1122.2011.09.059 徐金甫, 楊宇航. SM4算法在粗粒度陣列平臺的并行化映射[J]. 電子技術(shù)應(yīng)用, 2017, 43(4): 39–42.XU Jinfu and YANG Yuhang. Parallel mapping of SM4 algorithm on coarse-grained array platform[J]. Electronic Technology Application, 2017, 43(4): 39–42. DUBEY P. Recognition, mining and synthesis moves computers to the era of tera[J]. Technology@Intel Magazine, 2005, 9(2): 1–10. RATTNER J. Cool codes for hot chips: A quantitative basis for multi-core design[C]. Hot Chips 18 Symposium IEEE, Cupertino, USA, 2016: 1–28. AN H, TAURA K, and SPOTTER D. A tool for spotting scheduler-caused delays in task parallel runtime systems[C]. IEEE International Conference on CLUSTER Computing, Hawaii, USA, 2017: 114–125. KWOK Y K and AHMAD I. Static scheduling algorithms for allocating directed task graphs to multiprocessors[J]. ACM Computing Surveys, 1999, 31(4): 406–471. doi: 10.1145/344588.344618 TITHI J J, MATANI D, MENGHANI G, et al. Avoiding locks and atomic instructions in shared-memory parallel BFS using optimistic parallelization[C]. Parallel and Distributed Processing Symposium Workshops & Phd Forum IEEE, Cambridge, UK, 2013: 1628–1637. MOON S W, REXFORD J, and SHIN K G. Scalable hardware priority queue architectures for high-speed packet switches[J]. IEEE Transactions on Computers, 2000, 49(11): 1215–1227. doi: 10.1109/RTTAS.1997.601359 CHEN Quan, ZHENG Long, and GUO Minyi. Adaptive Demand-aware Work-stealing in Multi-programmed Multi-core Architectures[J]. Concurrency and Computation: practice & Experience, 2016, 28(2): 455–471. doi: 10.1002.cpe.3619 -