Design and Implementation of a High-Throughput Dual-Mode Floating-Point Reconfigurable FFT Processor

WEI Xing; HUANG Zhihong; YANG Haigang

doi:10.11999/JEIT180170

Xing WEI^{1, 2},
Zhihong HUANG¹,
Haigang YANG^{1, 2}

1. Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China
2. University of Chinese Academy of Sciences, Beijing 100190, China

Funding: National Natural Science Foundation of China (61704173, 61474120); Beijing Major Science and Technology Program (Z171100000117019)

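Details

About the authors:
WEI Xing: male, born in 1991, Ph.D.; research interests include hardware acceleration of algorithms and architecture design of reconfigurable computing chips.

HUANG Zhihong: male, born in 1984, assistant researcher; research interests include programmable logic architecture design and the development of novel chip architectures for convolutional neural networks.

YANG Haigang: male, born in 1960, researcher; research interests include mixed-signal (analog-digital) integrated circuit design and very-large-scale integrated circuit design.

Corresponding author:
YANG Haigang, yanghg@mail.ie.ac.cn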

Chinese Library Classification (CLC) number: TN47
Publication history
- Received: 2018-02-08
- Revised: 2018-07-05
- Published online: 2018-07-24
- Issue date: 2018-12-01

High Throughput Dual-mode Reconfigurable Floating-point FFT Processor

Xing WEI^{1, 2},
Zhihong HUANG¹,
Haigang YANG^{1, 2}

1.
Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China
2.
University of Chinese Academy of Sciences, Beijing 100190, China

Funds: The National Natural Science Foundation of China (61704173, 61474120), The Major Program of Beijing Science and Technology (Z171100000117019)

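Abstract

Abstract: A high-throughput, flexibly reconfigurable floating-point fast Fourier transform (FFT) processor can serve applications such as real-time imaging in advanced radar systems and high-precision scientific computing. Floating-point arithmetic is more complex than fixed-point arithmetic, which sharpens the conflict between the throughput of a floating-point FFT and its area and power consumption. To reduce the computational complexity, a large FFT is first decomposed into several small FFTs implemented as cascaded radix-2^k sub-stages, and optimized mixed-radix algorithms are proposed for 128/256/512/1024/2048-point FFTs. In addition, new fused add-subtract and dot-product units are proposed that support two floating-point modes, single-channel single precision and dual-channel half precision. Building on these units, a high-throughput dual-mode floating-point variable-length FFT processor architecture is proposed for the first time, and it is designed and implemented in a standard 28 nm CMOS process. Experimental results show a throughput of 3.478 GSample/s with an average output signal-to-quantization-noise ratio of 135 dB in single-channel single-precision mode, and 6.957 GSample/s with 60 dB in dual-channel half-precision mode. The normalized throughput-area ratio is about 12 times higher than that of other existing floating-point FFT implementations.
Keywords:
- Fast Fourier transform
- Dual-mode floating point
- Mixed radix
- Fused arithmetic unit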
Abstract: High-throughput, reconfigurable Floating-Point (FP) FFT accelerators are important for advanced applications such as real-time radar imaging and high-precision scientific computing. Because FP operations are more complex than fixed-point ones, achieving high FP FFT throughput at low area and power cost is particularly challenging. To address this, a series of mixed-radix algorithms for 128/256/512/1024/2048-point FFTs is proposed. They decompose a long FFT into short FFTs implemented as cascaded radix-2^k stages, which significantly reduces the multiplication complexity. In addition, two novel fused FP units, an add-subtract unit and a dot-product unit, are proposed with dual-mode functionality: each can operate either on one set of single-precision operands or on two sets of half-precision operands in parallel. On this basis, a high-throughput dual-mode floating-point variable-length FFT processor is designed and implemented in SMIC 28 nm CMOS technology. Simulation results show that the throughput and Signal-to-Quantization-Noise Ratio (SQNR) are 3.478 GSample/s and 135 dB in single-channel single-precision mode, and 6.957 GSample/s and 60 dB in dual-channel half-precision mode. Compared with other FP FFT implementations, the processor achieves about a 12-fold improvement in normalized throughput-area ratio.
Keywords:
- Fast Fourier Transform (FFT)
- Dual-mode floating point
- Mixed-radix
- Fused arithmetic unit
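
The dual-mode idea described in the abstract, one single-precision operand or two half-precision operands sharing the same datapath, can be illustrated in software with a 32-bit word. The following is only a minimal sketch: the function names are hypothetical, and the bit layout (first half-precision value in the upper 16 bits) is an assumption, not the format shown in the paper's Figure 2.

```python
import struct


def pack_single(x):
    """Return the 32-bit word holding one IEEE 754 binary32 value."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def pack_dual_half(hi, lo):
    """Return a 32-bit word holding two IEEE 754 binary16 values.

    Assumed layout (hypothetical): `hi` in bits 31..16, `lo` in bits 15..0.
    """
    hi_bits = struct.unpack("<H", struct.pack("<e", hi))[0]
    lo_bits = struct.unpack("<H", struct.pack("<e", lo))[0]
    return (hi_bits << 16) | lo_bits


def unpack_dual_half(word):
    """Split a 32-bit word back into its two binary16 values."""
    hi = struct.unpack("<e", struct.pack("<H", (word >> 16) & 0xFFFF))[0]
    lo = struct.unpack("<e", struct.pack("<H", word & 0xFFFF))[0]
    return hi, lo


if __name__ == "__main__":
    print(hex(pack_single(1.0)))           # one single-precision channel
    print(hex(pack_dual_half(1.0, -2.0)))  # two half-precision channels
```

Because each 32-bit word carries two samples in the half-precision mode, the reported throughput doubles from 3.478 GSample/s to 6.957 GSample/s at the same clock, while the shorter significand accounts for the lower SQNR.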

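Full text (HTML)

Figure 1 Top-level block diagram of the dual-mode floating-point 128/256/512/1024/2048-point FFT

Figure 2 Dual-mode floating-point data formats

Figure 3 Structure of the dual-mode floating-point fused add-subtract unit

Figure 4 Structure of the dual-mode leading-zero anticipation circuit

Figure 5 Structure of the dual-mode floating-point fused dot-product unit

Figure 6 Structure of the dual-mode 4:2 CSA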

Table 1 Mixed-radix algorithms proposed in this paper

FFT size (points)	Optimized algorithm	Base at each sub-stage
FFT size (points)	Optimized algorithm	1	2	3	4	5	6	7	8	9	10
128	2³-2²-2²	4	8	128	4	16	4
256	2⁴-2²-2²	4	16	4	256	4	16	4
512	2⁴-2²-2³	4	16	4	512	4	32	4	8
1024	2⁵-2²-2³	4	8	32	4	1024	4	32	4	8
2048	2⁵-2³-2³	4	8	32	4	2048	4	8	64	4	8
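
Each algorithm name in Table 1 lists the radix-2^k blocks that are cascaded, so the exponents must sum to log2 of the FFT size (for example 3 + 2 + 2 = 7 for 128 points). The sketch below checks this and shows a plain software Cooley-Tukey split driven by those exponents. It is an illustration of the decomposition only: the function names are hypothetical, each block is evaluated by a direct DFT sum rather than by butterflies, and it does not model the paper's pipelined hardware or its twiddle-factor optimizations.

```python
import cmath

# Radix-2^k exponents of each algorithm, copied from Table 1.
ALGORITHMS = {
    128: (3, 2, 2),
    256: (4, 2, 2),
    512: (4, 2, 3),
    1024: (5, 2, 3),
    2048: (5, 3, 3),
}


def dft(x):
    """Direct O(n^2) DFT, used for the last block and as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def mixed_radix_fft(x, exponents):
    """Decimation-in-time Cooley-Tukey split driven by radix-2^k exponents."""
    n = len(x)
    if len(exponents) == 1:
        return dft(x)
    r = 2 ** exponents[0]   # radix of this block
    m = n // r              # length of each sub-transform
    # r interleaved sub-sequences, each handled by the remaining blocks
    subs = [mixed_radix_fft(x[q::r], exponents[1:]) for q in range(r)]
    # Recombine: X[k] = sum_q W_n^(q*k) * Sub_q[k mod m]
    return [sum(subs[q][k % m] * cmath.exp(-2j * cmath.pi * q * k / n)
                for q in range(r))
            for k in range(n)]


if __name__ == "__main__":
    for size, exps in ALGORITHMS.items():
        assert sum(exps) == size.bit_length() - 1   # exponents sum to log2(N)
    signal = [complex(i % 7, -(i % 5)) for i in range(128)]
    fast = mixed_radix_fft(signal, ALGORITHMS[128])
    ref = dft(signal)
    print(max(abs(a - b) for a, b in zip(fast, ref)))  # small rounding error
```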

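Table 2 Critical path of the dual-mode floating-point fused add-subtract unit

Module	Pipeline stage	Delay (ns)
Data extraction and significand generation	1	0.43
Exponent comparison and significand swap	1	0.77
Exponent difference and significand alignment	2	1.98
Significand sum and difference	3	1.47
LZA	3	0.35
Normalization left shift	3	0.16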
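
The per-module delays in Table 2 can be grouped by pipeline stage to see which stage limits the clock. The sketch below assumes that modules listed in the same stage lie in series on one path, so their delays add; that is an assumption about how to read the table, not something the page states. The 435 MHz clock is the operating frequency reported for this design in the comparison tables.

```python
from collections import defaultdict

# (module, pipeline stage, delay in ns), copied from Table 2.
ADD_SUB_PATH = [
    ("data extraction and significand generation", 1, 0.43),
    ("exponent comparison and significand swap", 1, 0.77),
    ("exponent difference and significand alignment", 2, 1.98),
    ("significand sum and difference", 3, 1.47),
    ("LZA", 3, 0.35),
    ("normalization left shift", 3, 0.16),
]

CLOCK_MHZ = 435.0  # operating frequency reported for this work


def stage_delays(path):
    """Sum module delays per stage (assumes modules in a stage are in series)."""
    totals = defaultdict(float)
    for _name, stage, delay in path:
        totals[stage] += delay
    return {stage: round(total, 2) for stage, total in sorted(totals.items())}


delays = stage_delays(ADD_SUB_PATH)
period_ns = 1000.0 / CLOCK_MHZ          # clock period in ns
critical_ns = max(delays.values())      # slowest pipeline stage
latency_ns = len(delays) * period_ns    # three stages, one cycle each

if __name__ == "__main__":
    print(delays)
    print(f"period {period_ns:.2f} ns, critical stage {critical_ns:.2f} ns, "
          f"latency {latency_ns:.1f} ns")
```

Under this reading, stages 2 and 3 are balanced at 1.98 ns each, which fits within the clock period of about 2.30 ns, and three cycles give the 6.9 ns latency listed for this work in Table 3.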

Table 3 Performance comparison of the dual-mode floating-point fused add-subtract unit

Parameter	Ref. [6]	Reference design T1	This work
Process (nm)	45	28	28
Normalized area (μm²)	5226 (100%)	2665 (51%)	2317 (44%)
Operating frequency (MHz)	1920	500	435
Latency (ns)	1.0	6.0	6.9
Power (mW)	5.2	0.6	0.4
Power × clock period (mW·ns)	2.70 (100%)	1.20 (44%)	0.84 (31%)
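
The percentages in Table 3 are each design's value relative to Ref. [6], and the last row is power multiplied by the clock period (power divided by frequency). A minimal sketch of both calculations follows, using only the rounded values printed in the table; the dictionary keys and function names are hypothetical.

```python
# Values copied from Table 3 (area in um^2, frequency in MHz, power in mW).
UNITS = {
    "Ref. [6]": {"area": 5226, "freq_mhz": 1920, "power_mw": 5.2, "ppp": 2.70},
    "T1": {"area": 2665, "freq_mhz": 500, "power_mw": 0.6, "ppp": 1.20},
    "This work": {"area": 2317, "freq_mhz": 435, "power_mw": 0.4, "ppp": 0.84},
}


def percent_of_baseline(value, baseline):
    """Express `value` as a rounded percentage of `baseline`."""
    return round(100.0 * value / baseline)


def power_period_product(power_mw, freq_mhz):
    """Power x clock period in mW*ns (numerically equal to pJ per cycle)."""
    return power_mw * 1000.0 / freq_mhz


if __name__ == "__main__":
    base = UNITS["Ref. [6]"]
    for name, u in UNITS.items():
        print(name,
              percent_of_baseline(u["area"], base["area"]),
              percent_of_baseline(u["ppp"], base["ppp"]),
              round(power_period_product(u["power_mw"], u["freq_mhz"]), 2))
```

Recomputing the product from the rounded power figure for this work gives roughly 0.92 rather than the tabulated 0.84; the table was presumably produced from an unrounded power value, so small differences of this kind are expected.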

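Table 4 Critical path of the dual-mode floating-point fused dot-product unit

Module	Pipeline stage	Delay (ns)
Data extraction and exponent comparison	1	0.33
Significand partial-product multiplication	1	0.87
Exponent difference and product alignment	2	1.25
Dual-mode 4:2 CSA	2	0.20
Dual-adder rounding path	2	0.53
LZD and normalization left shift	3	1.98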

Table 5 Performance comparison of the dual-mode floating-point fused dot-product unit

Parameter	Ref. [5]	Reference design T2	This work
Process (nm)	45	28	28
Normalized area (μm²)	12865 (100%)	10336 (80%)	10701 (83%)
Operating frequency (MHz)	1493	500	435
Latency (ns)	2.1	6.0	6.9
Power (mW)	16.9	2.7	2.5
Power × clock period (mW·ns)	11.3 (100%)	5.3 (47%)	5.6 (50%)

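Table 6 Overall performance comparison of the dual-mode floating-point FFT

Parameter	Ref. [2]	Ref. [9]	Ref. [16]	This work
Process (nm)	90	65	55	28
FFT architecture	Memory-based	Hybrid	MDF	MDF
FFT size (points)	512	1024	1024	128/256/512/1024/2048
Parallelism	8	1	1	8
Data type: word length (bit)	Block floating point: 12	Floating point: 32	Fixed point: 16	Floating point: 32/Floating point: 16
Average output SQNR (dB)	57	139	55	SP: 135/HP: 60
Clock frequency (MHz)	324	400	200	435
Computation time (μs)	0.3	2.6	5.1	0.2/0.3/0.6/1.1/2.2
Throughput (MSample/s)	2592	400	200	SP: 3478/HP: 6957
Average power (mW)	42	417	8	104 @435 MHz
Effective area (mm²)	0.93	1.19	0.15	1.41
Normalized area	10.0	22.1	1.9	16.0
Normalized throughput-area ratio	259	18	103	SP: 220/HP: 440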
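
The last row of Table 6 appears to be the throughput divided by the normalized area; that reading is an assumption, but it reproduces the tabulated figures to within rounding. The sketch below recomputes the ratios from the table and the gain over Ref. [9], the only other floating-point design listed. The dictionary layout and function name are hypothetical.

```python
# Throughput (MSample/s) and normalized area, copied from Table 6.
DESIGNS = {
    "Ref. [2]": {"throughput": 2592, "norm_area": 10.0},
    "Ref. [9]": {"throughput": 400, "norm_area": 22.1},
    "Ref. [16]": {"throughput": 200, "norm_area": 1.9},
    "This work (SP)": {"throughput": 3478, "norm_area": 16.0},
    "This work (HP)": {"throughput": 6957, "norm_area": 16.0},
}


def throughput_area_ratio(design):
    """Assumed definition: throughput divided by normalized area."""
    return design["throughput"] / design["norm_area"]


ratios = {name: throughput_area_ratio(d) for name, d in DESIGNS.items()}
# Gain of the single-precision mode over the floating-point design of Ref. [9].
gain_over_ref9 = ratios["This work (SP)"] / ratios["Ref. [9]"]

if __name__ == "__main__":
    for name, value in ratios.items():
        print(f"{name}: {value:.1f}")
    print(f"SP gain over Ref. [9]: {gain_over_ref9:.1f}x")
```

The single-precision gain over Ref. [9] comes out close to 12, which matches the roughly 12-fold improvement claimed in the abstract. The throughput figures themselves are consistent with one sample per lane per clock cycle (for example 8 lanes × 324 MHz = 2592 MSample/s for Ref. [2]).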

References (17)

[1] LÜ Qian and SU Tao. ISAR imaging of targets with complex motion based on the modified fast bilinear parameter estimation[J]. Journal of Electronics & Information Technology, 2016, 38(9): 2301–2308. doi: 10.11999/JEIT151359

[2] HUANG Shenjui and CHEN S. A high-throughput radix-16 FFT processor with parallel and normal input/output ordering for IEEE 802.15.3c systems[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2012, 59(8): 1752–1765. doi: 10.1109/TCSI.2011.2180430

[3] LAN G and FRANK H. Digital Processing of Synthetic Aperture Radar Data: Algorithms and Implementation[M]. Boston: Artech House Publishers, 2005: 154–210.

[4] CHEN Jienan, FEI Chao, YUAN Jiansheng, et al. An ultra-high-speed fully-parallel fast Fourier transform design[J]. Journal of Electronics & Information Technology, 2016, 38(9): 2410–2414. doi: 10.11999/JEIT160036

[5] JONGWOOK S and EARL E. Improved architectures for a floating-point fused dot product unit[C]. IEEE Symposium on Computer Arithmetic, Austin, USA, 2013: 41–48. doi: 10.1109/ARITH.2013.26

[6] JONGWOOK S and EARL E. Improved architectures for a fused floating-point add-subtract unit[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2012, 59(10): 2285–2291. doi: 10.1109/TCSI.2012.2188955

[7] CHO T and LEE H. A high-speed low-complexity modified radix-2⁵ FFT processor for high rate WPAN applications[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2013, 21(1): 187–191. doi: 10.1109/TVLSI.2011.2182068

[8] WANG Chao, YAN Yuwei, and FU Xiaoyu. A high-throughput low-complexity radix-2⁴-2²-2³ FFT/IFFT processor with parallel and normal input/output order for IEEE 802.11ad systems[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2015, 23(11): 2728–2732. doi: 10.1109/TVLSI.2014.2365586

[9] WANG Mingyu and LI Zhaolin. A hybrid SDC/SDF architecture for area and power minimization of floating-point FFT computations[C]. IEEE International Symposium on Circuits and Systems, Montreal, Canada, 2016: 2170–2173. doi: 10.1109/ISCAS.2016.7539011

[10] EARL E and HANI H. FFT implementation with fused floating-point operations[J]. IEEE Transactions on Computers, 2012, 61(2): 284–288. doi: 10.1109/TC.2010.271

[11] TANG S N, TSAI J W, and CHANG T Y. A 2.4-GS/s FFT processor for OFDM-based WPAN applications[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2010, 57(6): 451–455. doi: 10.1109/TCSII.2010.2048373

[12] NIE Zedong, ZHANG Fengjuan, LI Jie, et al. Low-power digital ASIC for on-chip spectral analysis of low-frequency physiological signals[J]. Journal of Semiconductors, 2012, 33(6): 67–70. doi: 10.1088/1674-4926

[13] IEEE 754-2008. IEEE Standard for Floating-Point Arithmetic[S]. 2008. doi: 10.1109/IEEESTD.2008.5976968

[14] PETER K. Correcting the normalization shift of redundant binary representations[J]. IEEE Transactions on Computers, 2009, 58(10): 1435–1439. doi: 10.1109/TC.2009.38

[15] YANG C H, YU T H, and DEJAN M. Power and area minimization of reconfigurable FFT processors: A 3GPP-LTE example[J]. IEEE Journal of Solid-State Circuits, 2011, 47(3): 757–768. doi: 10.1109/JSSC.2011.2176163

[16] MARIO G, HUANG S J, CHEN S G, et al. The serial commutator FFT[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2016, 63(10): 974–978. doi: 10.1109/TCSII.2016.2538119

[17] YANG K J, TSAI S H, and CHUANG G. MDC FFT/IFFT processor with variable length for MIMO-OFDM systems[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2013, 21(4): 720–731. doi: 10.1109/TVLSI.2012.2194315
