High Throughput Dual-mode Reconfigurable Floating-point FFT Processor
doi: 10.11999/JEIT180170
1. Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China
2. University of Chinese Academy of Sciences, Beijing 100190, China
Abstract: High-throughput, flexibly reconfigurable floating-point (FP) fast Fourier transform (FFT) processors are needed in applications such as real-time imaging for advanced radar and high-precision scientific computing. Because FP arithmetic is more complex than fixed-point arithmetic, the conflict between throughput on one side and area and power on the other is especially pronounced for FP FFTs. To reduce the arithmetic complexity, a long FFT is first decomposed into several short FFTs implemented as cascaded radix-2^k stages, and optimized mixed-radix algorithms are proposed for 128/256/512/1024/2048-point FFTs. In addition, novel fused FP add-subtract and dot-product units are proposed that support two modes: single-channel single precision, which operates on one pair of single-precision operands, and dual-channel half precision, which operates on two pairs of half-precision operands in parallel. Combining the two, a high-throughput dual-mode variable-length FP FFT processor architecture is proposed for the first time, and it is designed and implemented in SMIC 28 nm standard CMOS technology. Experimental results show that the throughput and average output Signal-to-Quantization-Noise Ratio (SQNR) are 3.478 GSample/s and 135 dB in single-channel single-precision mode, and 6.957 GSample/s and 60 dB in dual-channel half-precision mode. Compared with other existing FP FFT implementations, the normalized throughput-area ratio is improved by about 12 times.
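The two data formats behind the dual-mode claim can be illustrated with a small sketch. It packs two IEEE 754 half-precision values into one 32-bit word, the same width as a single-precision operand, and compares the SQNR of rounding a test tone to each format. The lane order (first value in the low 16 bits), the test signal, and all function names are assumptions made for illustration; the processor's actual datapath layout is not described in the abstract. The SQNR here covers a single rounding step only, so it is not comparable to the 135 dB and 60 dB figures reported for the whole FFT.

```python
import math
import struct


def pack_dual_half(a, b):
    """Pack two binary16 values into one 32-bit word.

    Assumed layout: a in bits 15..0, b in bits 31..16.
    """
    (lo,) = struct.unpack("<H", struct.pack("<e", a))
    (hi,) = struct.unpack("<H", struct.pack("<e", b))
    return (hi << 16) | lo


def unpack_dual_half(word):
    """Inverse of pack_dual_half: return both binary16 values as floats."""
    (a,) = struct.unpack("<e", struct.pack("<H", word & 0xFFFF))
    (b,) = struct.unpack("<e", struct.pack("<H", (word >> 16) & 0xFFFF))
    return a, b


def quantize(x, fmt):
    """Round x to the format named by fmt ('<e' binary16, '<f' binary32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def sqnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB."""
    signal = sum(r * r for r in reference)
    noise = sum((r - q) ** 2 for r, q in zip(reference, quantized))
    return 10.0 * math.log10(signal / noise)


# Hypothetical test tone; the phase offset avoids exactly representable samples.
samples = [math.sin(2 * math.pi * 3 * n / 64 + 0.1) for n in range(64)]
sqnr_half = sqnr_db(samples, [quantize(s, "<e") for s in samples])
sqnr_single = sqnr_db(samples, [quantize(s, "<f") for s in samples])

word = pack_dual_half(1.5, -0.25)
print(f"packed word: 0x{word:08X}, unpacked: {unpack_dual_half(word)}")
print(f"SQNR half: {sqnr_half:.1f} dB, single: {sqnr_single:.1f} dB")
```

Because both modes occupy the same 32-bit word, the half-precision mode can carry two samples per word, which is consistent with the doubled throughput reported for that mode.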
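Table 1 Mixed-radix algorithms proposed in this paper (columns 1 to 10: base corresponding to each sub-stage)

| Points | Optimized algorithm | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 128 | 2^3-2^2-2^2 | 4 | 8 | 128 | 4 | 16 | 4 | | | | |
| 256 | 2^4-2^2-2^2 | 4 | 16 | 4 | 256 | 4 | 16 | 4 | | | |
| 512 | 2^4-2^2-2^3 | 4 | 16 | 4 | 512 | 4 | 32 | 4 | 8 | | |
| 1024 | 2^5-2^2-2^3 | 4 | 8 | 32 | 4 | 1024 | 4 | 32 | 4 | 8 | |
| 2048 | 2^5-2^3-2^3 | 4 | 8 | 32 | 4 | 2048 | 4 | 8 | 64 | 4 | 8 |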
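The per-stage entries of Table 1 follow a regular pattern, which the sketch below regenerates from the radix exponents. It rests on an interpretation that the page does not state: each entry is read as the base N of the twiddle factor W_N applied after that sub-stage, and the last sub-stage has no twiddle. The intra-block patterns in `INTRA_BLOCK` are copied from the table rather than derived, and `stage_bases` is a hypothetical helper name.

```python
# Twiddle bases inside one radix-2^k block, as they appear in Table 1
# (assumed reading: radix-2^4 is nested as 2^2-2^2, radix-2^5 as 2^3-2^2).
INTRA_BLOCK = {
    2: [4],
    3: [4, 8],
    4: [4, 16, 4],
    5: [4, 8, 32, 4],
}

# FFT size -> radix exponents of the cascaded blocks, from Table 1.
TABLE_1 = {
    128: (3, 2, 2),
    256: (4, 2, 2),
    512: (4, 2, 3),
    1024: (5, 2, 3),
    2048: (5, 3, 3),
}


def stage_bases(n_points, exponents):
    """Return the twiddle base after each sub-stage except the last one."""
    if 2 ** sum(exponents) != n_points:
        raise ValueError("exponents must sum to log2(n_points)")
    bases = []
    remaining = n_points  # size of the sub-FFT still to be computed
    for index, k in enumerate(exponents):
        bases.extend(INTRA_BLOCK[k])
        if index < len(exponents) - 1:
            # Full twiddle of the remaining transform closes the block.
            bases.append(remaining)
        remaining >>= k
    return bases


for n_points, exponents in TABLE_1.items():
    print(n_points, exponents, stage_bases(n_points, exponents))
```

Under this reading, only the entries equal to the remaining transform size need general complex multipliers, which matches the abstract's stated aim of reducing multiplication complexity.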
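Table 2 Critical path of the dual-mode floating-point fused add-subtract unit

| Module | Pipeline stage | Delay (ns) |
|---|---|---|
| Data extraction and significand generation | 1 | 0.43 |
| Exponent comparison and significand swap | 1 | 0.77 |
| Exponent difference and significand alignment | 2 | 1.98 |
| Significand sum and difference | 3 | 1.47 |
| LZA | 3 | 0.35 |
| Normalization left shift | 3 | 0.16 |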
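As a rough consistency check on Table 2, the sketch below sums the module delays within each pipeline stage and derives the unit latency from the stage count and clock frequency. Treating modules of the same stage as connected in series is an assumption, and `stage_delays` and `latency_ns` are hypothetical helper names. The 435 MHz and 500 MHz clocks are the values listed in the performance comparison tables.

```python
from collections import defaultdict

# (module, pipeline stage, delay in ns), transcribed from Table 2.
ADD_SUB_PATH = [
    ("data extraction and significand generation", 1, 0.43),
    ("exponent comparison and significand swap", 1, 0.77),
    ("exponent difference and significand alignment", 2, 1.98),
    ("significand sum and difference", 3, 1.47),
    ("LZA", 3, 0.35),
    ("normalization left shift", 3, 0.16),
]


def stage_delays(path):
    """Sum module delays per stage, assuming serial connection in a stage."""
    totals = defaultdict(float)
    for _name, stage, delay in path:
        totals[stage] += delay
    return dict(totals)


def latency_ns(num_stages, clock_mhz):
    """Latency of a pipeline with num_stages stages at the given clock."""
    return num_stages * 1000.0 / clock_mhz


delays = stage_delays(ADD_SUB_PATH)
slowest = max(delays.values())
period = 1000.0 / 435.0  # clock period in ns at 435 MHz

print({stage: round(total, 2) for stage, total in delays.items()})
print(f"slowest stage {slowest:.2f} ns, clock period {period:.2f} ns")
print(f"latency at 435 MHz: {latency_ns(len(delays), 435):.1f} ns")
```

Under these assumptions the slowest stage fits within the 435 MHz clock period, and three stages at 435 MHz and 500 MHz give latencies that round to the 6.9 ns and 6.0 ns listed for this work and for reference structure T1.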
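Table 3 Performance comparison of the dual-mode floating-point fused add-subtract unit

| Parameter | Ref. [6] | Reference structure T1 | This work |
|---|---|---|---|
| Technology (nm) | 45 | 28 | 28 |
| Normalized area (μm²) | 5226 (100%) | 2665 (51%) | 2317 (44%) |
| Operating frequency (MHz) | 1920 | 500 | 435 |
| Latency (ns) | 1.0 | 6.0 | 6.9 |
| Power (mW) | 5.2 | 0.6 | 0.4 |
| Power × period | 2.70 (100%) | 1.20 (44%) | 0.84 (31%) |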
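Table 4 Critical path of the dual-mode floating-point fused dot-product unit

| Module | Pipeline stage | Delay (ns) |
|---|---|---|
| Data extraction and exponent comparison | 1 | 0.33 |
| Significand partial-product multiplication | 1 | 0.87 |
| Exponent difference and product alignment | 2 | 1.25 |
| Dual-mode 4:2 CSA | 2 | 0.20 |
| Dual-adder rounding path | 2 | 0.53 |
| LZD and normalization left shift | 3 | 1.98 |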
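Table 5 Performance comparison of the dual-mode floating-point fused dot-product unit

| Parameter | Ref. [5] | Reference structure T2 | This work |
|---|---|---|---|
| Technology (nm) | 45 | 28 | 28 |
| Normalized area (μm²) | 12865 (100%) | 10336 (80%) | 10701 (83%) |
| Operating frequency (MHz) | 1493 | 500 | 435 |
| Latency (ns) | 2.1 | 6.0 | 6.9 |
| Power (mW) | 16.9 | 2.7 | 2.5 |
| Power × period | 11.3 (100%) | 5.3 (47%) | 5.6 (50%) |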
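Table 6 Overall performance comparison of the dual-mode floating-point FFT

| Parameter | Ref. [2] | Ref. [9] | Ref. [16] | This work |
|---|---|---|---|---|
| Technology (nm) | 90 | 65 | 55 | 28 |
| FFT architecture | Memory-based | Hybrid | MDF | MDF |
| FFT size (points) | 512 | 1024 | 1024 | 128/256/512/1024/2048 |
| Parallelism | 8 | 1 | 1 | 8 |
| Data type: word length (bit) | Block floating-point: 12 | Floating-point: 32 | Fixed-point: 16 | Floating-point: 32 / Floating-point: 16 |
| Average output SQNR (dB) | 57 | 139 | 55 | SP: 135 / HP: 60 |
| Clock frequency (MHz) | 324 | 400 | 200 | 435 |
| Computation time (μs) | 0.3 | 2.6 | 5.1 | 0.2/0.3/0.6/1.1/2.2 |
| Throughput (MSample/s) | 2592 | 400 | 200 | SP: 3478 / HP: 6957 |
| Average power (mW) | 42 | 417 | 8 | 104 @ 435 MHz |
| Effective area (mm²) | 0.93 | 1.19 | 0.15 | 1.41 |
| Normalized area | 10.0 | 22.1 | 1.9 | 16.0 |
| Normalized throughput-area ratio | 259 | 18 | 103 | SP: 220 / HP: 440 |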
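The throughput and ratio rows of Table 6 can be cross-checked with a few lines of arithmetic. The sketch assumes that throughput equals parallelism times clock frequency (doubled in half-precision mode because two samples share one word) and that the normalized throughput-area ratio is throughput divided by normalized area. Neither formula is stated on this page, so both are inferences from the tabulated numbers, and the function names are hypothetical.

```python
# Values transcribed from Table 6.
DESIGNS = {
    "Ref. [2]": {"parallelism": 8, "clock_mhz": 324, "throughput": 2592, "norm_area": 10.0},
    "Ref. [9]": {"parallelism": 1, "clock_mhz": 400, "throughput": 400, "norm_area": 22.1},
    "Ref. [16]": {"parallelism": 1, "clock_mhz": 200, "throughput": 200, "norm_area": 1.9},
    "This work (SP)": {"parallelism": 8, "clock_mhz": 435, "throughput": 3478, "norm_area": 16.0},
}


def estimated_throughput(parallelism, clock_mhz, samples_per_word=1):
    """Assumed model: samples per cycle times clock, in MSample/s."""
    return parallelism * samples_per_word * clock_mhz


def throughput_area_ratio(throughput, norm_area):
    """Assumed definition of the normalized throughput-area ratio."""
    return throughput / norm_area


for name, d in DESIGNS.items():
    est = estimated_throughput(d["parallelism"], d["clock_mhz"])
    ratio = throughput_area_ratio(d["throughput"], d["norm_area"])
    print(f"{name}: estimated {est} MSample/s, "
          f"reported {d['throughput']}, ratio {ratio:.1f}")

sp = DESIGNS["This work (SP)"]
fp_ref = DESIGNS["Ref. [9]"]  # the floating-point design among the references
gain = (throughput_area_ratio(sp["throughput"], sp["norm_area"])
        / throughput_area_ratio(fp_ref["throughput"], fp_ref["norm_area"]))
print(f"SP gain over Ref. [9]: {gain:.1f}x")
```

The recomputed ratios differ slightly from the tabulated ones, presumably because the table's inputs are rounded, but the single-precision gain over the floating-point design in Ref. [9] comes out close to the "about 12 times" stated in the abstract.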
[1] LÜ Qian and SU Tao. ISAR imaging of targets with complex motion based on the modified fast bilinear parameter estimation[J]. Journal of Electronics & Information Technology, 2016, 38(9): 2301–2308. doi: 10.11999/JEIT151359.
[2] HUANG Shenjui and CHEN S. A high-throughput radix-16 FFT processor with parallel and normal input/output ordering for IEEE 802.15.3c systems[J]. IEEE Transactions on Circuits and Systems Ⅰ: Regular Papers, 2012, 59(8): 1752–1765. doi: 10.1109/TCSI.2011.2180430.
[3] LAN G and FRANK H. Digital Processing of Synthetic Aperture Radar Data: Algorithms and Implementation[M]. Boston: Artech House Publishers, 2005: 154–210.
[4] CHEN Jienan, FEI Chao, YUAN Jiansheng, et al. An ultra-high-speed fully-parallel fast Fourier transform design[J]. Journal of Electronics & Information Technology, 2016, 38(9): 2410–2414. doi: 10.11999/JEIT160036.
[5] JONGWOOK S and EARL E. Improved architectures for a floating-point fused dot product unit[C]. IEEE Symposium on Computer Arithmetic, Austin, USA, 2013: 41–48. doi: 10.1109/ARITH.2013.26.
[6] JONGWOOK S and EARL E. Improved architectures for a fused floating-point add-subtract unit[J]. IEEE Transactions on Circuits and Systems Ⅰ: Regular Papers, 2012, 59(10): 2285–2291. doi: 10.1109/TCSI.2012.2188955.
[7] CHO T and LEE H. A high-speed low-complexity modified radix-2^5 FFT processor for high rate WPAN applications[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2013, 21(1): 187–191. doi: 10.1109/TVLSI.2011.2182068.
[8] WANG Chao, YAN Yuwei, and FU Xiaoyu. A high-throughput low-complexity radix-2^4-2^2-2^3 FFT/IFFT processor with parallel and normal input/output order for IEEE 802.11ad systems[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2015, 23(11): 2728–2732. doi: 10.1109/TVLSI.2014.2365586.
[9] WANG Mingyu and LI Zhaolin. A hybrid SDC/SDF architecture for area and power minimization of floating-point FFT computations[C]. IEEE International Symposium on Circuits and Systems, Montreal, Canada, 2016: 2170–2173. doi: 10.1109/ISCAS.2016.7539011.
[10] EARL E and HANI H. FFT implementation with fused floating-point operations[J]. IEEE Transactions on Computers, 2012, 61(2): 284–288. doi: 10.1109/TC.2010.271.
[11] TANG S N, TSAI J W, and CHANG T Y. A 2.4-GS/s FFT processor for OFDM-based WPAN applications[J]. IEEE Transactions on Circuits and Systems Ⅱ: Express Briefs, 2010, 57(6): 451–455. doi: 10.1109/TCSII.2010.2048373.
[12] NIE Zedong, ZHANG Fengjuan, LI Jie, et al. Low-power digital ASIC for on-chip spectral analysis of low-frequency physiological signals[J]. Journal of Semiconductors, 2012, 33(6): 67–70. doi: 10.1088/1674-4926.
[13] IEEE 754-2008. IEEE Standard for Floating-Point Arithmetic[S]. 2008. doi: 10.1109/IEEESTD.2008.5976968.
[14] PETER K. Correcting the normalization shift of redundant binary representations[J]. IEEE Transactions on Computers, 2009, 58(10): 1435–1439. doi: 10.1109/TC.2009.38.
[15] YANG C H, YU T H, and DEJAN M. Power and area minimization of reconfigurable FFT processors: A 3GPP-LTE example[J]. IEEE Journal of Solid-State Circuits, 2011, 47(3): 757–768. doi: 10.1109/JSSC.2011.2176163.
[16] MARIO G, HUANG S J, CHEN S G, et al. The serial commutator FFT[J]. IEEE Transactions on Circuits and Systems Ⅱ: Express Briefs, 2016, 63(10): 974–978. doi: 10.1109/TCSII.2016.2538119.
[17] YANG K J, TSAI S H, and CHUANG G. MDC FFT/IFFT processor with variable length for MIMO-OFDM systems[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2013, 21(4): 720–731. doi: 10.1109/TVLSI.2012.2194315.