A High-Throughput Hardware Design for AV1 Rough Mode Decision

SHENG Qinghua; TAO Zehao; HUANG Xiaofang; LAI Changcai; HUANG Xiaofeng; YIN Haibing; DONG Zhekang

doi:10.11999/JEIT240823

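1. School of Electronic Information, Hangzhou Dianzi University, Hangzhou 310018, China
2. School of Information Engineering, Hangzhou Dianzi University, Hangzhou 311305, China
3. School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China

Funds: The National Key R&D Program of China (2023YFB4502804)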

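Author biographies:
SHENG Qinghua: male, associate professor. Research interests include video coding, FPGA hardware acceleration, and electronic system integration.

TAO Zehao: male, master's student. Research interests include video encoding and decoding and FPGA hardware acceleration.

HUANG Xiaofang: female, lecturer. Research interests include video coding and embedded applications.

LAI Changcai: male, senior engineer. Research interests include image and video compression, intelligent processing, and their software and hardware acceleration.

HUANG Xiaofeng: male, associate professor. Research interests include video encoding and decoding and chip architecture design.

YIN Haibing: male, professor. Research interests include digital video encoding and decoding, multimedia signal processing, and chip architecture design and verification.

DONG Zhekang: male, associate professor. Research interests include memristors and memristive systems and artificial neural networks.

Corresponding author:
HUANG Xiaofang　20221016@hdu.edu.cn

CLC number: TN919.8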
Publication history
- Received: 2024-09-27
- Revised: 2025-01-02
- Published online: 2025-01-09

A High-Throughput Hardware Design for AV1 Rough Mode Decision

1. School of Electronic Information, Hangzhou Dianzi University, Hangzhou 310018, China
2. School of Information Engineering, Hangzhou Dianzi University, Hangzhou 311305, China
3. School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China

Funds: The National Key R&D Program of China (2023YFB4502804)

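Abstract

Abstract: As video coding standards continue to evolve, the Alliance for Open Media (AOM) has released its latest video coding standard, Alliance for Open Media Video 1 (AV1). Its intra coding adopts a richer set of prediction modes to improve prediction efficiency, expanding the number of modes from 10 in VP9 to 61. To cope with the larger number of prediction modes and to raise hardware processing throughput, this paper proposes a hardware architecture for AV1 rough mode decision (RMD) based on a fully pipelined structure. At the algorithm level, the 4×4 block is taken as the minimum processing unit, and RMD is performed in Z-order on prediction units (PUs) of different sizes within a 64×64 coding tree unit (CTU). A cost accumulation approximation based on 1:1 PUs is used to obtain the costs of 1:2, 1:4, 2:1 and 4:1 PUs, which reduces computational complexity. At the hardware level, a single RMD circuit compatible with PU sizes from 4×4 to 32×32 is designed, replacing the approach of building a separate circuit for each PU size and effectively reducing idle logic resources. Experimental results show that, under the All-Intra (AI) configuration, the proposed algorithm saves 45.78% of the encoding time on average compared with the AV1 standard algorithm, at the cost of a 1.94% increase in BD-Rate. The proposed hardware architecture completes RMD for a 64×64 CTU within 1057 clock cycles. Synthesized with Synopsys Design Compiler 2016 and the UMC 28 nm process library, the design can process 8k@50.6fps video in real time at an operating frequency of 432.7 MHz.
- Alliance for Open Media Video 1 (AV1) /
- Intra prediction /
- Rough mode decision /
- Video coding /
- Pipeline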
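The Z-order traversal of 4×4 blocks inside a 64×64 CTU can be sketched as follows. This is a minimal illustration, assuming the usual Morton-style interleaving of block coordinates; the function name `z_order_blocks` and its parameters are hypothetical and do not come from the paper.

```python
def z_order_blocks(ctu_size=64, block_size=4):
    """Yield (x, y) pixel origins of block_size x block_size blocks in Z-order.

    Assumes ctu_size / block_size is a power of two (16 for a 64x64 CTU
    split into 4x4 blocks).
    """
    n = ctu_size // block_size            # blocks per side
    bits = n.bit_length() - 1             # log2(n)
    for idx in range(n * n):
        bx = by = 0
        for b in range(bits):
            bx |= ((idx >> (2 * b)) & 1) << b       # even index bits -> column
            by |= ((idx >> (2 * b + 1)) & 1) << b   # odd index bits  -> row
        yield bx * block_size, by * block_size


order = list(z_order_blocks())
print(order[:5])   # first blocks visited
print(len(order))  # number of 4x4 blocks in the CTU
```

A useful property of this order is that every aligned group of 4 consecutive blocks forms an 8×8 region, every group of 16 forms a 16×16 region, and so on, so results for larger PUs can be assembled as soon as their last 4×4 block has been processed.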
Abstract:
Objective: As demand for 4K and 8K Ultra High Definition (UHD) video grows, the latest generation of video coding standards has been developed to meet the need for UHD video transmission. UHD video coding processes far more pixels and detail, which significantly increases computational complexity and resource consumption, so algorithm optimization and hardware acceleration are essential for real-time encoding and decoding. Alliance for Open Media Video 1 (AV1) introduces richer intra-prediction modes, expanding the number of modes from 10 in VP9 to 61 and thereby increasing computational complexity. To handle this added complexity and raise hardware processing throughput, a hardware design for AV1 Rough Mode Decision (RMD) based on a fully pipelined architecture is proposed.

Methods: At the algorithm level, a 4×4 block is used as the minimum processing unit, and RMD is applied to Prediction Units (PUs) of various sizes within a 64×64 Coding Tree Unit (CTU) following Z-order scanning, so that large blocks are processed as a sequence of smaller units. To reduce computational complexity, the SATD costs of non-square PUs (1:2, 1:4, 2:1, and 4:1) are obtained with a cost accumulation approximation based on the 1:1 PUs, which avoids recalculating costs for every PU shape. At the hardware level, a single architecture supports RMD for PUs from 4×4 to 32×32 within a 64×64 CTU. Unlike traditional designs that use a separate circuit for each PU size, it keeps logic resources busy and minimizes idle time. The design uses a 28-stage pipeline that processes intra-prediction modes in parallel and completes RMD for at least 16 pixels per clock cycle. It also emphasizes circuit compatibility and reuse across PU sizes, reducing redundancy and improving hardware resource utilization.

Results and Discussions: Software analysis shows that, under the All-Intra (AI) configuration, the proposed RMD algorithm reduces processing time by an average of 45.78% compared with the standard AV1 algorithm, at the cost of a 1.94% increase in BD-Rate. The testing platform is an Intel(R) Core(TM) i9-9900K CPU @ 3.60 GHz with 16.0 GB of DRAM. Compared with existing methods, the algorithm trades a slight BD-Rate loss for a substantial reduction in encoding time. Hardware analysis shows that the proposed architecture has a total circuit area of 0.556 mm² after synthesis and a maximum operating frequency of 432.7 MHz, enabling real-time encoding of 8k@50.6fps video. Although the circuit area is larger than in existing designs, the architecture offers higher processing speed and supports higher video resolution, giving a balanced trade-off between hardware resource usage and throughput per area.

Conclusions: This paper presents a high-throughput hardware design for AV1 RMD that handles all PU sizes with 56 directional and 5 non-directional prediction modes. The design employs a 28-stage pipeline for parallel processing of intra-prediction modes and completes RMD for at least 16 pixels per clock cycle. Techniques such as false-reconstructed reference pixels, Z-order scanning, PMCM circuit structures, and circuit reuse address the increased hardware resource demands of parallel processing. Experimental results show that the proposed algorithm reduces processing time by an average of 45.78% while increasing BD-Rate by 1.94% compared with the AV1 standard. Circuit synthesis confirms that the architecture can process 8k@50.6fps video in real time, meeting the demands of UHD video encoding.
- Alliance for Open Media Video 1 (AV1) /
- Intra prediction /
- Rough Mode Decision (RMD) /
- Video coding /
- Pipeline
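The cost accumulation approximation for non-square PUs can be illustrated with a short sketch. It assumes that the per-mode cost of a rectangular PU is approximated by summing the per-mode costs already computed for the square PUs that tile it; the exact tiling, the cost values, and the names `accumulate_costs` and `best_modes` are hypothetical.

```python
def accumulate_costs(square_costs):
    """Sum per-mode cost lists of the square PUs that tile a rectangular PU."""
    return [sum(per_mode) for per_mode in zip(*square_costs)]


def best_modes(costs, keep=3):
    """Return the indices of the `keep` lowest-cost modes."""
    return sorted(range(len(costs)), key=lambda m: costs[m])[:keep]


# Hypothetical SATD costs of four candidate modes for two stacked 8x8 PUs.
top_8x8 = [120, 95, 140, 110]
bottom_8x8 = [100, 130, 90, 105]

# Approximate cost of the 8x16 (1:2) PU covering both squares.
cost_8x16 = accumulate_costs([top_8x8, bottom_8x8])
candidates = best_modes(cost_8x16, keep=2)
print(cost_8x16, candidates)
```

The result is an approximation because a true 8×16 prediction would use the reference pixels of the whole rectangle, whereas each square PU was predicted from its own neighbours. The benefit is that no extra prediction or SATD pass is needed for the non-square shapes.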

Figure 1 Overall RMD hardware architecture

Figure 2 Flowchart of the hardware RMD implementation

Figure 3 Space-time diagram of the overall architecture

Figure 4 Reference pixel padding for a 4×4 PU

Figure 5 Illustration of the input order

Figure 6 Hardware design for directional modes

Figure 7 Hardware design for DC mode
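As a software reference for what a DC-mode circuit computes, the following minimal sketch fills a block with the rounded mean of the above and left reference pixels. It assumes both reference edges are available and ignores the boundary cases where one or both are missing; the function name `dc_predict` is hypothetical.

```python
def dc_predict(above, left):
    """Fill a len(left) x len(above) block with the rounded mean of the references."""
    count = len(above) + len(left)
    # Adding count // 2 before the integer division rounds to nearest.
    dc = (sum(above) + sum(left) + count // 2) // count
    return [[dc] * len(above) for _ in left]


above = [100, 102, 104, 106]
left = [98, 99, 101, 102]
block = dc_predict(above, left)
print(block[0])
```

For square PUs the reference count is a power of two, so the division reduces to a right shift in hardware.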

Figure 8 Hardware design for smooth modes

Figure 9 PMCM hardware design for smooth-mode weights
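The idea behind a multiple constant multiplication (MCM) block, where one input is multiplied by several fixed weights using shared shift-and-add terms instead of general multipliers, can be sketched as below. The constant set {255, 149, 85, 64} and this particular adder graph are assumptions chosen for illustration; they are not taken from the paper's circuit.

```python
def mcm_weights(x):
    """Multiply x by 255, 149, 85 and 64 using only shifts, adds and one subtract."""
    x64 = x << 6               # 64x: a pure shift, no adder needed
    x5 = (x << 2) + x          # 5x   = 4x + x
    x85 = (x5 << 4) + x5       # 85x  = 80x + 5x, reuses 5x
    x149 = x85 + x64           # 149x = 85x + 64x, reuses both terms
    x255 = (x << 8) - x        # 255x = 256x - x
    return x255, x149, x85, x64


print(mcm_weights(200))
```

In this sketch four adders or subtractors produce all four products, because intermediate terms such as 5x and 85x are shared between outputs.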

Figure 10 Hardware design for Paeth mode
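The Paeth predictor itself is compact enough to state in a few lines. The sketch below follows the classic definition (pick whichever of left, top, and top-left is closest to left + top - top_left), with ties resolved in the order left, top, top-left; the function names are hypothetical and no hardware detail of the paper is implied.

```python
def paeth_predict(left, top, top_left):
    """Return the neighbour closest to the gradient estimate left + top - top_left."""
    base = left + top - top_left
    d_left = abs(base - left)
    d_top = abs(base - top)
    d_top_left = abs(base - top_left)
    if d_left <= d_top and d_left <= d_top_left:
        return left
    if d_top <= d_top_left:
        return top
    return top_left


def paeth_block(above, left, top_left):
    """Predict a block: each pixel uses its row's left and its column's above sample."""
    return [[paeth_predict(l, a, top_left) for a in above] for l in left]


print(paeth_block([10, 20], [30, 40], 15))
```

Each output pixel needs three absolute differences and two comparisons.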

Figure 11 Hardware design for SATD cost calculation of a 4×4 PU
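A software reference for the SATD of a 4×4 block is sketched below: the residual is transformed with a 4×4 Hadamard matrix on both sides and the absolute coefficients are summed. The result is left unnormalized here, which is an assumption, since encoders differ in the scaling they apply; the helper names are hypothetical.

```python
# 4x4 Hadamard matrix (symmetric, so it equals its own transpose).
H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]


def matmul4(a, b):
    """Multiply two 4x4 integer matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd_4x4(orig, pred):
    """Sum of absolute Hadamard-transformed differences of two 4x4 blocks."""
    diff = [[o - p for o, p in zip(row_o, row_p)]
            for row_o, row_p in zip(orig, pred)]
    coeffs = matmul4(matmul4(H4, diff), H4)
    return sum(abs(c) for row in coeffs for c in row)


flat = [[10] * 4 for _ in range(4)]
shifted = [[9] * 4 for _ in range(4)]
print(satd_4x4(flat, shifted))
```

Because the Hadamard matrix contains only +1 and -1, the transform needs only additions and subtractions, which is why SATD is a common cost metric in hardware mode decision.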

Figure 12 Example of bitonic sorting of an unordered sequence of length 8

Figure 13 Hardware design of bitonic sorting for an input sequence of length 8
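A bitonic sorting network of the kind named in these captions can be modelled in software as follows. This is a generic textbook formulation for power-of-two input lengths, not the paper's circuit; the function name and the comparator counter are hypothetical additions for illustration.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length list with a bitonic network.

    Returns the sorted list and the number of compare-exchange units used.
    """
    data = list(values)
    n = len(data)
    comparators = 0
    k = 2
    while k <= n:                  # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:              # compare distance within this stage
            for i in range(n):
                partner = i ^ j
                if partner > i:    # visit each pair once
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
                    comparators += 1
            j //= 2
        k *= 2
    return data, comparators


print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))
```

For 8 inputs the network has 6 stages of 4 independent compare-exchange operations each. The comparison pattern is fixed regardless of the data, so each stage maps naturally onto one pipeline stage, for example when ranking mode costs to select the lowest-cost candidates.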

Table 1 Performance comparison between the improved algorithm and the AV1 standard algorithm (%)

Test sequence	BD-Rate	TS (time saving)
A1(UHD 4K)	2.21	49.2
A2(UHD 4K)	1.77	46.4
B(1080P)	1.93	48.1
C(480P)	2.23	38.4
E(720P)	1.56	46.8
Average	1.94	45.78

Table 2 Comparison of the proposed improved algorithm with existing work (%)

Reference	BD-Rate	TS (time saving)
[33]	1.28	29.80
[34]	7.41	50.19
[35]	0.60	15.36
This work	1.94	45.78

Table 3 Comparison of ASIC-based hardware designs related to RMD

Metric	Ref. [36]	Ref. [37]	Ref. [38]	Ref. [39]	This work
Process	TSMC 40 nm	TSMC 40 nm	TSMC 40 nm	TSMC 40 nm	UMC 28 nm
Gate count (Kgates)	455.8	821.8	584.8	128.5	1011.3
Operating frequency (MHz)	1,296	1,902	1,296	648	432.7
Clock cycles	7104	7104	7104	7104	1057
Power (mW)	40.9	1613.3	4110.0	65.5	1891.6
Throughput	4k@60fps	4k@60fps	4k@60fps	4k@30fps	8k@50.6fps
Throughput/area (px/gate)	1091.85	605.55	850.93	1936.44	1660.03
Non-directional prediction	×	×	×	√	√
Directional prediction	√	√	√	×	√
Mode decision	×	×	×	×	√
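The throughput figures in Table 3 can be related to each other with a short calculation. The sketch below assumes that 8k means 7680×4320 and 4k means 3840×2160, that a frame is counted as width × height / 64² CTUs, and that the throughput/area row is pixels per second divided by gate count. These conventions are inferred, not stated in the table, and the function names are hypothetical.

```python
def frames_per_second(freq_hz, cycles_per_ctu, width, height, ctu=64):
    """Estimate frame rate from clock frequency and cycles per CTU."""
    ctus_per_frame = (width * height) / (ctu * ctu)
    return freq_hz / (cycles_per_ctu * ctus_per_frame)


def throughput_per_gate(width, height, fps, kgates):
    """Pixels processed per second, divided by the gate count."""
    return width * height * fps / (kgates * 1000)


fps_estimate = frames_per_second(432.7e6, 1057, 7680, 4320)
this_work = throughput_per_gate(7680, 4320, 50.6, 1011.3)
ref36 = throughput_per_gate(3840, 2160, 60, 455.8)
ref39 = throughput_per_gate(3840, 2160, 30, 128.5)
print(round(fps_estimate, 2), round(this_work, 2), round(ref36, 2), round(ref39, 2))
```

Under these assumptions the frame-rate estimate lands close to, though not exactly on, the reported 8k@50.6fps, and the throughput/area values for this work and for Refs. [36] and [39] reproduce the tabulated numbers to two decimal places.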

References (39)

[1]	BENDER I, BORGES A, AGOSTINI L, et al. Complexity and compression efficiency analysis of libaom AV1 video codec[J]. Journal of Real-Time Image Processing, 2023, 20(3): 50. doi: 10.1007/s11554-023-01308-5.
[2]	REN Huiwen, WANG Shanshe, MA Siwei, et al. SVT-AVS3: An open-source high-performance AVS3 encoder with scalable video technology[J]. IEEE Transactions on Multimedia, 2024, 26: 3291–3301. doi: 10.1109/TMM.2023.3309549.
[3]	LEE M, SONG H J, PARK J, et al. Overview of versatile video coding (H.266/VVC) and its coding performance analysis[J]. IEIE Transactions on Smart Processing & Computing, 2023, 12(2): 122–154. doi: 10.5573/IEIESPC.2023.12.2.122.
[4]	MUKHERJEE D, HAN Jingning, BANKOSKI J, et al. A technical overview of VP9—the latest open-source video codec[J]. SMPTE Motion Imaging Journal, 2015, 124(1): 44–54. doi: 10.5594/j18499.
[5]	LIN Hao and RAO Feng. Analysis on the development trend of AV1 video coding standard in China[J]. Radio & Television Information, 2023, 30(2): 62–64. doi: 10.16045/j.cnki.rti.2023.02.022.
[6]	DU Hongqing. Research on high efficiency intra prediction algorithm for next generation video coding[D]. [Master dissertation], Xidian University, 2023. doi: 10.27389/d.cnki.gxadu.2023.001917.
[7]	GROIS D, GILADI A, CHOI K, et al. Performance comparison of emerging EVC and VVC video coding standards with HEVC and AV1[J]. SMPTE Motion Imaging Journal, 2021, 130(4): 1–12. doi: 10.5594/JMI.2021.3065442.
[8]	UHRINA M, SEVCIK L, BIENIK J, et al. Performance comparison of VVC, AV1, HEVC, and AVC for high resolutions[J]. Electronics, 2024, 13(5): 953. doi: 10.3390/electronics13050953.
[9]	LIU Chang, JIA Kebin, and LIU Pengyu. Fast partition algorithm in depth map intra-frame coding unit based on multi-branch network[J]. Journal of Electronics & Information Technology, 2022, 44(12): 4357–4366. doi: 10.11999/JEIT211010.
[10]	WANG Yizhao, ZHANG Chaobo, and SUN Songlin. Intra prediction fast algorithm in AVS3 based on image texture characteristics[C]. 2021 20th International Symposium on Communications and Information Technologies, Tottori, Japan, 2021: 6–10. doi: 10.1109/ISCIT52804.2021.9590620.
[11]	ZHANG Yongfei, LI Zhe, LI Bo, et al. Gradient-based fast decision for intra prediction in HEVC[C]. 2012 Visual Communications and Image Processing, San Diego, USA, 2012: 1–6. doi: 10.1109/VCIP.2012.6410739.
[12]	ZHU Linwei, ZHANG Yun, LI Na, et al. Deep learning-based intra mode derivation for versatile video coding[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2023, 19(2s): 96. doi: 10.1145/356369.
[13]	DUARTE A, ZATT B, CORREA G, et al. Fast intra mode decision using machine learning for the versatile video coding standard[C]. 2023 IEEE International Symposium on Circuits and Systems, Monterey, USA, 2023: 1–5. doi: 10.1109/ISCAS46773.2023.10181769.
[14]	STORCH I, ROMA N, PALOMINO D, et al. GPU acceleration of MIP intra prediction in VVC[C]. 2023 31st European Signal Processing Conference, Helsinki, Finland, 2023: 600–604. doi: 10.23919/EUSIPCO58844.2023.10290037.
[15]	HAN Xu, WANG Shanshe, MA Siwei, et al. Optimization of motion compensation based on GPU and CPU for VVC decoding[C]. 2020 IEEE International Conference on Image Processing, Abu Dhabi, United Arab Emirates, 2020: 1196–1200. doi: 10.1109/ICIP40778.2020.9190708.
[16]	CORRÊA M, WASKOW B, ZATT B, et al. High throughput hardware design for AV1 Paeth and smooth intra modes[C]. 2019 IEEE International Symposium on Circuits and Systems, Sapporo, Japan, 2019: 1–5. doi: 10.1109/ISCAS.2019.8702258.
[17]	CAI Zhanyuan and GAO Wei. Efficient fast algorithm and parallel hardware architecture for intra prediction of AVS3[C]. 2021 IEEE International Symposium on Circuits and Systems, Daegu, South Korea, 2021: 1–5. doi: 10.1109/ISCAS51556.2021.9401121.
[18]	HUANG Xiaofeng, JIA Huizhu, CAI Binbin, et al. Fast algorithms and VLSI architecture design for HEVC intra-mode decision[J]. Journal of Real-Time Image Processing, 2016, 12(2): 285–302. doi: 10.1007/s11554-015-0549-8.
[19]	CORRÊA M, WASKOW B, GOEBEL J, et al. A high throughput hardware architecture targeting the AV1 Paeth intra predictor[C]. 2019 IEEE 10th Latin American Symposium on Circuits & System, Armenia, Colombia, 2019: 93–96. doi: 10.1109/LASCAS.2019.8667544.
[20]	LIU Pengyu, ZHANG Yue, JIA Kebin, et al. Adaptive video frame type decision algorithm based on local luminance histogram[J]. Journal of Electronics & Information Technology, 2023, 45(1): 300–307. doi: 10.11999/JEIT211199.
[21]	SU Weitong, XIANG Guoqing, HUANG Xiaofeng, et al. Fast algorithm and VLSI architecture design of rough mode decision for AVS3[C]. 2023 IEEE International Conference on Consumer Electronic, Las Vegas, USA, 2023: 1–4. doi: 10.1109/ICCE56470.2023.10043565.
[22]	QI Meibin, CHEN Xiuli, YANG Yanfang, et al. Fast coding unit splitting algorithm for high efficiency video coding intra prediction[J]. Journal of Electronics & Information Technology, 2014, 36(7): 1699–1705. doi: 10.3724/SP.J.1146.2013.01148.
[23]	CHEN Yue, MUKHERJEE D, HAN Jingning, et al. An overview of coding tools in AV1: The first video codec from the alliance for open media[J]. APSIPA Transactions on Signal and Information Processing, 2020, 9(1): e6. doi: 10.1017/ATSIP.2020.2.
[24]	HAKKENNES E A and VASSILIADIS S. Hardwired Paeth codec for portable network graphics (PNG)[C]. Proceedings 25th EUROMICRO Conference. Informatics: Theory and Practice for the New Millennium, Milan, Italy, 1999: 318–325. doi: 10.1109/EURMIC.1999.794796.
[25]	PAETH A W. Image file compression made easy[M]. ARVO J. Graphics Gems II. Amsterdam: Elsevier, 1991: 93–100. doi: 10.1016/B978-0-08-050754-5.50029-3.
[26]	STORCH I, ROMA N, PALOMINO D, et al. Alternative reference samples to improve coding efficiency for parallel intra prediction solutions[C]. 2024 IEEE 15th Latin America Symposium on Circuits and Systems, Punta del Este, Uruguay, 2024: 1–5. doi: 10.1109/LASCAS60203.2024.10506142.
[27]	KUMM M. Multiple Constant Multiplication Optimizations for Field Programmable Gate Arrays[M]. Wiesbaden: Springer, 2016. doi: 10.1007/978-3-658-13323-8.
[28]	LIACHA A, OUDJIDA A K, BAKIRI M, et al. Radix-2^r recoding with common subexpression elimination for multiple constant multiplication[J]. IET Circuits, Devices & Systems, 2020, 14(7): 990–994. doi: 10.1049/iet-cds.2020.0213.
[29]	MOHAMED H, ELLIETHY A, ABDELAZIZ A, et al. Real-time motion estimation based video steganography with preserved consistency and local optimality[J]. Multimedia Tools and Applications, 2024: 1–24. doi: 10.1007/s11042-024-18651-9.
[30]	CHEN Shushi, HUANG Leilei, LIU Jiahao, et al. An error-surface-based fractional motion estimation algorithm and hardware implementation for VVC[C]. 2023 IEEE International Symposium on Circuits and Systems, Monterey, USA, 2023: 1–5. doi: 10.1109/ISCAS46773.2023.10182170.
[31]	YANG Mouzhi, ZHANG Peng, FANG Jianbin, et al. thSORT: An efficient parallel sorting algorithm on multi-core DSPs[J]. CCF Transactions on High Performance Computing, 2024, 6(5): 503–518. doi: 10.1007/s42514-023-00175-7.
[32]	ESMAILI-DOKHT P, GUIOT M, RADOJKOVIĆ P, et al. O(n) key–value sort with active compute memory[J]. IEEE Transactions on Computers, 2024, 73(5): 1341–1356. doi: 10.1109/TC.2024.3371773.
[33]	CORRÊA M M. Heuristic-based algorithms and hardware designs for fast intra-picture prediction in AV1 video coding[D]. [Ph. D. dissertation], Universidade Federal de Pelotas, 2023.
[34]	ROSA P, PALOMINO D, PORTO M, et al. GM-RF: An AV1 intra-frame fast decision based on random forest[C]. 2022 IEEE International Conference on Image Processing, Bordeaux, France, 2022: 3556–3560. doi: 10.1109/ICIP46576.2022.9897488.
[35]	CORRÊA M, ROMA N, PALOMINO D, et al. Mode-adaptive subsampling of SAD/SSE operations for intra prediction cost reduction[C]. 2022 IEEE International Symposium on Circuits and Systems, Austin, USA, 2022: 1808–1812. doi: 10.1109/ISCAS48785.2022.9937507.
[36]	CORRÊA M, NETO L, PALOMINO D, et al. ASIC solution for the directional intra prediction of the AV1 encoder targeting UHD 4K videos[C]. 2020 IEEE International Symposium on Circuits and Systems, Seville, Spain, 2020: 1–5. doi: 10.1109/ISCAS45731.2020.9180526.
[37]	NETO L, CORRÊA M, PALOMINO D, et al. Directional intra frame prediction architecture with edge filter and upsampling for AV1 video coding[C]. 2020 33rd Symposium on Integrated Circuits and Systems Design, Campinas, Brazil, 2020: 1–6. doi: 10.1109/SBCCI50935.2020.9189902.
[38]	NETO L, CORRÊA M, PALOMINO D, et al. Exploring operation sharing in directional intra frame prediction of AV1 video coding[C]. 2021 IEEE 12th Latin America Symposium on Circuits and System, Arequipa, Peru, 2021: 1–4. doi: 10.1109/LASCAS51355.2021.9459136.
[39]	CORRÊA M M, WASKOW B H, GOEBEL J W, et al. A high-throughput hardware architecture for AV1 non-directional intra modes[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2020, 67(5): 1481–1494. doi: 10.1109/TCSI.2020.2973031.