Video Object Segmentation Algorithm Based on Multi-scale Feature Enhancement and Global-Local Feature Aggregation
doi: 10.11999/JEIT231394
-
1. School of Computer Science, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
2. Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
-
Abstract: To address the insufficient multi-scale feature representation and the under-exploitation of shallow features in memory-network based methods, a Video Object Segmentation (VOS) algorithm based on multi-scale feature enhancement and global-local feature aggregation is proposed in this paper. First, a multi-scale feature enhancement module fuses feature information at different scales from the reference mask branch and the reference RGB branch, strengthening the representation of multi-scale features. At the same time, a global-local feature aggregation module is established: convolutions with receptive fields of different sizes extract features, and a feature aggregation module adaptively fuses the features of the global and local regions. This fusion better captures both the global appearance and the fine details of the target, improving segmentation accuracy. Finally, a cross-layer fusion module is designed that exploits the spatial detail of shallow features to refine the segmentation mask; fusing shallow features with deep features better captures the details and edges of the target. Experimental results show that on the public datasets DAVIS2016, DAVIS2017, and YouTube-2018, the overall performance of the proposed algorithm reaches 91.8%, 84.5%, and 83.0%, respectively, and the method runs in real time on both single-object and multi-object segmentation tasks.
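The abstract only outlines the three modules, so the following is a minimal PyTorch-style sketch of one plausible structure for a global-local feature aggregation block of the kind described: parallel convolutions with different receptive fields provide local context, a pooled branch provides global context, and a learned softmax weighting fuses them adaptively. The branch count, dilation rates, global-pooling branch, and softmax-weighted fusion are illustrative assumptions, not the paper's actual design.

```python
# Sketch of a global-local feature aggregation block in the spirit of the GLFAM
# described in the abstract. All structural choices below are assumptions.
import torch
import torch.nn as nn


class GlobalLocalAggregation(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Local branches: same kernel size, growing dilation -> growing receptive field.
        self.local_branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 4)
        ])
        # Global branch: global average pooling followed by a 1x1 convolution.
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
        )
        # Predict a per-pixel weight for each branch, normalized with softmax,
        # so global and local context are fused adaptively at every location.
        num_branches = len(self.local_branches) + 1
        self.weight_pred = nn.Conv2d(channels, num_branches, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.local_branches]
        g = self.global_branch(x)
        feats.append(g.expand(-1, -1, h, w))                 # broadcast global context
        weights = torch.softmax(self.weight_pred(x), dim=1)  # (B, num_branches, H, W)
        fused = sum(wgt.unsqueeze(1) * f
                    for wgt, f in zip(weights.unbind(dim=1), feats))
        return fused + x                                      # residual connection


if __name__ == "__main__":
    block = GlobalLocalAggregation(channels=64)
    out = block(torch.randn(2, 64, 30, 54))
    print(out.shape)  # torch.Size([2, 64, 30, 54])
```

In the same spirit, a cross-layer fusion step could upsample a deep feature map, concatenate it with a shallow one, and mix them with a 1×1 convolution before mask prediction; this again is a sketch of the general idea, not the paper's exact module.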
-
Key words:
- Video Object Segmentation (VOS)
- Memory network
- Siamese network
- Feature fusion
- Mask refinement
-
Table 1 Performance comparison of different algorithms on the DAVIS2016 and DAVIS2017 validation sets
Algorithm | Source | DAVIS2016: J&F / J / F / Speed(fps) / Time(s) | DAVIS2017: J&F / J / F / Speed(fps) / Time(s)
OSVOS[5] | CVPR2017 | 80.2 / 79.8 / 80.6 / 0.10 / 10.00 | 60.3 / 56.6 / 63.9 / 0.1 / 10.00
OnAVOS[7] | CVPRW2017 | 85.5 / 86.1 / 84.9 / 0.08 / 12.50 | 63.6 / 61.0 / 66.1 / 0.05 / 22.0
OSVOS-S[25] | TPAMI2018 | 86.6 / 85.6 / 87.5 / 0.20 / 5.00 | 68.0 / 64.7 / 71.3 / 0.1 / 10.00
OSNM[26] | CVPR2018 | 73.5 / 74.0 / 72.9 / 7.70 / 0.13 | 54.8 / 52.5 / 57.1 / 7.0 / 0.14
FAVOS[27] | CVPR2018 | 82.4 / 79.5 / 80.9 / 0.60 / 1.67 | 58.2 / 54.6 / 61.8 / 5.6 / 0.18
AGAME[14] | CVPR2019 | 82.1 / 82.0 / 82.2 / 14.00 / 0.07 | 70.0 / 67.4 / 72.6 / 14.0 / 0.07
RANet[28] | ICCV2019 | 85.5 / 85.5 / 85.4 / 33.00 / 0.03 | 65.7 / 63.2 / 68.2 / 33.0 / 0.03
FTMU[29] | CVPR2020 | 78.9 / 77.5 / 80.3 / 11.00 / 0.09 | 70.6 / 69.1 / 72.1 / 11.0 / 0.09
SSM[19] | TCSVT2021 | 85.9 / 86.2 / 85.6 / 37.00 / 0.03 | 77.6 / 75.3 / 79.9 / -- / --
TMO[20] | TCSVT2023 | 86.1 / 85.6 / 86.6 / 43.20 / 0.02 | 72.3 / 69.9 / 74.7 / 37.0 / 0.03
STM[11] | ICCV2019 | 89.3 / 88.7 / 89.9 / 10.30 / 0.10 | 81.8 / 79.2 / 84.3 / 8.8 / 0.11
FRTM[21] | CVPR2020 | 83.6 / 83.7 / 83.4 / 21.90 / 0.05 | 76.7 / 73.8 / 79.6 / 21.9 / 0.05
GC[15] | ECCV2020 | 86.6 / 87.6 / 85.7 / 25.00 / 0.04 | 71.4 / 69.3 / 73.5 / -- / --
KMN[16] | ECCV2020 | 90.5 / 89.5 / 83.6 / 9.00 / 0.11 | 82.8 / 80.0 / 85.6 / 8.0 / 0.13
TransVOS[22] | CVPR2021 | 90.5 / 89.8 / 91.2 / -- / -- | 83.9 / 81.4 / 86.4 / -- / --
MTMFI[23] | Neurocomputing 2022 | 85.2 / 84.9 / 85.5 / 13.70 / 0.07 | 77.6 / 74.6 / 80.6 / 13.7 / 0.07
ILTR[24] | Chinese Journal of Computers 2022 | 84.6 / 84.9 / 84.3 / 18.00 / 0.06 | 72.9 / 70.0 / 75.8 / -- / --
KMNM[17] | TPAMI2022 | 91.2 / 90.2 / 92.1 / 8.00 / 0.13 | 83.5 / 80.9 / 86.1 / 8.0 / 0.13
LLB[30] | AAAI2023 | -- / -- / -- / -- / -- | 84.6 / 81.5 / 87.7 / 8.3 / 0.12
MGLAS | This paper | 91.8 / 90.6 / 93.0 / 33.45 / 0.03 | 84.5 / 81.6 / 87.3 / 26.6 / 0.04
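For reference, the metrics in Table 1 follow the standard DAVIS evaluation protocol: region similarity J is the Jaccard index (intersection over union) between the predicted mask M and the ground-truth mask G, contour accuracy F is the F-measure of contour precision P_c and recall R_c, and J&F is their mean. The time column is the reciprocal of the speed in fps (e.g. 33.45 fps ≈ 0.03 s per frame).

```latex
\[
\mathcal{J} = \frac{|M \cap G|}{|M \cup G|}, \qquad
\mathcal{F} = \frac{2\,P_c R_c}{P_c + R_c}, \qquad
\mathcal{J}\&\mathcal{F} = \frac{\mathcal{J} + \mathcal{F}}{2}
\]
```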
Table 2 Performance comparison of different algorithms on the YouTube-2018 validation set
Algorithm | Source | G | J_s | J_u | F_s | F_u
MSK[13] | CVPR2017 | 53.1 | 59.9 | 45.0 | 59.5 | 47.9
OnAVOS[7] | CVPRW2017 | 55.2 | 60.1 | 46.6 | 62.7 | 51.4
OSVOS[5] | CVPR2017 | 58.8 | 59.8 | 54.2 | 60.5 | 60.7
OSNM[26] | CVPR2018 | 51.2 | 60.0 | 40.6 | 60.1 | 44.0
RGMP[8] | CVPR2018 | 53.8 | 59.5 | 45.2 | -- | --
AGAME[14] | CVPR2019 | 66.0 | 66.9 | 61.2 | -- | --
STM[11] | ICCV2019 | 78.9 | 78.6 | 73.3 | 82.8 | 80.9
FRTM[21] | CVPR2020 | 65.7 | 68.6 | 58.4 | 71.3 | 64.5
SSM[19] | TCSVT2021 | 66.5 | 72.3 | 57.8 | 73.3 | 62.6
TransVOS[22] | CVPR2021 | 81.8 | 82.0 | 75.0 | 86.7 | 83.4
ILTR[24] | Chinese Journal of Computers 2022 | 73.8 | 73.9 | 67.5 | 77.9 | 75.7
KMNM[17] | TPAMI2022 | 81.4 | 81.4 | 75.3 | 85.6 | 83.3
LLB[30] | AAAI2023 | 83.8 | 82.1 | 79.1 | 87.0 | 87.0
MGLAS | This paper | 83.0 | 81.9 | 77.9 | 86.5 | 85.7
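The overall score G in Table 2 follows the YouTube-VOS convention: the average of region similarity and contour accuracy over object categories seen (subscript s) and unseen (subscript u) during training. As a quick consistency check against the MGLAS row:

```latex
\[
\mathcal{G} = \tfrac{1}{4}\bigl(\mathcal{J}_s + \mathcal{J}_u + \mathcal{F}_s + \mathcal{F}_u\bigr),
\qquad
\tfrac{1}{4}\,(81.9 + 77.9 + 86.5 + 85.7) = 83.0 .
\]
```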
Table 3 Ablation study of the proposed algorithm on the DAVIS2017 validation set (MFEM: multi-scale feature enhancement module; GLFAM: global-local feature aggregation module; CFM: cross-layer fusion module)
Baseline | MFEM | GLFAM | CFM | J&F | J | F
√ | -- | -- | -- | 81.8 | 79.2 | 84.3
√ | √ | -- | -- | 83.2 | 79.9 | 86.5
√ | -- | √ | -- | 83.5 | 80.6 | 86.4
√ | -- | -- | √ | 83.5 | 80.0 | 86.9
√ | √ | √ | √ | 84.5 | 81.6 | 87.3
-
[1] ERDÉLYI A, BARÁT T, VALET P, et al. Adaptive cartooning for privacy protection in camera networks[C]. 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Seoul, Korea (South), 2014: 44–49. doi: 10.1109/AVSS.2014.6918642.
[2] WANG Wenguan, SHEN Jianbing, PORIKLI F, et al. Semi-supervised video object segmentation with super-trajectories[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(4): 985–998. doi: 10.1109/TPAMI.2018.2819173.
[3] SALEH K, HOSSNY M, and NAHAVANDI S. Kangaroo vehicle collision detection using deep semantic segmentation convolutional neural network[C]. 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), Gold Coast, Australia, 2016: 1–7. doi: 10.1109/DICTA.2016.7797057.
[4] LU Xiankai, WANG Wenguan, SHEN Jianbing, et al. Learning video object segmentation from unlabeled videos[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 8957–8967. doi: 10.1109/CVPR42600.2020.00898.
[5] CAELLES S, MANINIS K K, PONT-TUSET J, et al. One-shot video object segmentation[C]. The IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 5320–5329. doi: 10.1109/CVPR.2017.565.
[6] CHENG H K, TAI Y W, and TANG C K. Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 5555–5564. doi: 10.1109/CVPR46437.2021.00551.
[7] VOIGTLAENDER P and LEIBE B. Online adaptation of convolutional neural networks for video object segmentation[C]. British Machine Vision Conference 2017, London, UK, 2017.
[8] OH S W, LEE J Y, SUNKAVALLI K, et al. Fast video object segmentation by reference-guided mask propagation[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7376–7385. doi: 10.1109/CVPR.2018.00770.
[9] XU Jindong, ZHAO Tianyu, FENG Guozheng, et al. Image segmentation algorithm based on context fuzzy C-means clustering[J]. Journal of Electronics & Information Technology, 2021, 43(7): 2079–2086. doi: 10.11999/JEIT200263.
[10] HANG Hao, HUANG Yingping, ZHANG Xurui, et al. Design of Swin Transformer for semantic segmentation of road scenes[J]. Opto-Electronic Engineering, 2024, 51(1): 230304. doi: 10.12086/oee.2024.230304.
[11] OH S W, LEE J Y, XU Ning, et al. Video object segmentation using space-time memory networks[C]. The IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 9225–9234. doi: 10.1109/ICCV.2019.00932.
[12] LUITEN J, VOIGTLAENDER P, and LEIBE B. PReMVOS: Proposal-generation, refinement and merging for video object segmentation[C]. 14th Asian Conference on Computer Vision, Perth, Australia, 2019: 565–580. doi: 10.1007/978-3-030-20870-7_35.
[13] PERAZZI F, KHOREVA A, BENENSON R, et al. Learning video object segmentation from static images[C]. The IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 3491–3500. doi: 10.1109/CVPR.2017.372.
[14] JOHNANDER J, DANELLJAN M, BRISSMAN E, et al. A generative appearance model for end-to-end video object segmentation[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 8945–8954. doi: 10.1109/CVPR.2019.00916.
[15] LI Yu, SHEN Zhuoran, and SHAN Ying. Fast video object segmentation using the global context module[C]. 16th European Conference on Computer Vision, Glasgow, UK, 2020: 735–750. doi: 10.1007/978-3-030-58607-2_43.
[16] SEONG H, HYUN J, and KIM E. Kernelized memory network for video object segmentation[C]. 16th European Conference on Computer Vision, Glasgow, UK, 2020: 629–645. doi: 10.1007/978-3-030-58542-6_38.
[17] SEONG H, HYUN J, and KIM E. Video object segmentation using kernelized memory network with multiple kernels[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(2): 2595–2612. doi: 10.1109/TPAMI.2022.3163375.
[18] KINGMA D P and BA J. Adam: A method for stochastic optimization[C]. 3rd International Conference on Learning Representations, San Diego, USA, 2015.
[19] ZHU Wencheng, LI Jiahao, LU Jiwen, et al. Separable structure modeling for semi-supervised video object segmentation[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(1): 330–344. doi: 10.1109/TCSVT.2021.3060015.
[20] CHO S, LEE M, LEE S, et al. Treating motion as option to reduce motion dependency in unsupervised video object segmentation[C]. The IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2023: 5129–5138. doi: 10.1109/WACV56688.2023.00511.
[21] ROBINSON A, LAWIN F J, DANELLJAN M, et al. Learning fast and robust target models for video object segmentation[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 7404–7413. doi: 10.1109/CVPR42600.2020.00743.
[22] MEI Jianbiao, WANG Mengmeng, LIN Yeneng, et al. TransVOS: Video object segmentation with transformers[J]. arXiv: 2106.00588, 2021. doi: 10.48550/arXiv.2106.00588.
[23] GAO Bocong, ZHAO Yuqian, ZHANG Fan, et al. Video object segmentation based on multi-level target models and feature integration[J]. Neurocomputing, 2022, 492: 396–407. doi: 10.1016/j.neucom.2022.04.042.
[24] XU Kai, LI Guorong, HONG Dexiang, et al. A fast video object segmentation method based on inductive learning and transductive reasoning[J]. Chinese Journal of Computers, 2022, 45(10): 2117–2132. doi: 10.11897/SP.J.1016.2022.02117.
[25] MANINIS K K, CAELLES S, CHEN Yuhua, et al. Video object segmentation without temporal information[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(6): 1515–1530. doi: 10.1109/TPAMI.2018.2838670.
[26] YANG Linjie, WANG Yanran, XIONG Xuehan, et al. Efficient video object segmentation via network modulation[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6499–6507. doi: 10.1109/CVPR.2018.00680.
[27] CHENG Jingchun, TSAI Y H, HUNG W C, et al. Fast and accurate online video object segmentation via tracking parts[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7415–7424. doi: 10.1109/CVPR.2018.00774.
[28] WANG Ziqin, XU Jun, LIU Li, et al. RANet: Ranking attention network for fast video object segmentation[C]. The IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 3977–3986. doi: 10.1109/ICCV.2019.00408.
[29] SUN Mingjie, XIAO Jimin, LIM E G, et al. Fast template matching and update for video object tracking and segmentation[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 10788–10796. doi: 10.1109/CVPR42600.2020.01080.
[30] LAN Meng, ZHANG Jing, ZHANG Lefei, et al. Learning to learn better for video object segmentation[C]. The AAAI Conference on Artificial Intelligence, Washington, USA, 2023: 1205–1212. doi: 10.1609/aaai.v37i1.25203.