Action Recognition Based on Multi-model Voting with Cross Layer Fusion
doi: 10.11999/JEIT180373
-
School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China
-
Abstract:
To address the loss of motion features during their transmission through a deep convolutional neural network and the overfitting of the network model, an action recognition method based on a cross layer fusion model and multi-model voting is proposed. In the preprocessing stage, the motion information in each video is aggregated by rank pooling to generate approximate dynamic images. A structure that horizontally flips the feature maps is placed before the fully connected layer, forming the "non-fusion model"; adding to this model a structure that fuses the output features of the second layer with those of the fifth layer yields the "cross layer fusion model". During training, the two basic models are trained under three data partitions and with the two frame orders (forward and reverse) used to generate approximate dynamic images, producing multiple distinct classifiers. During testing, the predictions of these classifiers are combined by voting to obtain the final classification result. On the UCF101 dataset, the proposed non-fusion and cross layer fusion models improve the recognition rate considerably over the dynamic image network model, and the multi-model voting method effectively alleviates overfitting, increases the robustness of the algorithm, and achieves better average performance.
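For illustration, the preprocessing step can be sketched as follows. This is a minimal sketch of approximate rank pooling in the style of the dynamic image networks that the paper builds on and compares against; the coefficient formula and the rescaling to an 8-bit image are assumptions, since this excerpt does not spell out the exact pipeline.

```python
import numpy as np

def approximate_dynamic_image(frames):
    """Collapse a video into one 'approximate dynamic image' by
    weighting frames with approximate rank-pooling coefficients.

    frames: float or uint8 array of shape (T, H, W, C), temporal order.
    """
    T = len(frames)
    # Harmonic numbers H_t = sum_{i=1}^{t} 1/i, with H_0 = 0.
    H = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])
    t = np.arange(1, T + 1)
    # Approximate rank-pooling weights (Bilen et al. style, assumed).
    alpha = 2.0 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])
    di = np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))
    # Rescale to [0, 255] so the result can be fed to an image CNN.
    di = 255.0 * (di - di.min()) / max(np.ptp(di), 1e-8)
    return di.astype(np.uint8)
```

Calling the function on the reversed frame sequence produces the second, "reverse order" approximate dynamic image used to train additional classifiers.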
-
Keywords:
- Action recognition
- Cross layer fusion
- Multi-model voting
- Approximate dynamic image
- Horizontal flipping
-
Table 1  Average recognition accuracy (%) of the fusion model under four different fusion weights

Model              Fusion 0.50   Fusion 0.25   Fusion 0.20   Fusion 0.10
Average accuracy   53.89         63.12         63.94         64.82
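Table 1 reads as a sweep over the scalar weight with which the (resized) layer-2 features are blended into the layer-5 features, with 0.10 performing best. Below is a minimal PyTorch sketch of such a weighted cross layer fusion on a toy five-convolutional-layer backbone; the backbone itself, the 1x1 channel projection, the pooling-based resizing, and the reading of the horizontal-flip structure as averaging features with their mirror image are all assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerFusionNet(nn.Module):
    """Toy five-conv-layer network with layer-2 -> layer-5 fusion."""

    def __init__(self, num_classes=101, fuse_weight=0.10):
        super().__init__()
        self.fuse_weight = fuse_weight  # 0.10 was best in Table 1
        self.layers1_2 = nn.Sequential(            # conv layers 1-2
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU())
        self.layers3_5 = nn.Sequential(            # conv layers 3-5
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU())
        # 1x1 projection so layer-2 features match layer-5 channels.
        self.proj = nn.Conv2d(128, 256, 1)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):
        x2 = self.layers1_2(x)                     # layer-2 features
        x5 = self.layers3_5(x2)                    # layer-5 features
        # Cross layer fusion: resize and project the layer-2 features,
        # then add them to the layer-5 features with a small weight.
        x2 = F.adaptive_avg_pool2d(self.proj(x2), x5.shape[-2:])
        x5 = x5 + self.fuse_weight * x2
        # Stand-in for the horizontal-flip structure before the fully
        # connected layer: average features with their mirror image.
        x5 = 0.5 * (x5 + torch.flip(x5, dims=[-1]))
        return self.fc(x5.mean(dim=(-2, -1)))      # global average pool
```

Omitting the two fusion lines in `forward` recovers the "non-fusion model" of the paper.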
Table 2  Action recognition accuracy (%) of the cross layer fusion model

Model            HulaHoop             Typing               MilitaryParade       PlayingGuitar        ThrowDiscus          Class average
split1+forward   $\textbf{87.14}$     80.40                $\underline{87.14}$  $\underline{91.33}$  $\underline{77.45}$  82.47
split1+reverse   $\underline{86.29}$  79.63                $\textbf{87.90}$     $\textbf{91.65}$     76.86                82.16
split2+forward   77.28                88.35                86.64                89.29                73.60                $\underline{83.06}$
split2+reverse   76.66                $\underline{88.88}$  86.27                90.88                71.31                $\textbf{83.87}$
split3+forward   78.72                $\textbf{89.25}$     87.02                91.21                $\textbf{78.20}$     83.03
split3+reverse   78.91                86.46                86.99                90.66                76.65                82.79

Note: Bold denotes the highest recognition rate in each action class; underline denotes the second highest.
Table 3  Recognition accuracy (%) of VADMMR on the five action classes

Method   HulaHoop   Typing   MilitaryParade   PlayingGuitar   ThrowDiscus   Class average
VADMMR   83.77      87.43    88.83            91.58           79.83         84.67
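The voting step combines the classifiers produced by the two basic models, the three data partitions, and the two frame orders. A minimal sketch, assuming soft voting over per-class scores (a hard majority vote over predicted labels would be the obvious alternative reading of "voting"):

```python
import numpy as np

def soft_vote(prob_list):
    """Average per-class probabilities across classifiers and return
    the winning class index for each test sample."""
    return np.stack(prob_list).mean(axis=0).argmax(axis=1)

# Hypothetical usage: 12 classifiers (2 models x 3 splits x 2 orders)
# scoring 5 test videos over the 101 UCF101 classes.
rng = np.random.default_rng(0)
prob_list = [rng.dirichlet(np.ones(101), size=5) for _ in range(12)]
predictions = soft_vote(prob_list)   # shape: (5,)
```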