
Lipreading Method Based on Multi-Scale Spatiotemporal Convolution

YE Hong, WEI Jinsong, JIA Zhaohong, ZHENG Hui, LIANG Dong, TANG Jun

Citation: YE Hong, WEI Jinsong, JIA Zhaohong, ZHENG Hui, LIANG Dong, TANG Jun. Lipreading Method Based on Multi-Scale Spatiotemporal Convolution[J]. Journal of Electronics & Information Technology, 2024, 46(11): 4170-4177. doi: 10.11999/JEIT240161


doi: 10.11999/JEIT240161
Funds: The National Natural Science Foundation of China (71971002, 62273001), The Provincial Natural Science Foundation of Anhui (2108085QA35), Anhui Provincial Key Research and Development Project (202004a07020050), Anhui Provincial Major Science and Technology Project (202003A06020016), The Excellent Research and Innovation Teams in Anhui Province's Universities (2022AH010005)
Details
    About the authors:

    YE Hong: male, master's supervisor; research interests include deep learning, artificial intelligence, architecture optimization, and parallel computing

    WEI Jinsong: male, master's student; his research interest is computer vision

    JIA Zhaohong: female, professor; research interests include artificial intelligence, decision support, and multi-objective optimization

    ZHENG Hui: male, lecturer; research interests include multimodal perception and computer vision

    LIANG Dong: male, professor; research interests include computer vision and pattern recognition, signal processing and intelligent systems

    TANG Jun: male, professor; research interests include computer vision and machine learning

    Corresponding author:

    ZHENG Hui, huizheng@ahu.edu.cn

  • CLC number: TN911.73; TP391.41

  • Abstract: Most existing lipreading models combine a single-layer 3D convolution with a 2D Convolutional Neural Network (CNN) to extract joint spatiotemporal features from lip video sequences. However, a single 3D convolutional layer captures temporal information poorly, and 2D CNNs have limited ability to mine fine-grained lip features. To address this, a Multi-Scale Lipreading Network (MS-LipNet) is proposed. In a Res2Net backbone, 3D spatiotemporal convolutions replace the traditional 2D convolutions to better extract joint spatiotemporal features, and a SpatioTemporal Coordinate Attention (STCA) module is proposed to make the network focus on task-relevant regional features. Experiments on the LRW and LRW-1000 datasets verify the effectiveness of the proposed method.
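
    As a rough illustration of the backbone idea (not the paper's released code), the sketch below shows a Res2Net-style block whose 2D convolutions are replaced with 3D spatiotemporal ones; the scale count, kernel size, and normalization choices are assumptions.

```python
import torch
import torch.nn as nn

class STRes2NetBlock(nn.Module):
    """Sketch of a Res2Net-style block with 3D (spatiotemporal) convolutions.
    The channel split count (scale), kernel sizes, and normalization are
    illustrative assumptions, not values taken from the paper."""

    def __init__(self, channels: int, scale: int = 4):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        # One 3x3x3 spatiotemporal conv per subset; the first subset passes through.
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv3d(width, width, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm3d(width),
                nn.ReLU(inplace=True),
            )
            for _ in range(scale - 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        subsets = torch.chunk(x, self.scale, dim=1)
        outputs, prev = [subsets[0]], None
        for i, conv in enumerate(self.convs):
            # Hierarchical residual connections yield multi-scale receptive fields.
            inp = subsets[i + 1] if prev is None else subsets[i + 1] + prev
            prev = conv(inp)
            outputs.append(prev)
        return torch.cat(outputs, dim=1) + x  # residual connection (assumed)

# Usage: two clips of 29 frames of 88x88 mouth crops with 64 channels.
# y = STRes2NetBlock(64)(torch.randn(2, 64, 29, 88, 88))
```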
  • Figure 1  Overall framework of MS-LipNet

    Figure 2  Structure of the STCA attention module

    Figure 3  Structure of the STCA submodule

    Figure 4  Structure of ST-Res2Net

    Figure 5  Comparison of model saliency maps
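
    Figures 2 and 3 specify the STCA module, which this excerpt does not reproduce, so the following is only an educated guess at what spatiotemporal coordinate attention could look like if coordinate attention [16] were extended with a temporal axis. The shared squeeze layer, reduction ratio, and gating layout are all assumptions.

```python
import torch
import torch.nn as nn

class SpatioTemporalCoordAttention(nn.Module):
    """Hypothetical STCA block: coordinate attention [16] with an added
    temporal axis. The paper's actual design (Figures 2-3) may differ."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.squeeze = nn.Sequential(
            nn.Conv3d(channels, mid, kernel_size=1),
            nn.BatchNorm3d(mid),
            nn.ReLU(inplace=True),
        )
        # One gate per coordinate axis (time, height, width).
        self.gate_t = nn.Conv3d(mid, channels, kernel_size=1)
        self.gate_h = nn.Conv3d(mid, channels, kernel_size=1)
        self.gate_w = nn.Conv3d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        # Directional pooling keeps one axis and averages over the other two.
        t_desc = x.mean(dim=(3, 4), keepdim=True)  # (B, C, T, 1, 1)
        h_desc = x.mean(dim=(2, 4), keepdim=True)  # (B, C, 1, H, 1)
        w_desc = x.mean(dim=(2, 3), keepdim=True)  # (B, C, 1, 1, W)
        a_t = torch.sigmoid(self.gate_t(self.squeeze(t_desc)))
        a_h = torch.sigmoid(self.gate_h(self.squeeze(h_desc)))
        a_w = torch.sigmoid(self.gate_w(self.squeeze(w_desc)))
        return x * a_t * a_h * a_w  # gates broadcast over the pooled axes
```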

    Table 1  Comparison of recognition accuracy (%) of different methods on the LRW and LRW-1000 datasets

    Method                                   LRW    LRW-1000
    Two-Stream ResNet18+BiLSTM [18]          84.07  –
    2×ResNet18+BiGRU [19]                    84.13  41.93
    ResNet18+3×BiGRU+MI [30]                 84.41  38.79
    ResNet18+MS-TCN [9]                      85.30  41.40
    SE-ResNet18+BiGRU [22]                   85.00  48.00
    3D-ResNet18+BiGRU+TSM [20]               86.23  44.60
    ResNet18+HPConv+self-attention [13]      86.83  –
    WPCL+APFF [31]                           88.30  49.40
    ResNet-18+DC-TCN [10]                    88.36  43.65
    2DCNN+BiGRU+Lip Segmentation [32]        90.38  –
    ResNet18+DC-TCN+TimeMask [33]            90.40  –
    MS-LipNet                                91.56  50.68

    Table 2  Ablation results (%) for the different components of MS-LipNet

    Mixup  Cutout  STCA   LRW    LRW-1000
    ×      ×       ×      90.95  50.12
    ✓      ×       ×      91.06  50.40
    ×      ✓       ×      91.01  50.42
    ×      ×       ✓      91.21  50.25
    ✓      ✓       ×      91.18  50.06
    ✓      ×       ✓      91.39  50.56
    ×      ✓       ✓      91.48  50.50
    ✓      ✓       ✓      91.56  50.68

    (Mixup and Cutout are the data augmentations; STCA is the attention module.)
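
    Mixup [26] in Table 2 blends pairs of training examples and their labels. A minimal sketch for batches of video clips, assuming clip tensors of shape (batch, channels, time, height, width) and an illustrative alpha (the paper's value is not given in this excerpt):

```python
import torch

def mixup(inputs: torch.Tensor, targets: torch.Tensor, alpha: float = 0.2):
    """Minimal mixup [26]: blend each clip with a randomly permuted partner.
    alpha = 0.2 is an illustrative choice, not the paper's setting."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(inputs.size(0))
    mixed = lam * inputs + (1.0 - lam) * inputs[perm]
    # The classification loss is blended the same way:
    # loss = lam * ce(logits, targets) + (1 - lam) * ce(logits, targets[perm])
    return mixed, targets, targets[perm], lam
```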

    Table 3  Effect of different Cutout settings on the results (%)

    n_holes  length  LRW    LRW-1000
    0        0       91.39  50.50
    1        11      91.41  50.51
    1        22      91.44  50.53
    1        44      91.42  50.55
    2        11      91.49  50.54
    2        22      91.56  50.68
    2        44      91.43  50.58
    3        11      91.50  50.65
    3        22      91.37  50.59
    3        44      90.72  50.51
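
    The n_holes and length columns in Table 3 follow Cutout [25]: the number of zeroed square patches and their side length in pixels. A minimal per-clip sketch; whether the mask is shared across frames or resampled per frame is not stated in this excerpt, so one shared mask is assumed:

```python
import torch

def cutout_video(clip: torch.Tensor, n_holes: int, length: int) -> torch.Tensor:
    """Cutout [25] on a clip of shape (channels, time, height, width):
    zero n_holes square patches of side `length`, shared across frames."""
    _, _, h, w = clip.shape
    masked = clip.clone()
    for _ in range(n_holes):
        cy = int(torch.randint(0, h, (1,)))  # patch center, sampled uniformly
        cx = int(torch.randint(0, w, (1,)))
        y1, y2 = max(cy - length // 2, 0), min(cy + length // 2, h)
        x1, x2 = max(cx - length // 2, 0), min(cx + length // 2, w)
        masked[:, :, y1:y2, x1:x2] = 0.0
    return masked

# Best setting in Table 3: n_holes=2, length=22 (input crop size assumed 88x88).
# aug = cutout_video(torch.randn(1, 29, 88, 88), n_holes=2, length=22)
```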
  • [1] TAYLOR S L, MAHLER M, THEOBALD B J, et al. Dynamic units of visual speech[C]. ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Lausanne, Switzerland, 2012: 275–284.
    [2] LI Dengshi, GAO Yu, ZHU Chenyi, et al. Improving speech recognition performance in noisy environments by enhancing lip reading accuracy[J]. Sensors, 2023, 23(4): 2053. doi: 10.3390/s23042053.
    [3] IVANKO D, RYUMIN D, and KARPOV A. Automatic lip-reading of hearing impaired people[J]. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2019, XLII-2/W12: 97–101. doi: 10.5194/isprs-archives-XLII-2-W12-97-2019.
    [4] GONZALEZ-LOPEZ J A, GOMEZ-ALANIS A, DOÑAS J M M, et al. Silent speech interfaces for speech restoration: A review[J]. IEEE Access, 2020, 8: 177995–178021. doi: 10.1109/ACCESS.2020.3026579.
    [5] EZZ M, MOSTAFA A M, and NASR A A. A silent password recognition framework based on lip analysis[J]. IEEE Access, 2020, 8: 55354–55371. doi: 10.1109/ACCESS.2020.2982359.
    [6] WANG Changhai, XU Yuwei, and ZHANG Jianzhong. Hierarchical classification-based smartphone displacement free activity recognition[J]. Journal of Electronics & Information Technology, 2017, 39(1): 191–197. doi: 10.11999/JEIT160253.
    [7] STAFYLAKIS T and TZIMIROPOULOS G. Combining residual networks with LSTMs for lipreading[C]. 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, 2017.
    [8] STAFYLAKIS T, KHAN M H, and TZIMIROPOULOS G. Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs[J]. Computer Vision and Image Understanding, 2018, 176/177: 22–32. doi: 10.1016/j.cviu.2018.10.003.
    [9] MARTINEZ B, MA Pingchuan, PETRIDIS S, et al. Lipreading using temporal convolutional networks[C]. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020: 6319–6323. doi: 10.1109/ICASSP40776.2020.9053841.
    [10] MA Pingchuan, WANG Yijiang, SHEN Jie, et al. Lip-reading with densely connected temporal convolutional networks[C]. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, USA, 2021: 2856–2865. doi: 10.1109/WACV48630.2021.00290.
    [11] TRAN D, WANG Heng, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6450–6459. doi: 10.1109/CVPR.2018.00675.
    [12] QIU Zhaofan, YAO Ting, and MEI Tao. Learning spatio-temporal representation with pseudo-3D residual networks[C]. Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 5534–5542. doi: 10.1109/ICCV.2017.590.
    [13] CHEN Hang, DU Jun, HU Yu, et al. Automatic lip-reading with hierarchical pyramidal convolution and self-attention for image sequences with no word boundaries[C]. 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 2021: 3001–3005.
    [14] GAO Shanghua, CHENG Mingming, ZHAO Kai, et al. Res2Net: A new multi-scale backbone architecture[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(2): 652–662. doi: 10.1109/TPAMI.2019.2938758.
    [15] HU Jie, SHEN Li, and SUN Gang. Squeeze-and-excitation networks[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7132–7141. doi: 10.1109/CVPR.2018.00745.
    [16] HOU Qibin, ZHOU Daquan, and FENG Jiashi. Coordinate attention for efficient mobile network design[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 13708–13717. doi: 10.1109/CVPR46437.2021.01350.
    [17] CHUNG J S and ZISSERMAN A. Lip reading in the wild[C]. 13th Asian Conference on Computer Vision, Taipei, China, 2017: 87–103. doi: 10.1007/978-3-319-54184-6_6.
    [18] WENG Xinshuo and KITANI K. Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading[C]. 30th British Machine Vision Conference 2019, Cardiff, UK, 2019.
    [19] XIAO Jingyun, YANG Shuang, ZHANG Yuanhang, et al. Deformation flow based two-stream network for lip reading[C]. 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 2020: 364–370. doi: 10.1109/FG47880.2020.00132.
    [20] HAO Mingfeng, MAMUT M, YADIKAR N, et al. How to use time information effectively? Combining with time shift module for lipreading[C]. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada, 2021: 7988–7992. doi: 10.1109/ICASSP39728.2021.9414659.
    [21] REN Yongmei, YANG Jie, GUO Zhiqiang, et al. Self-adaptive entropy weighted decision fusion method for ship image classification based on multi-scale convolutional neural network[J]. Journal of Electronics & Information Technology, 2021, 43(5): 1424–1431. doi: 10.11999/JEIT200102.
    [22] FENG Dalu, YANG Shuang, SHAN Shiguang, et al. Learn an effective lip reading model without pains[EB/OL]. https://arxiv.org/abs/2011.07557, 2020.
    [23] XUE Feng, YANG Tian, LIU Kang, et al. LCSNet: End-to-end lipreading with channel-aware feature selection[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2023, 19(1s): 28. doi: 10.1145/3524620.
    [24] FU Yixian, LU Yuanyao, and NI Ran. Chinese lip-reading research based on ShuffleNet and CBAM[J]. Applied Sciences, 2023, 13(2): 1106. doi: 10.3390/app13021106.
    [25] DEVRIES T and TAYLOR G W. Improved regularization of convolutional neural networks with cutout[EB/OL]. https://arxiv.org/abs/1708.04552, 2017.
    [26] ZHANG Hongyi, CISSE M, DAUPHIN Y N, et al. mixup: Beyond empirical risk minimization[C]. 6th International Conference on Learning Representations, Vancouver, Canada, 2018.
    [27] YANG Shuang, ZHANG Yuanhang, FENG Dalu, et al. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild[C]. 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 2019: 1–8. doi: 10.1109/FG.2019.8756582.
    [28] KING D E. Dlib-ml: A machine learning toolkit[J]. The Journal of Machine Learning Research, 2009, 10: 1755–1758.
    [29] LOSHCHILOV I and HUTTER F. Decoupled weight decay regularization[C]. 7th International Conference on Learning Representations, New Orleans, USA, 2019.
    [30] ZHAO Xing, YANG Shuang, SHAN Shiguang, et al. Mutual information maximization for effective lip reading[C]. 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 2020: 420–427. doi: 10.1109/FG47880.2020.00133.
    [31] TIAN Weidong, ZHANG Housen, PENG Chen, et al. Lipreading model based on whole-part collaborative learning[C]. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022: 2425–2429. doi: 10.1109/ICASSP43922.2022.9747052.
    [32] MILED M, MESSAOUD M A B, and BOUZID A. Lip reading of words with lip segmentation and deep learning[J]. Multimedia Tools and Applications, 2023, 82(1): 551–571. doi: 10.1007/s11042-022-13321-0.
    [33] MA Pingchuan, WANG Yujiang, PETRIDIS S, et al. Training strategies for improved lip-reading[C]. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022: 8472–8476. doi: 10.1109/ICASSP43922.2022.9746706.
Publication history
  • Received:  2024-03-12
  • Revised:  2024-09-10
  • Available online:  2024-09-16
  • Published in issue:  2024-11-10
