
Non-Autoregressive Sign Language Translation Technology Based on Transformer and Multimodal Alignment

SHAO Shuyu, DU Yao, FAN Xiaoli

Citation: SHAO Shuyu, DU Yao, FAN Xiaoli. Non-Autoregressive Sign Language Translation Technology Based on Transformer and Multimodal Alignment[J]. Journal of Electronics & Information Technology, 2024, 46(7): 2932-2941. doi: 10.11999/JEIT230801


doi: 10.11999/JEIT230801
Funds: The National Natural Science Foundation of China (8210072143), R&D Program of Beijing Municipal Education Commission (KM202210037001)
Details
    Author biographies:

    SHAO Shuyu: Male, Associate Professor. His research interests include signal processing and reliability analysis of complex systems.

    DU Yao: Male, Ph.D. candidate. His research interest is pattern recognition.

    FAN Xiaoli: Female, Senior Engineer. Her research interests include biomedical signal processing and pattern recognition.

    Corresponding author:

    SHAO Shuyu, shaoshuyu@bwu.edu.cn

  • CLC number: TN108.4; TP391

  • Abstract: To address the problems of multimodal data alignment and the slow speed of sign language translation, this paper proposes a non-autoregressive sign language translation model (Trans-SLT-NA) based on the self-attention model Transformer, and introduces a contrastive learning loss function to align the multimodal data. By learning the contextual information of the input sequence (sign language video) and the target sequence (text), as well as the interaction between them, the model translates sign language into natural language in a single pass. The proposed model is evaluated on the public datasets PHOENIX-2014T (German), CSL-Daily (Chinese), and How2Sign (English). The results show that the proposed method speeds up translation by a factor of 11.6 to 17.6 over autoregressive models, while remaining close to them on the BiLingual Evaluation Understudy (BLEU-4) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics.
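The page carries no code, but the abstract's central idea, decoding every target token in one forward pass instead of autoregressively one token at a time, is easy to sketch. Below is a minimal PyTorch illustration of such non-autoregressive decoding; all names (NARDecoderSketch, the slot embeddings, the fixed target_len) are illustrative assumptions rather than the authors' Trans-SLT-NA implementation, which in practice would also predict the target length from the video encoding.

```python
# Minimal sketch of one-pass (non-autoregressive) Transformer decoding.
# Hypothetical shapes and names; not the paper's Trans-SLT-NA code.
import torch
import torch.nn as nn

class NARDecoderSketch(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, max_len=64):
        super().__init__()
        # Learned "slot" embeddings replace the shifted target tokens that
        # an autoregressive decoder would consume step by step.
        self.slots = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, video_memory, target_len):
        # video_memory: (batch, src_len, d_model) from the video encoder.
        batch = video_memory.size(0)
        positions = torch.arange(target_len, device=video_memory.device)
        queries = self.slots(positions).unsqueeze(0).expand(batch, -1, -1)
        # No causal mask: every target position attends to the full video
        # memory and to all other positions, so one forward pass suffices.
        hidden = self.decoder(queries, video_memory)
        return self.proj(hidden)  # (batch, target_len, vocab_size)

video_memory = torch.randn(2, 120, 256)   # e.g. two clips of 120 encoded frames
logits = NARDecoderSketch()(video_memory, target_len=20)
tokens = logits.argmax(dim=-1)            # all 20 tokens emitted at once
```

The absence of a step-by-step dependency is what buys the 11.6x to 17.6x speedups reported below, at some cost in BLEU-4 relative to the strongest autoregressive baselines.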
  • Figure 1  Transformer-based framework for continuous sign language recognition and translation

    Figure 2  Overall architecture of the Trans-SLT-NA model

    Figure 3  Structure of the video encoder

    Figure 4  t-SNE visualization of the video representation vectors and text vectors

Table 1  Datasets used for model training

| Dataset | Language | Train | Dev | Test | Total |
|---|---|---|---|---|---|
| PHOENIX-2014T | German | 7,096 | 519 | 642 | 8,257 |
| CSL-Daily | Chinese | 18,401 | 1,077 | 1,176 | 20,654 |
| How2Sign | English | 31,128 | 1,741 | 2,322 | 35,191 |

Table 2  Results on the PHOENIX-2014T dataset

| Method | Generation | Dev BLEU-4 | Dev ROUGE | Test BLEU-4 | Test ROUGE | Inference speed |
|---|---|---|---|---|---|---|
| RNN-based [15] | AR | 9.94 | 31.8 | 9.58 | 31.8 | 2.3× |
| SLTR-T [5] | AR | 20.69 | – | 20.17 | – | 1.0× |
| Multi-C [26] | AR | 19.51 | 44.59 | 18.51 | 43.57 | – |
| STMC-T [24] | AR | 24.09 | 48.24 | 23.65 | 46.65 | – |
| PiSLTRc [17] | AR | 21.48 | 47.89 | 21.29 | 48.13 | 0.92× |
| Trans-SLT-NA | NAR | 18.81 | 47.32 | 19.03 | 48.22 | 11.6× |

Note: AR denotes autoregressive generation; NAR denotes non-autoregressive generation.

Table 3  Comparison results on the CSL-Daily dataset

| Method | Generation | Dev BLEU-4 | Dev ROUGE | Test BLEU-4 | Test ROUGE | Inference speed |
|---|---|---|---|---|---|---|
| SLTR-T [5] | AR | 11.88 | 37.06 | 11.79 | 36.74 | 1.0× |
| Sign Back-Tran [19] | AR | 20.80 | 49.49 | 21.34 | 49.31 | 0.89× |
| ConSLT [27] | AR | 14.80 | 41.46 | 14.53 | 40.98 | – |
| Trans-SLT-NA | NAR | 16.22 | 43.74 | 16.72 | 44.67 | 13.4× |

Table 4  Comparison results on the How2Sign dataset

| Method | Generation | Dev BLEU-4 | Dev ROUGE | Test BLEU-4 | Test ROUGE | Inference speed |
|---|---|---|---|---|---|---|
| Baseline | AR | 8.89 | – | 8.03 | – | 1.0× |
| Trans-SLT-NA | NAR | 8.14 | 32.84 | 8.58 | 33.17 | 17.6× |

Table 5  Effectiveness of multimodal data alignment (model: Trans-SLT-NA)

| Dataset | Data alignment | Dev BLEU-4 | Dev ROUGE | Test BLEU-4 | Test ROUGE |
|---|---|---|---|---|---|
| PHOENIX-2014T | w | 18.81 | 47.32 | 19.03 | 48.22 |
| PHOENIX-2014T | w/o | 16.02 | 43.21 | 15.97 | 42.85 |
| CSL-Daily | w | 16.22 | 43.74 | 16.72 | 44.67 |
| CSL-Daily | w/o | 14.43 | 42.27 | 15.21 | 42.84 |
| How2Sign | w | 8.14 | 32.84 | 8.58 | 33.17 |
| How2Sign | w/o | 7.81 | 30.16 | 8.23 | 30.59 |

Note: "w" denotes training with data alignment; "w/o" denotes training without it.
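Table 5 shows the contrastive alignment of the two modalities is worth roughly 1 to 3 BLEU-4 points on every dataset. The page does not reproduce the loss itself; the sketch below assumes a standard InfoNCE-style formulation over matched video/text embedding pairs, with every name hypothetical.

```python
# Hedged sketch of a contrastive alignment loss between pooled video and
# text embeddings; an assumption about the loss family, not necessarily
# the paper's exact formulation.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, d); row i of each is a matched pair.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: pull matched video/text pairs together in
    # the shared space, push mismatched pairs apart.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Figure 4's t-SNE plot of the video representation vectors and text vectors is presumably a visualization of this shared embedding space.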

Table 6  Effect of the spatial Embedding backbone on model performance

| Spatial Embedding | Pretrained | Dev BLEU-4 | Dev ROUGE | Test BLEU-4 | Test ROUGE |
|---|---|---|---|---|---|
| VGG-19 | w/o | 14.42 | 38.76 | 14.36 | 39.17 |
| ResNet-50 | w/o | 15.57 | 40.26 | 15.33 | 41.17 |
| EfficientNet-B0 | w/o | 16.32 | 40.11 | 16.04 | 41.27 |
| VGG-19 | w | 16.84 | 43.31 | 16.17 | 42.09 |
| ResNet-50 | w | 17.79 | 45.63 | 16.93 | 44.53 |
| EfficientNet-B0 | w | 18.81 | 47.32 | 19.03 | 48.22 |
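Table 6 varies only the spatial Embedding backbone that turns each video frame into a feature vector, with pretraining ("w" rows) clearly helping. A hedged torchvision sketch of such a frame-wise extractor, where the pooling and exact wiring are illustrative assumptions:

```python
# Hypothetical frame-wise spatial feature extractor in the spirit of
# Table 6. weights=None matches the "w/o pretraining" rows; passing
# weights="IMAGENET1K_V1" would correspond to the pretrained setting.
import torch
import torchvision

backbone = torchvision.models.efficientnet_b0(weights=None).features
frames = torch.randn(16, 3, 224, 224)       # 16 RGB video frames
feat = backbone(frames).mean(dim=(2, 3))    # (16, 1280) pooled per-frame features
```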

Table 7  Effect of the loss-function hyperparameters on model performance

| $\lambda_{\mathrm{p}}$ | $\lambda_{\mathrm{c}}$ | Dev BLEU-4 | Dev ROUGE | Test BLEU-4 | Test ROUGE |
|---|---|---|---|---|---|
| 1.0 | 0 | 16.02 | 43.21 | 15.97 | 42.85 |
| 0.8 | 0.2 | 17.37 | 44.89 | 16.87 | 42.46 |
| 0.5 | 0.5 | 18.81 | 47.32 | 19.03 | 48.22 |
| 0.2 | 0.8 | 18.04 | 46.17 | 18.26 | 47.10 |
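Table 7 sweeps the weights of the prediction loss ($\lambda_{\mathrm{p}}$) and the contrastive alignment loss ($\lambda_{\mathrm{c}}$), with the balanced 0.5/0.5 setting performing best. Assuming the straightforward weighted sum these hyperparameters imply (the loss equation itself is not reproduced on this page):

$$ \mathcal{L} = \lambda_{\mathrm{p}}\,\mathcal{L}_{\mathrm{pred}} + \lambda_{\mathrm{c}}\,\mathcal{L}_{\mathrm{contrast}} $$

Consistently, the first row ($\lambda_{\mathrm{c}} = 0$) reproduces the PHOENIX-2014T "w/o alignment" figures of Table 5.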
  • [1] YAN Siyi, XUE Wanli, and YUAN Tiantian. Survey of sign language recognition and translation[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(11): 2415–2429. doi: 10.3778/j.issn.1673-9418.2205003.
    [2] TAO Tangfei and LIU Tianyu. A survey of sign language recognition technology based on sign language expression content and expression characteristics[J]. Journal of Electronics & Information Technology, 2023, 45(10): 3439–3457. doi: 10.11999/JEIT221051.
    [3] DUARTE A, PALASKAR S, VENTURA L, et al. How2Sign: A large-scale multimodal dataset for continuous American sign language[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 2734–2743. doi: 10.1109/CVPR46437.2021.00276.
    [4] ZHOU Leyuan, ZHANG Jianhua, YUAN Tiantian, et al. Sequence-to-sequence Chinese continuous sign language recognition and translation with multilayer attention mechanism fusion[J]. Computer Science, 2022, 49(9): 155–161. doi: 10.11896/jsjkx.210800026.
    [5] CAMGÖZ N C, KOLLER O, HADFIELD S, et al. Sign language transformers: Joint end-to-end sign language recognition and translation[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, 2020: 10020–10030. doi: 10.1109/CVPR42600.2020.01004.
    [6] HUANG Jie, ZHOU Wengang, ZHANG Qilin, et al. Video-based sign language recognition without temporal segmentation[C]. 32nd AAAI Conference on Artificial Intelligence, New Orleans, USA, 2018: 2257–2264. doi: 10.1609/aaai.v32i1.11903.
    [7] ZHOU Hao, ZHOU Wengang, and LI Houqiang. Dynamic pseudo label decoding for continuous sign language recognition[C]. 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 2019: 1282–1287. doi: 10.1109/ICME.2019.00223.
    [8] SONG Peipei, GUO Dan, XIN Haoran, et al. Parallel temporal encoder for sign language translation[C]. 2019 IEEE International Conference on Image Processing (ICIP), Taipei, China, 2019: 1915–1919. doi: 10.1109/ICIP.2019.8803123.
    [9] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [10] LU Fei, HAN Xiangzu, CHENG Xianpeng, et al. Sign language recognition based on lightweight 3D CNNs and transformer[J]. Journal of Huazhong University of Science and Technology (Natural Science Edition), 2023, 51(5): 13–18. doi: 10.13245/j.hust.230503.
    [11] WANG Hongyu, MA Shuming, DONG Li, et al. DeepNet: Scaling transformers to 1,000 layers[EB/OL]. https://arxiv.org/abs/2203.00555, 2022.
    [12] KISHORE P V V, KUMAR D A, SASTRY A S C S, et al. Motionlets matching with adaptive kernels for 3-D Indian sign language recognition[J]. IEEE Sensors Journal, 2018, 18(8): 3327–3337. doi: 10.1109/JSEN.2018.2810449.
    [13] XIAO Yisheng, WU Lijun, GUO Junliang, et al. A survey on non-autoregressive generation for neural machine translation and beyond[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(10): 11407–11427. doi: 10.1109/TPAMI.2023.3277122.
    [14] LI Feng, CHEN Jingxian, and ZHANG Xuejun. A survey of non-autoregressive neural machine translation[J]. Electronics, 2023, 12(13): 2980. doi: 10.3390/electronics12132980.
    [15] CAMGOZ N C, HADFIELD S, KOLLER O, et al. Neural sign language translation[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7784–7793. doi: 10.1109/CVPR.2018.00812.
    [16] ARVANITIS N, CONSTANTINOPOULOS C, and KOSMOPOULOS D. Translation of sign language glosses to text using sequence-to-sequence attention models[C]. 2019 15th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), Sorrento, Italy, 2019: 296–302. doi: 10.1109/SITIS.2019.00056.
    [17] XIE Pan, ZHAO Mengyi, and HU Xiaohui. PiSLTRc: Position-informed sign language transformer with content-aware convolution[J]. IEEE Transactions on Multimedia, 2022, 24: 3908–3919. doi: 10.1109/TMM.2021.3109665.
    [18] CHEN Yutong, WEI Fangyun, SUN Xiao, et al. A simple multi-modality transfer learning baseline for sign language translation[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, USA, 2022: 5110–5120. doi: 10.1109/CVPR52688.2022.00506.
    [19] ZHOU Hao, ZHOU Wengang, QI Weizhen, et al. Improving sign language translation with monolingual data by sign back-translation[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, USA, 2021: 1316–1325. doi: 10.1109/CVPR46437.2021.00137.
    [20] ZHENG Jiangbin, WANG Yile, TAN Cheng, et al. CVT-SLR: Contrastive visual-textual transformation for sign language recognition with variational alignment[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 23141–23150. doi: 10.1109/CVPR52729.2023.02216.
    [21] GU Jiatao, BRADBURY J, XIONG Caiming, et al. Non-autoregressive neural machine translation[C]. 6th International Conference on Learning Representations, Vancouver, Canada, 2018. doi: 10.48550/arXiv.1711.02281.
    [22] WANG Yiren, TIAN Fei, HE Di, et al. Non-autoregressive machine translation with auxiliary regularization[C]. The 33rd AAAI Conference on Artificial Intelligence, Honolulu, USA, 2019: 5377–5384. doi: 10.1609/aaai.v33i01.33015377.
    [23] XIE Pan, LI Zexian, ZHAO Zheng, et al. MvSR-NAT: Multi-view subset regularization for non-autoregressive machine translation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022: 1–10. doi: 10.1109/TASLP.2022.3221043.
    [24] ZHOU Hao, ZHOU Wengang, ZHOU Yun, et al. Spatial-temporal multi-cue network for sign language recognition and translation[J]. IEEE Transactions on Multimedia, 2022, 24: 768–779. doi: 10.1109/TMM.2021.3059098.
    [25] TARRÉS L, GÁLLEGO G I, DUARTE A, et al. Sign language translation from instructional videos[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, Canada, 2023: 5625–5635. doi: 10.1109/CVPRW59228.2023.00596.
    [26] CAMGOZ N C, KOLLER O, HADFIELD S, et al. Multi-channel transformers for multi-articulatory sign language translation[C]. ECCV 2020 Workshops on Computer Vision, Glasgow, UK, 2020: 301–319. doi: 10.1007/978-3-030-66823-5_18.
    [27] FU Biao, YE Peigen, ZHANG Liang, et al. A token-level contrastive framework for sign language translation[C]. 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10095466.
Figures (4) / Tables (7)
Metrics
  • Article views: 518
  • Full-text HTML views: 356
  • PDF downloads: 90
  • Times cited: 0
Publication history
  • Received: 2023-08-01
  • Revised: 2023-12-27
  • Available online: 2024-01-08
  • Issue published: 2024-07-29
