
Non-Autoregressive Sign Language Translation Technology Based on Transformer and Multimodal Alignment

SHAO Shuyu, DU Yao, FAN Xiaoli

Citation: SHAO Shuyu, DU Yao, FAN Xiaoli. Non-Autoregressive Sign Language Translation Technology Based on Transformer and Multimodal Alignment[J]. Journal of Electronics & Information Technology, 2024, 46(7): 2932-2941. doi: 10.11999/JEIT230801


doi: 10.11999/JEIT230801
Funds: The National Natural Science Foundation of China (8210072143), R&D Program of Beijing Municipal Education Commission (KM202210037001)
Details
    Author biographies:

    SHAO Shuyu: Male, Associate Professor. His research interests include signal processing and reliability analysis of complex systems.

    DU Yao: Male, Ph.D. candidate. His research interest is pattern recognition.

    FAN Xiaoli: Female, Senior Engineer. Her research interests include biomedical signal processing and pattern recognition.

    Corresponding author:

    SHAO Shuyu, shaoshuyu@bwu.edu.cn

  • CLC number: TN108.4; TP391

  • Abstract: To address the problems of multimodal data alignment and the slow speed of sign language translation, this paper proposes a non-autoregressive sign language translation model (Trans-SLT-NA) based on the self-attention model Transformer, and introduces a contrastive learning loss function to align the multimodal data. By learning the contextual information of the input sequence (sign language video) and the target sequence (text), as well as the interaction between them, the model translates sign language into natural language in a single pass. The proposed model is evaluated on the public datasets PHOENIX-2014T (German), CSL-Daily (Chinese), and How2Sign (English). The results show that the proposed method speeds up translation by a factor of 11.6 to 17.6 over autoregressive models, while remaining close to them on the BiLingual Evaluation Understudy (BLEU-4) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics.
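The page carries no code, but the abstract's central idea, decoding every target token in one forward pass instead of autoregressively one token at a time, is easy to sketch. Below is a minimal PyTorch illustration of such non-autoregressive decoding; all names (NARDecoderSketch, the slot embeddings, the fixed target_len) are illustrative assumptions rather than the authors' Trans-SLT-NA implementation, which in practice would also predict the target length from the video encoding.

```python
# Minimal sketch of one-pass (non-autoregressive) Transformer decoding.
# Hypothetical shapes and names; not the paper's Trans-SLT-NA code.
import torch
import torch.nn as nn

class NARDecoderSketch(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, max_len=64):
        super().__init__()
        # Learned "slot" embeddings replace the shifted target tokens that
        # an autoregressive decoder would consume step by step.
        self.slots = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, video_memory, target_len):
        # video_memory: (batch, src_len, d_model) from the video encoder.
        batch = video_memory.size(0)
        positions = torch.arange(target_len, device=video_memory.device)
        queries = self.slots(positions).unsqueeze(0).expand(batch, -1, -1)
        # No causal mask: every target position attends to the full video
        # memory and to all other positions, so one forward pass suffices.
        hidden = self.decoder(queries, video_memory)
        return self.proj(hidden)  # (batch, target_len, vocab_size)

video_memory = torch.randn(2, 120, 256)   # e.g. two clips of 120 encoded frames
logits = NARDecoderSketch()(video_memory, target_len=20)
tokens = logits.argmax(dim=-1)            # all 20 tokens emitted at once
```

The absence of a step-by-step dependency is what buys the 11.6x to 17.6x speedups reported below, at some cost in BLEU-4 relative to the strongest autoregressive baselines.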
  • Figure 1  Transformer-based framework for continuous sign language recognition and translation

    Figure 2  Overall architecture of the Trans-SLT-NA model

    Figure 3  Structure of the video encoder

    Figure 4  t-SNE visualization of the video representation vectors and text vectors

Table 1  Datasets used for model training

| Dataset | Language | Train | Dev | Test | Total |
|---|---|---|---|---|---|
| PHOENIX-2014T | German | 7,096 | 519 | 642 | 8,257 |
| CSL-Daily | Chinese | 18,401 | 1,077 | 1,176 | 20,654 |
| How2Sign | English | 31,128 | 1,741 | 2,322 | 35,191 |

Table 2  Results on the PHOENIX-2014T dataset

| Method | Generation | Dev BLEU-4 | Dev ROUGE | Test BLEU-4 | Test ROUGE | Inference speed |
|---|---|---|---|---|---|---|
| RNN-based [15] | AR | 9.94 | 31.8 | 9.58 | 31.8 | 2.3× |
| SLTR-T [5] | AR | 20.69 | – | 20.17 | – | 1.0× |
| Multi-C [26] | AR | 19.51 | 44.59 | 18.51 | 43.57 | – |
| STMC-T [24] | AR | 24.09 | 48.24 | 23.65 | 46.65 | – |
| PiSLTRc [17] | AR | 21.48 | 47.89 | 21.29 | 48.13 | 0.92× |
| Trans-SLT-NA | NAR | 18.81 | 47.32 | 19.03 | 48.22 | 11.6× |

Note: AR denotes autoregressive generation; NAR denotes non-autoregressive generation.

Table 3  Comparison results on the CSL-Daily dataset

| Method | Generation | Dev BLEU-4 | Dev ROUGE | Test BLEU-4 | Test ROUGE | Inference speed |
|---|---|---|---|---|---|---|
| SLTR-T [5] | AR | 11.88 | 37.06 | 11.79 | 36.74 | 1.0× |
| Sign Back-Tran [19] | AR | 20.80 | 49.49 | 21.34 | 49.31 | 0.89× |
| ConSLT [27] | AR | 14.80 | 41.46 | 14.53 | 40.98 | – |
| Trans-SLT-NA | NAR | 16.22 | 43.74 | 16.72 | 44.67 | 13.4× |

Table 4  Comparison results on the How2Sign dataset

| Method | Generation | Dev BLEU-4 | Dev ROUGE | Test BLEU-4 | Test ROUGE | Inference speed |
|---|---|---|---|---|---|---|
| Baseline | AR | 8.89 | – | 8.03 | – | 1.0× |
| Trans-SLT-NA | NAR | 8.14 | 32.84 | 8.58 | 33.17 | 17.6× |

Table 5  Effectiveness of multimodal data alignment (model: Trans-SLT-NA)

| Dataset | Data alignment | Dev BLEU-4 | Dev ROUGE | Test BLEU-4 | Test ROUGE |
|---|---|---|---|---|---|
| PHOENIX-2014T | w | 18.81 | 47.32 | 19.03 | 48.22 |
| PHOENIX-2014T | w/o | 16.02 | 43.21 | 15.97 | 42.85 |
| CSL-Daily | w | 16.22 | 43.74 | 16.72 | 44.67 |
| CSL-Daily | w/o | 14.43 | 42.27 | 15.21 | 42.84 |
| How2Sign | w | 8.14 | 32.84 | 8.58 | 33.17 |
| How2Sign | w/o | 7.81 | 30.16 | 8.23 | 30.59 |

Note: "w" denotes training with data alignment; "w/o" denotes training without it.
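Table 5 shows the contrastive alignment of the two modalities is worth roughly 1 to 3 BLEU-4 points on every dataset. The page does not reproduce the loss itself; the sketch below assumes a standard InfoNCE-style formulation over matched video/text embedding pairs, with every name hypothetical.

```python
# Hedged sketch of a contrastive alignment loss between pooled video and
# text embeddings; an assumption about the loss family, not necessarily
# the paper's exact formulation.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, d); row i of each is a matched pair.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: pull matched video/text pairs together in
    # the shared space, push mismatched pairs apart.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Figure 4's t-SNE plot of the video representation vectors and text vectors is presumably a visualization of this shared embedding space.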

Table 6  Effect of the spatial Embedding backbone on model performance

| Spatial Embedding | Pretrained | Dev BLEU-4 | Dev ROUGE | Test BLEU-4 | Test ROUGE |
|---|---|---|---|---|---|
| VGG-19 | w/o | 14.42 | 38.76 | 14.36 | 39.17 |
| ResNet-50 | w/o | 15.57 | 40.26 | 15.33 | 41.17 |
| EfficientNet-B0 | w/o | 16.32 | 40.11 | 16.04 | 41.27 |
| VGG-19 | w | 16.84 | 43.31 | 16.17 | 42.09 |
| ResNet-50 | w | 17.79 | 45.63 | 16.93 | 44.53 |
| EfficientNet-B0 | w | 18.81 | 47.32 | 19.03 | 48.22 |
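Table 6 varies only the spatial Embedding backbone that turns each video frame into a feature vector, with pretraining ("w" rows) clearly helping. A hedged torchvision sketch of such a frame-wise extractor, where the pooling and exact wiring are illustrative assumptions:

```python
# Hypothetical frame-wise spatial feature extractor in the spirit of
# Table 6. weights=None matches the "w/o pretraining" rows; passing
# weights="IMAGENET1K_V1" would correspond to the pretrained setting.
import torch
import torchvision

backbone = torchvision.models.efficientnet_b0(weights=None).features
frames = torch.randn(16, 3, 224, 224)       # 16 RGB video frames
feat = backbone(frames).mean(dim=(2, 3))    # (16, 1280) pooled per-frame features
```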

Table 7  Effect of the loss-function hyperparameters on model performance

| $\lambda_{\mathrm{p}}$ | $\lambda_{\mathrm{c}}$ | Dev BLEU-4 | Dev ROUGE | Test BLEU-4 | Test ROUGE |
|---|---|---|---|---|---|
| 1.0 | 0 | 16.02 | 43.21 | 15.97 | 42.85 |
| 0.8 | 0.2 | 17.37 | 44.89 | 16.87 | 42.46 |
| 0.5 | 0.5 | 18.81 | 47.32 | 19.03 | 48.22 |
| 0.2 | 0.8 | 18.04 | 46.17 | 18.26 | 47.10 |
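Table 7 sweeps the weights of the prediction loss ($\lambda_{\mathrm{p}}$) and the contrastive alignment loss ($\lambda_{\mathrm{c}}$), with the balanced 0.5/0.5 setting performing best. Assuming the straightforward weighted sum these hyperparameters imply (the loss equation itself is not reproduced on this page):

$$ \mathcal{L} = \lambda_{\mathrm{p}}\,\mathcal{L}_{\mathrm{pred}} + \lambda_{\mathrm{c}}\,\mathcal{L}_{\mathrm{contrast}} $$

Consistently, the first row ($\lambda_{\mathrm{c}} = 0$) reproduces the PHOENIX-2014T "w/o alignment" figures of Table 5.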
  • [1] YAN Siyi, XUE Wanli, and YUAN Tiantian. Survey of sign language recognition and translation[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(11): 2415–2429. doi: 10.3778/j.issn.1673-9418.2205003.
    [2] TAO Tangfei and LIU Tianyu. A survey of sign language recognition technology based on sign language expression content and expression characteristics[J]. Journal of Electronics & Information Technology, 2023, 45(10): 3439–3457. doi: 10.11999/JEIT221051.
    [3] DUARTE A, PALASKAR S, VENTURA L, et al. How2Sign: A large-scale multimodal dataset for continuous American sign language[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 2734–2743. doi: 10.1109/CVPR46437.2021.00276.
    [4] ZHOU Leyuan, ZHANG Jianhua, YUAN Tiantian, et al. Sequence-to-sequence Chinese continuous sign language recognition and translation with multilayer attention mechanism fusion[J]. Computer Science, 2022, 49(9): 155–161. doi: 10.11896/jsjkx.210800026.
    [5] CAMGÖZ N C, KOLLER O, HADFIELD S, et al. Sign language transformers: Joint end-to-end sign language recognition and translation[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, 2020: 10020–10030. doi: 10.1109/CVPR42600.2020.01004.
    [6] HUANG Jie, ZHOU Wengang, ZHANG Qilin, et al. Video-based sign language recognition without temporal segmentation[C]. 32nd AAAI Conference on Artificial Intelligence, New Orleans, USA, 2018: 2257–2264. doi: 10.1609/aaai.v32i1.11903.
    [7] ZHOU Hao, ZHOU Wengang, and LI Houqiang. Dynamic pseudo label decoding for continuous sign language recognition[C]. 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 2019: 1282–1287. doi: 10.1109/ICME.2019.00223.
    [8] SONG Peipei, GUO Dan, XIN Haoran, et al. Parallel temporal encoder for sign language translation[C]. 2019 IEEE International Conference on Image Processing (ICIP), Taipei, China, 2019: 1915–1919. doi: 10.1109/ICIP.2019.8803123.
    [9] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [10] LU Fei, HAN Xiangzu, CHENG Xianpeng, et al. Sign language recognition based on lightweight 3D CNNs and transformer[J]. Journal of Huazhong University of Science and Technology (Natural Science Edition), 2023, 51(5): 13–18. doi: 10.13245/j.hust.230503.
    [11] WANG Hongyu, MA Shuming, DONG Li, et al. DeepNet: Scaling transformers to 1,000 layers[EB/OL]. https://arxiv.org/abs/2203.00555, 2022.
    [12] KISHORE P V V, KUMAR D A, SASTRY A S C S, et al. Motionlets matching with adaptive kernels for 3-D Indian sign language recognition[J]. IEEE Sensors Journal, 2018, 18(8): 3327–3337. doi: 10.1109/JSEN.2018.2810449.
    [13] XIAO Yisheng, WU Lijun, GUO Junliang, et al. A survey on non-autoregressive generation for neural machine translation and beyond[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(10): 11407–11427. doi: 10.1109/TPAMI.2023.3277122.
    [14] LI Feng, CHEN Jingxian, and ZHANG Xuejun. A survey of non-autoregressive neural machine translation[J]. Electronics, 2023, 12(13): 2980. doi: 10.3390/electronics12132980.
    [15] CAMGOZ N C, HADFIELD S, KOLLER O, et al. Neural sign language translation[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7784–7793. doi: 10.1109/CVPR.2018.00812.
    [16] ARVANITIS N, CONSTANTINOPOULOS C, and KOSMOPOULOS D. Translation of sign language glosses to text using sequence-to-sequence attention models[C]. 2019 15th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), Sorrento, Italy, 2019: 296–302. doi: 10.1109/SITIS.2019.00056.
    [17] XIE Pan, ZHAO Mengyi, and HU Xiaohui. PiSLTRc: Position-informed sign language transformer with content-aware convolution[J]. IEEE Transactions on Multimedia, 2022, 24: 3908–3919. doi: 10.1109/TMM.2021.3109665.
    [18] CHEN Yutong, WEI Fangyun, SUN Xiao, et al. A simple multi-modality transfer learning baseline for sign language translation[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, USA, 2022: 5110–5120. doi: 10.1109/CVPR52688.2022.00506.
    [19] ZHOU Hao, ZHOU Wengang, QI Weizhen, et al. Improving sign language translation with monolingual data by sign back-translation[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, USA, 2021: 1316–1325. doi: 10.1109/CVPR46437.2021.00137.
    [20] ZHENG Jiangbin, WANG Yile, TAN Cheng, et al. CVT-SLR: Contrastive visual-textual transformation for sign language recognition with variational alignment[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 23141–23150. doi: 10.1109/CVPR52729.2023.02216.
    [21] GU Jiatao, BRADBURY J, XIONG Caiming, et al. Non-autoregressive neural machine translation[C]. 6th International Conference on Learning Representations, Vancouver, Canada, 2018. doi: 10.48550/arXiv.1711.02281.
    [22] WANG Yiren, TIAN Fei, HE Di, et al. Non-autoregressive machine translation with auxiliary regularization[C]. The 33rd AAAI Conference on Artificial Intelligence, Honolulu, USA, 2019: 5377–5384. doi: 10.1609/aaai.v33i01.33015377.
    [23] XIE Pan, LI Zexian, ZHAO Zheng, et al. MvSR-NAT: Multi-view subset regularization for non-autoregressive machine translation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022: 1–10. doi: 10.1109/TASLP.2022.3221043.
    [24] ZHOU Hao, ZHOU Wengang, ZHOU Yun, et al. Spatial-temporal multi-cue network for sign language recognition and translation[J]. IEEE Transactions on Multimedia, 2022, 24: 768–779. doi: 10.1109/TMM.2021.3059098.
    [25] TARRÉS L, GÁLLEGO G I, DUARTE A, et al. Sign language translation from instructional videos[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, Canada, 2023: 5625–5635. doi: 10.1109/CVPRW59228.2023.00596.
    [26] CAMGOZ N C, KOLLER O, HADFIELD S, et al. Multi-channel transformers for multi-articulatory sign language translation[C]. ECCV 2020 Workshops on Computer Vision, Glasgow, UK, 2020: 301–319. doi: 10.1007/978-3-030-66823-5_18.
    [27] FU Biao, YE Peigen, ZHANG Liang, et al. A token-level contrastive framework for sign language translation[C]. 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10095466.
Figures (4) / Tables (7)
Metrics
  • Article views: 518
  • Full-text HTML views: 356
  • PDF downloads: 90
  • Times cited: 0
Publication history
  • Received: 2023-08-01
  • Revised: 2023-12-27
  • Available online: 2024-01-08
  • Issue published: 2024-07-29
