A Robust Dynamic Mouth Feature Based on Visemic LDA for Audio-Visual Speech Recognition
Abstract: Visual feature extraction is an active research topic in audio-visual speech recognition. This paper presents a robust dynamic mouth feature based on Visemic LDA. The feature captures the changes in lip contour during articulation and reflects the viseme classes of visual speech. The paper also introduces a method that automatically labels LDA training data from speech recognition results, which avoids tedious manual labeling and the labeling errors that come with it. Experiments show that adding the Visemic LDA visual feature to an audio-visual speech recognition system greatly increases the recognition rate in noisy conditions. When the feature is combined with a multi-stream HMM, the recognition rate still exceeds 80% under strong noise at an SNR of 10 dB.
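The abstract does not spell out how the automatic labeling works, so the following is only a minimal sketch of one plausible reading: a speech recognizer yields time-aligned phone segments, each phone is mapped to a viseme class, and every video frame takes the viseme label of the segment that covers its centre time. The table `PHONE_TO_VISEME`, the function `label_video_frames`, the 25 fps frame rate, and the class names are all hypothetical and are not taken from the paper.

```python
# Hypothetical phone-to-viseme table; the paper's actual viseme classes
# are not listed in the abstract.
PHONE_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "a": "open", "o": "rounded", "u": "rounded",
    "sil": "silence",
}


def label_video_frames(segments, num_frames, fps=25.0, default="silence"):
    """Assign one viseme label to each video frame.

    segments: list of (phone, start_sec, end_sec) tuples, assumed to come
              from the time alignment of a speech recognizer.
    Frames not covered by any segment, and phones missing from the table,
    receive the `default` label.
    """
    labels = [default] * num_frames
    for phone, start, end in segments:
        viseme = PHONE_TO_VISEME.get(phone, default)
        for i in range(num_frames):
            centre = (i + 0.5) / fps  # centre time of frame i in seconds
            if start <= centre < end:
                labels[i] = viseme
    return labels


if __name__ == "__main__":
    alignment = [("sil", 0.0, 0.08), ("b", 0.08, 0.16), ("a", 0.16, 0.32)]
    print(label_video_frames(alignment, num_frames=8))
```

Frame labels produced this way could then serve as the class labels when estimating the LDA projection, in place of hand annotation.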