Adaptively Reserved Likelihood Ratio-based Robust Voice Activity Detection with Sub-band Double Features
doi: 10.11999/JEIT160157
Supported by the National Natural Science Foundation of China (61571192), the Science and Technology Foundation of Guangdong Province (2015A010103003), and the Fundamental Research Funds for the Central Universities, SCUT (2015ZM143)
Keywords:
- Voice activity detection
- Likelihood ratio
- Low signal-to-noise ratio
- Sub-band zero-crossing rate
Abstract: To further improve the accuracy of Voice Activity Detection (VAD) at low Signal-to-Noise Ratio (SNR), this paper proposes an adaptive reserved likelihood ratio VAD algorithm based on sub-band double features. The algorithm uses two features, the sub-band normalized maximum autocorrelation function and the sub-band normalized average zero-crossing rate, to set the reserved weight of the likelihood ratio of each frequency component. The reserved threshold of the likelihood ratio is estimated adaptively from the VAD decisions made over a fixed-length window of the immediate past and from the corresponding sub-band feature parameters. Experiments show that, compared with the original reserved likelihood ratio algorithm, VAD accuracy improves by 1.2%, 7.2%, and 8.1% in stationary white noise at 10 dB, 0 dB, and -10 dB, respectively, and by 1.6% and 3.4% in non-stationary Babble noise at 10 dB and 0 dB, respectively. When the algorithm is used in a 2.4 kbps low-bit-rate vocoder system, the Perceptual Evaluation of Speech Quality (PESQ) score of the synthesized speech improves over the original vocoder system by 0.098~0.153 in white noise and by 0.157~0.186 in Babble noise.
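The two sub-band features named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's definitions, so everything here is an assumption for illustration: the FFT-masking band split, the lag range, the normalizations, and the function names `split_subbands`, `normalized_max_autocorr`, and `zero_crossing_rate` are hypothetical and are not the authors' implementation.

```python
import numpy as np

def split_subbands(frame, n_bands):
    """Split a frame into n_bands band-limited signals by masking FFT bins.

    Assumed scheme for illustration; the paper's filter bank is not specified here.
    """
    spectrum = np.fft.rfft(frame)
    edges = np.linspace(0, len(spectrum), n_bands + 1).astype(int)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = np.zeros_like(spectrum)
        masked[lo:hi] = spectrum[lo:hi]
        bands.append(np.fft.irfft(masked, n=len(frame)))
    return np.array(bands)

def normalized_max_autocorr(x, min_lag, max_lag):
    """Largest autocorrelation over lags [min_lag, max_lag], divided by zero-lag energy."""
    energy = np.dot(x, x)
    if energy < 1e-12:  # treat numerically empty bands as having no periodicity
        return 0.0
    values = [np.dot(x[:-lag], x[lag:]) for lag in range(min_lag, max_lag + 1)]
    return float(max(values) / energy)

def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(x)
    return float(np.mean(signs[1:] != signs[:-1]))

# Demo on a synthetic frame: a 200 Hz tone plus white noise at an assumed 8 kHz rate.
rng = np.random.default_rng(0)
n = np.arange(320)
frame = np.sin(2 * np.pi * n / 40 + 0.1) + 0.5 * rng.standard_normal(320)
bands = split_subbands(frame, n_bands=4)
features = [
    (normalized_max_autocorr(b, min_lag=20, max_lag=160), zero_crossing_rate(b))
    for b in bands
]
for index, (acf, zcr) in enumerate(features):
    print(f"band {index}: max autocorr = {acf:.3f}, zero-crossing rate = {zcr:.3f}")
```

The intuition behind pairing the features is that voiced speech tends to produce a high autocorrelation peak and a low zero-crossing rate in the bands it occupies, while broadband noise tends to do the opposite. How the paper maps these values to reserved weights and thresholds is not stated in the abstract and is left out of this sketch.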