Phoneme recognition method based on fusion network with low signal-to-noise ratio
Citation: HUANG Huibo, SHAO Yubin, LONG Hua, DU Qingzhi. Phoneme recognition method based on fusion network with low signal-to-noise ratio[J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2024, 36(4): 786-796
Authors: HUANG Huibo  SHAO Yubin  LONG Hua  DU Qingzhi
Affiliation: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, P. R. China
Funding: Yunnan Provincial Key Laboratory of Media Convergence Project (220235205)
Abstract: To address the low accuracy of phoneme recognition at low signal-to-noise ratios (SNRs), a new recognition method is proposed. First, Fbank features are extracted from the speech and fed into the A-R-B-CTC model, which is built from a multi-head attention mechanism, ResNet, BLSTM, and CTC, for phoneme recognition. Then, Wave-U-Net is used to denoise four speech features as images (Fbank, MFCC, GFCC, and the logarithmic spectrum), and denoising the Fbank features is found to yield a lower phoneme error rate. The method is validated experimentally on the THCHS30 dataset in a 0 dB white-noise environment. The results show that, before Fbank denoising, the proposed A-R-B-CTC model reduces the average phoneme error rate by 4.38%, 2.5%, and 1.96% compared with the BLSTM-CTC, ResNet-BLSTM-CTC, and Transformer models, respectively. After Fbank denoising, the phoneme error rates of all four models drop markedly, and the proposed A-R-B-CTC model still compares favorably with the other three. Good results are also achieved at other SNRs.
Keywords: phoneme recognition  Wave-U-Net  end-to-end  multi-head self-attention  Transformer model
Received: 2023-06-10
Revised: 2024-05-10
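
The abstract refers to two quantities without defining them: a 0 dB white-noise condition and the phoneme error rate. The sketch below illustrates one common way to realize each, assuming Gaussian white noise scaled so that 10·log10(P_signal / P_noise) equals the target SNR, and an error rate computed as the Levenshtein distance between phoneme sequences divided by the reference length. The function names `mix_white_noise` and `phoneme_error_rate` are hypothetical, and the paper's actual noise-mixing and scoring procedures may differ.

```python
import math
import random


def mix_white_noise(signal, snr_db, seed=0):
    """Return signal plus Gaussian white noise scaled to the target SNR (dB).

    Assumes a non-empty, non-silent signal given as a list of floats.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10 * log10(P_signal / P_noise); solve for the noise gain.
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [x + gain * n for x, n in zip(signal, noise)]


def phoneme_error_rate(reference, hypothesis):
    """Edit distance between phoneme sequences divided by reference length.

    Assumes a non-empty reference sequence.
    """
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ph in enumerate(hypothesis, start=1):
            cost = 0 if ref_ph == hyp_ph else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)
```

For example, `mix_white_noise(samples, 0.0)` produces noise with the same average power as the speech, and `phoneme_error_rate(["n", "i3", "h", "ao3"], ["n", "i2", "h", "ao3"])` counts one substitution over four reference phonemes.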