
ASR in a Noisy Environment
- On September 19, 2023
- AI, ASR, contact center
Automatic Speech Recognition (ASR) is a powerful technology that uses advanced machine learning and artificial intelligence techniques. ASR can improve the quality of service while reducing its cost. Modern ASR engines are trained to recognize different languages and dialects, and they can also be trained on industry-specific vocabulary. Overall, modern ASR engines are very accurate, provided that they are used in a relatively quiet environment. Unfortunately, in many cases the speech is recorded in a noisy environment, and the accuracy of the ASR engine drops as a result. In this post we will discuss a few such cases and possible resolutions.

Smart TV, IVR and more
Let’s take a look at the following scenario: the TV is on, playing some show, and a dog is barking in the background. Our user is holding the remote control and trying to say a voice command. In this scenario, in addition to the user's voice, the microphone also picks up the sound coming from the TV and the barking of the dog. To reduce the noise and improve the SNR (signal-to-noise ratio), let’s review the two sources of noise:
- Sound coming from the TV. The TV sound captured by the remote can be viewed as echo. A robust echo canceller that can handle volatile parameters should be able to remove the TV sound from the audio captured by the microphone.
- Sound of the dog barking. This sound can be attenuated by an advanced noise reduction algorithm that distinguishes between human voices and non-human sounds and attenuates the latter.
This Smart TV scenario also applies to other cases, such as a kiosk, an intercom, or a call entering an IVR from a noisy street.
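To make the echo-cancellation idea concrete, here is a minimal sketch of a textbook normalized LMS (NLMS) adaptive filter. It is an illustration only, not the algorithm of any particular product. It assumes the device has access to the TV audio as a reference signal, and everything in the demo (the synthetic signals, the three-tap `echo_path`, the step size `mu`) is a hypothetical choice made for the example.

```python
import math
import random


def nlms_echo_cancel(mic, ref, taps=8, mu=0.1, eps=1e-8):
    """Subtract an adaptive estimate of the echo of `ref` from `mic` (NLMS)."""
    w = [0.0] * taps    # current estimate of the echo path
    buf = [0.0] * taps  # most recent reference samples, newest first
    cleaned = []
    for d, x in zip(mic, ref):
        buf = [x] + buf[:-1]
        echo_estimate = sum(wi * bi for wi, bi in zip(w, buf))
        e = d - echo_estimate  # residual: speech plus whatever echo is left
        norm = sum(b * b for b in buf) + eps
        # Normalized update: step size scales with the reference energy.
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        cleaned.append(e)
    return cleaned


def power(x):
    return sum(v * v for v in x) / len(x)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(power(speech) / power(noise))


# Synthetic demo (all values are assumptions made for illustration).
rng = random.Random(0)
n = 4000
tv = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # reference: TV audio
echo_path = [0.6, 0.3, 0.1]                      # hypothetical room response
echo = [sum(h * tv[i - k] for k, h in enumerate(echo_path) if i - k >= 0)
        for i in range(n)]
speech = [0.3 * math.sin(2 * math.pi * 0.01 * i) for i in range(n)]
mic = [s + e for s, e in zip(speech, echo)]      # what the remote captures

cleaned = nlms_echo_cancel(mic, tv)

# Measure on the second half, after the filter has had time to adapt.
half = n // 2
residual = [c - s for c, s in zip(cleaned[half:], speech[half:])]
snr_before = snr_db(speech[half:], echo[half:])
snr_after = snr_db(speech[half:], residual)
print(f"SNR before: {snr_before:.1f} dB, after: {snr_after:.1f} dB")
```

A real echo canceller also has to deal with long and changing echo paths, clock drift, and the user talking while the TV is playing, which is why the text above asks for a robust implementation rather than this bare filter.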
Call Center
In a call center environment, both the complexity and the importance of noise removal are higher, since the main source of noise is the voices of other agents talking in the background. As a result, not only can the ambient noise not be identified and removed as a non-human sound, but these ambient voices also dramatically reduce the accuracy of voice analytics, since no ASR or emotion detection engine can distinguish between the voice of the primary agent and the voice of the agent sitting next to him or her. To resolve this problem, you should use a technology, like the Noise-Firewall, that can remove ambient human voices. You are invited to take a look at this case study to experience the difference.
Conclusion
If you want to deploy ASR in a noisy environment, it is not enough to implement a best-in-class ASR engine. Regardless of the quality of the engine, if you feed it noisy input you should expect low-quality results (“garbage in, garbage out”). In such a case, you should first analyze the sources of noise and then consider using a technology that eliminates the noise before the audio is provided as input to the ASR.
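The "clean first, recognize second" ordering can be sketched as a small chain of pre-processing stages that runs before the audio reaches the recognizer. The function names and the two toy stages below are hypothetical placeholders. A real deployment would plug in echo cancellation, noise reduction, or ambient-voice removal chosen after analyzing the actual noise sources.

```python
from typing import Callable, List, Sequence

Samples = List[float]
Stage = Callable[[Samples], Samples]


def run_front_end(audio: Sequence[float], stages: Sequence[Stage]) -> Samples:
    """Apply each noise-removal stage in order and return the cleaned audio."""
    cleaned = list(audio)
    for stage in stages:
        cleaned = stage(cleaned)
    return cleaned


def remove_dc_offset(x: Samples) -> Samples:
    """Toy stage: subtract the mean so the signal is centred on zero."""
    mean = sum(x) / len(x) if x else 0.0
    return [v - mean for v in x]


def noise_gate(x: Samples, threshold: float = 0.05) -> Samples:
    """Toy stage: zero out samples whose magnitude is below the threshold."""
    return [v if abs(v) >= threshold else 0.0 for v in x]


# The cleaned samples, not the raw capture, are what would be passed to the ASR.
cleaned = run_front_end([0.52, 0.51, 0.9, 0.1, 0.47],
                        [remove_dc_offset, noise_gate])
```

Keeping the stages as a plain list makes it easy to reorder or swap them once the dominant noise source is known.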