Boost Speech to Text Accuracy
- On July 4, 2022
- speech to text
Natural Language Processing (NLP) is a field that has been gaining momentum in recent years. Its holy grail is to build a machine that understands and responds to text or voice data the same way humans do. NLP has gained momentum for two main, related reasons: (1) the increasing use of Artificial Intelligence (AI) and (2) the increasing processing power available in modern computers. Speech to Text (STT) is an important component of NLP: it is the enabling component when it comes to understanding speech.
One of the main disturbances that dramatically reduces the accuracy of STT is noise. The noise can include the voices of other people as well as non-human sounds such as a car horn. Even when the ambient noise is not loud enough to affect intelligibility for the human ear, it can still dramatically reduce the accuracy of an AI-based STT engine. The cocktail party scenario, in which a listener must follow one speaker among many competing voices, is a good example.
It follows that, to improve STT accuracy, the input audio signal should first be filtered by a noise cancelling application that cleans it of background noise. The STT engine then receives a clean audio signal, and its accuracy is expected to improve. But will this always be the case?
Noise cancelling software can certainly remove the noise, but the more aggressively it is applied, the more distortion it may introduce into the original speech. There is no one-size-fits-all setting: the optimal aggressiveness level can vary with the audio stream and with the specifications of the STT engine. A practical solution is to filter the incoming audio stream at more than one aggressiveness level and keep the result for which the STT output scores highest. How can this be done? To score the output of an STT algorithm, we can check its grammar and structure using any grammar checker. For example, language_check is a Python library that checks a sentence and also suggests corrections. It is also possible to use the textblob library to score the sentence based on sentiment. In addition, one can build a dedicated heuristic and/or take into account “important words” expected in the context.
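The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `denoise`, `transcribe` and `score` are hypothetical callables that you would supply yourself (a noise cancelling filter, an STT engine and a scoring function), and `keyword_score` is one possible "important words" heuristic, not a recommended metric. A grammar checker such as language_check could be plugged in as `score` instead.

```python
from typing import Any, Callable, Iterable, Tuple


def keyword_score(transcript: str, important_words: Iterable[str]) -> float:
    """Hypothetical heuristic: fraction of expected words found in the transcript."""
    found = {w.strip(".,!?;:").lower() for w in transcript.split()}
    expected = {w.lower() for w in important_words}
    if not expected:
        return 0.0
    return len(found & expected) / len(expected)


def pick_best_level(
    audio: Any,
    levels: Iterable[Any],
    denoise: Callable[[Any, Any], Any],   # assumed: (audio, level) -> cleaned audio
    transcribe: Callable[[Any], str],     # assumed: cleaned audio -> transcript
    score: Callable[[str], float],        # assumed: transcript -> higher is better
) -> Tuple[Any, str, float]:
    """Try every aggressiveness level and return (level, transcript, score) of the best one."""
    best = None
    for level in levels:
        transcript = transcribe(denoise(audio, level))
        current = score(transcript)
        # Keep the first level that reaches the highest score seen so far.
        if best is None or current > best[2]:
            best = (level, transcript, current)
    if best is None:
        raise ValueError("at least one aggressiveness level is required")
    return best
```

Because the three callables are passed in, the same loop works with any noise cancelling application and any STT engine. Only the scoring function needs to change when switching between a grammar check, a sentiment score or a context-specific heuristic.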
To conclude, we have seen that pre-filtering the audio stream with a noise cancelling application before running an STT algorithm can increase the algorithm's accuracy. The noise cancelling application should be flexible enough to run multiple aggressiveness levels in parallel so that the optimal level can be selected per scenario. If this processing has to run on a centralized server, the noise cancelling application must be efficient in both CPU and memory usage and should not require GPU resources.
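As a rough illustration of running several aggressiveness levels in parallel, the sketch below uses Python's standard `concurrent.futures` module. The `denoise` and `transcribe` callables are hypothetical placeholders for a noise cancelling filter and an STT engine. A thread pool is assumed to be suitable here because such components typically run in native code or behind a network call; a purely CPU-bound Python implementation would more likely need a process pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, Optional


def transcribe_levels_in_parallel(
    audio: Any,
    levels: Iterable[Any],
    denoise: Callable[[Any, Any], Any],   # assumed: (audio, level) -> cleaned audio
    transcribe: Callable[[Any], str],     # assumed: cleaned audio -> transcript
    max_workers: Optional[int] = None,
) -> Dict[Any, str]:
    """Denoise and transcribe the same audio at each level concurrently.

    Returns a mapping from aggressiveness level to the resulting transcript.
    """
    def run(level: Any):
        return level, transcribe(denoise(audio, level))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `levels`.
        return dict(pool.map(run, levels))
```

The resulting dictionary can then be scored per level to pick the transcript that best fits the scenario.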