AI, can you identify who is speaking?
- On June 13, 2021
- speaker recognition
Each person's voice sounds different and has its own unique characteristics. These characteristics can be divided into linguistic vs. non-linguistic ones, and also into auditory vs. acoustic ones. This fact has been used for decades in forensic cases. When needed, speech is analyzed using different methods to provide forensic evidence. For example, auditory analysis includes a linguistic comparison of samples, whereas acoustic analysis focuses on the differences caused by the structure of each person's vocal tract.
The next evolution in this field was to allow machines to automatically identify the person speaking. The primary use was biometric identification. Voice recognition was added to the biometric toolset together with fingerprints, iris recognition, facial recognition and more. A typical usage scenario goes like this: a customer of an organization calls the call center, and the biometric engine starts learning their voice. Once the biometric engine has gathered enough audio samples of this person, it can be used to identify the customer in new calls, replacing part of the legacy identification methods. In this case, the biometric engine can usually make a decision after listening for a few seconds and notify the agent or system whether the customer was successfully identified.
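The enroll-then-verify flow described above can be sketched roughly as follows. This is a minimal illustration, not how any particular biometric engine works: it assumes each call has already been turned into a fixed-length numeric "embedding" by some voice model (not shown), and the function names, the toy vectors, and the 0.8 threshold are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def enroll(embeddings):
    """Build a voiceprint by averaging the embeddings of several calls."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

def verify(voiceprint, embedding, threshold=0.8):
    """Accept the caller if the new embedding is close enough to the voiceprint.

    The threshold is an assumed value chosen for this toy example.
    """
    return cosine_similarity(voiceprint, embedding) >= threshold

# Toy 3-dimensional embeddings (made up); real ones would come from a voice model.
enrollment_calls = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [1.0, 0.0, 0.3]]
voiceprint = enroll(enrollment_calls)

same_caller = [0.85, 0.15, 0.2]
other_caller = [0.1, 0.9, 0.4]

print(verify(voiceprint, same_caller))   # accepted
print(verify(voiceprint, other_caller))  # rejected
```

Averaging several calls is one simple way to make the stored voiceprint less sensitive to a single noisy recording, which is why the engine needs "enough audio samples" before it can be trusted.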
The next question that comes to mind is: can machines be used to cancel ambient human noise during phone calls? In other words, can the machine learn the characteristics of the person speaking and attenuate all the other people talking in the background? The answer to this question is usually no. The reason lies in the maximum allowed delay. During real-time phone calls the allowed delay is very small, usually a few milliseconds, and in such a short time there are not enough voice characteristics that can be learned and used to distinguish between speakers.
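To get a feel for the gap, here is a back-of-the-envelope calculation. It assumes an 8 kHz sampling rate (common for narrowband telephony), a 5 ms delay budget, and a 3 s identification window; the 5 ms and 3 s figures are illustrative stand-ins for "a few milliseconds" and "a few seconds", not measured values.

```python
def samples_in(duration_seconds, sample_rate_hz=8000):
    """Number of audio samples captured in the given duration."""
    return round(duration_seconds * sample_rate_hz)

# Assumed figures for illustration only.
realtime_budget = samples_in(0.005)      # a few milliseconds of audio
identification_window = samples_in(3.0)  # a few seconds of audio

print(realtime_budget)
print(identification_window)
print(identification_window // realtime_budget)
```

Under these assumptions, the real-time path sees hundreds of times less audio per decision than the identification scenario does.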