Cocktail party and crosstalk
- On January 28, 2024
- AI, contact center, crosstalk, noise reduction
The cocktail party problem is one of the oldest problems in signal processing. It is described as follows: at a cocktail party, many people are talking at once. Their voices mix in the air, and a listener hears all of them blended together. Nevertheless, in most cases the human brain can focus on the voice of a specific speaker while ignoring all the others; in effect, it estimates an individual source from the mixture of all sources. In electronics this scenario is called crosstalk: a transmitted (voice) signal in one channel creates undesired effects or disruptions in other channels. The first question that comes to mind is how an artificial brain (a.k.a. AI, artificial intelligence) can perform similarly and separate the mix of audio signals back into its different sources (a.k.a. the source separation problem).
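As a minimal sketch of the underlying model (a toy illustration, not how any particular product works): each microphone records a different linear mixture of the independent sources, and source separation means recovering the sources from those mixtures. Here the two "voices" are stand-in sine tones and the mixing matrix is chosen arbitrarily; in the real problem the mixing matrix is unknown, which is exactly what makes separation hard.

```python
import numpy as np

# Two stand-in "voices" (simple tones instead of real speech)
t = np.linspace(0, 1, 8000)
s1 = np.sin(2 * np.pi * 220 * t)
s2 = np.sin(2 * np.pi * 330 * t)
S = np.vstack([s1, s2])           # sources, shape (2, n_samples)

# Each row of A is one microphone's weighting of the two sources
A = np.array([[0.8, 0.4],
              [0.3, 0.9]])
X = A @ S                         # what the two microphones actually record

# If the mixing matrix were known, the sources could be recovered exactly.
# The hard part of the cocktail party problem is that A is unknown.
S_hat = np.linalg.inv(A) @ X
print(np.allclose(S_hat, S))      # → True
```

The point of the sketch is that a single microphone gives only one row of `X`, which is not enough to invert the mixture; multiple inputs are what make recovery possible at all.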
The cocktail party poses a serious challenge to AI algorithms. AI algorithms can efficiently identify and extract a human voice from a noisy environment containing non-human sounds, such as dog barks or music. But when it comes to identifying and extracting a specific human voice from a mixture that contains other human voices, things get complicated for a single-source AI algorithm.
So how does our brain solve the problem? Our brain uses multiple input signals to identify and extract a single conversation from all the rest. In addition to analyzing the audio, it uses information such as the direction of each voice, estimated using our two ears, and it may also use basic lip reading based on input from our eyes. Solving this problem therefore requires more than a single input, and only an artificial brain trained to handle multiple input signals can solve it. The Noise Firewall solves the cocktail party problem using a similar technique: it listens to multiple audio signals coming from multiple directions, correlates them in real time, identifies the physical location of each speaker, builds a Noise Map, and extracts the voice of a single speaker from the voices of all the others.
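The "two ears" idea above can be sketched in a few lines (a hypothetical illustration, not the Noise Firewall's actual algorithm): the same sound reaches two microphones with a small time offset, and the lag of the cross-correlation peak between the two recordings reveals that offset, which in turn indicates the speaker's direction.

```python
import numpy as np

# A random wideband "voice" signal reaching two microphones
rng = np.random.default_rng(1)
source = rng.standard_normal(2048)

delay = 25                        # true delay in samples between the two mics
mic1 = source
mic2 = np.concatenate([np.zeros(delay), source[:-delay]])

# The lag of the cross-correlation maximum estimates the inter-mic delay
corr = np.correlate(mic2, mic1, mode="full")
lag = np.argmax(corr) - (len(mic1) - 1)
print(lag)                        # → 25
```

Given the microphone spacing and the speed of sound, a lag like this maps to an angle of arrival, which is how a correlation across several microphones can place each speaker at a physical location.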
At this point we would like to distinguish the cocktail party problem from a scenario in which a main voice is significantly louder than the secondary or background voices. The latter scenario is found at call centers, where the agents sit at some distance from each other and use headsets, so the primary agent is heard on the call much more loudly than any secondary or background agent. In this scenario, an AI brain with a single input source can focus on the primary speaker by relying on the significant volume difference between the primary speaker and the secondary speakers.
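A crude sketch of this volume-difference approach (the function name and threshold are illustrative assumptions, not a real product API): split the channel into short frames and mute any frame whose level falls well below the loudest frame, on the assumption that such frames contain only background talk.

```python
import numpy as np

def gate_quiet_frames(signal, frame_len=160, ratio=0.25):
    """Mute frames whose RMS is below `ratio` times the loudest frame's RMS."""
    out = signal.copy()
    n_frames = len(out) // frame_len
    frames = out[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    frames[rms < ratio * rms.max()] = 0.0   # treat quiet frames as background
    return out

# Loud primary speech followed by a much quieter background voice
primary = 1.0 * np.sin(np.linspace(0, 50, 1600))
background = 0.05 * np.sin(np.linspace(0, 50, 1600))
mixed = np.concatenate([primary, background])

cleaned = gate_quiet_frames(mixed)
print(np.abs(cleaned[1600:]).max())   # → 0.0  (background segment muted)
```

This only works because of the large level gap; when the competing voices are at similar volumes, as at a real cocktail party, a level threshold cannot tell them apart and the multi-input approach described above is needed.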
To summarize, an artificial intelligence algorithm that listens to a single audio channel can easily extract a human voice from a noisy channel that contains non-human sounds. But in complex scenarios like a cocktail party, or any other crowded location where multiple people are talking, a more advanced algorithm is required. Such an algorithm, like our brain, should be able to receive multiple input signals and use them to identify and extract a single human voice from the rest.