My code uses Google's speech API to recognize what a single person says. For example, if I say 'one, two, three' into my microphone, Google's API returns 'you probably said: one, two, three'.
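For reference, my current single-speaker setup looks roughly like this (a minimal sketch in Python using the speech_recognition package; the file name is just a placeholder):

    import speech_recognition as sr

    recognizer = sr.Recognizer()

    # Load a recording of one speaker (placeholder file name)
    with sr.AudioFile("my_recording.wav") as source:
        audio = recognizer.record(source)

    # Ask Google's web speech API for a transcript
    print(recognizer.recognize_google(audio))  # e.g. "one two three"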
But when my brother and I speak into the microphone at the same time, it doesn't work. For example, I say 'one, two, three' and my brother says 'hello, testing, audio' at the same time. Google's API returns only the words of the speaker who spoke louder: if I speak louder than my brother, it returns what I said; if my brother speaks louder than me, it returns what he said. So I want to use an algorithm that detects all the distinct audio sources in an audio file and then processes each source separately with Google's API. It is not necessary to detect who spoke when. For example:
Audio file over time:

I said          --> one     two      three
My brother said --> hello   testing  audio
Time in seconds --> 1   1.5   2   2.5   3   3.5   4   4.5
So any of the following outputs from the algorithm would be acceptable:
audio = one hello testing three audio
or
audio = one two three hello testing audio
or
my_audio = one two three
my_brother_audio = hello testing audio
And finally, I would send each processed audio to Google's API.
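To make the goal concrete, here is a sketch of the kind of pipeline I imagine. I have read this is called blind source separation; the sketch below uses FastICA from scikit-learn, which, as far as I understand, only works when there are at least as many recording channels (microphones) as speakers. All file names are placeholders:

    import numpy as np
    from scipy.io import wavfile
    from sklearn.decomposition import FastICA

    # Load a two-channel recording: two microphones, two speakers (placeholder file)
    rate, mixed = wavfile.read("two_mic_recording.wav")  # mixed.shape == (n_samples, 2)

    # Blind source separation: ICA recovers statistically
    # independent sources from the mixed channels
    ica = FastICA(n_components=2, random_state=0)
    sources = ica.fit_transform(mixed.astype(np.float64))  # (n_samples, 2)

    # Write each separated speaker to its own file, scaled back to 16-bit PCM
    for i in range(sources.shape[1]):
        track = sources[:, i]
        track = np.int16(track / np.max(np.abs(track)) * 32767)
        wavfile.write("speaker_%d.wav" % i, rate, track)

Each speaker_N.wav could then go through recognize_google as in the first snippet. But my recording comes from a single microphone, and as far as I understand ICA cannot separate more sources than channels, so I am not sure this applies. Hence my question: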
How can I do this? What algorithm should I use to make it possible?