
Google's Speech API offers speech-to-text in multiple languages, including Turkish. Turkish is an interesting language because it is so-called agglutinative: you stick word parts one after another instead of using prepositions and similar function words as in English. This leads to a practically unlimited vocabulary.
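To make the "unlimited vocabulary" point concrete, here is a toy sketch (it ignores Turkish vowel harmony and real morphotactics) showing how a handful of stems and suffix slots multiply into many distinct surface forms:

```python
from itertools import product

# Toy illustration, not real Turkish morphology: each suffix slot
# multiplies the number of distinct word forms per stem.
stems = ["ev", "göz", "yol"]          # house, eye, road
plural = ["", "ler"]                  # plural marker
possessive = ["", "im", "in", "i"]    # my / your / his-her
case = ["", "de", "den", "e"]         # locative / ablative / dative

forms = {s + p + o + c for s, p, o, c in product(stems, plural, possessive, case)}
print(len(forms))  # 3 stems x 2 x 4 x 4 = 96 distinct word forms
```

With just three stems and three suffix slots you already get 96 forms; real Turkish has far more slots, so a word-level dictionary can never cover them all.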

Do you know how Google implemented Turkish speech recognition for their API? I find it hard to believe they used the same techniques as for English.

UPDATE

Here's an example transcript that the Google API returned from the following clip on YouTube:

you would have to ask him I have no clue Yahoo answers I was Adam Scott really in Jumanji in The Truman Show I looked him up on iTunes it said under movies her is in was Jumanji and The Truman Show I don't * * * * believe it will listen I'm not in either of those movies so yeah you really shouldn't * * * *

I think the quality of the transcription is excellent. I played the clip through my beautiful AudioEngine monitors and put a crappy 20-year-old Labtec computer mic in front of them. A truly amateur setup, but that's how these things will be used in practice, i.e. in less-than-ideal conditions.

Here's an example from a Turkish movie scene:

merhaba Temmuz Ben hoş geldin kardeş e nasılsınız keyifler iyidir inşallah İyi valla koşturuyoruz nasıl olsun Hem kardeş lafı uzatmadan konuya girsek anlattı bana ikinci el işçiliği Tabii sen güzel bir şey yapıyor Dernek falan da işte ilişkin bir delikanlı eve gelip gidiyor

This one is basically incomprehensible. The API picks up some words here and there, but it's hard to connect them, unlike in the English example.

Does this mean that Google is not using a custom solution for Turkish? Maybe they went for repurposing their English-language engines for Turkish?

Just for fun, I sent it a clip from an Azeri speaker. His speech is clearly enunciated, but the API barely got a few words. I used the Turkish setting, so it's not entirely fair, but the languages are similar:

o akşam Çağlayan Doruk sevgilin kim bu kim baktı Bülent Serttaş çok pis
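For reference, the recognition language is just a parameter of the request. A minimal sketch of the JSON body for Google Cloud Speech-to-Text's `speech:recognize` endpoint (the `gs://` URI is a placeholder, and authentication/transport are omitted):

```python
import json

def build_request(language_code, audio_uri):
    """Sketch of a Speech-to-Text v1 recognize request body.
    The audio URI below is a placeholder, not a real bucket."""
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "languageCode": language_code,  # e.g. "tr-TR" vs "en-US"
        },
        "audio": {"uri": audio_uri},
    }

payload = build_request("tr-TR", "gs://my-bucket/clip.wav")
print(json.dumps(payload, indent=2))
```

Switching between the English and Turkish experiments above amounts to changing `languageCode`; whatever engine sits behind each code is opaque to the caller.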

Aksakal
  • A Google Scholar search https://scholar.google.com/scholar?hl=en&as_sdt=0%2C47&q=speech+recognition+turkish&btnG= doesn't reveal much that specifically addresses the peculiarities of Turkish, but tucked away in the search are a number of articles about language-agnostic speech recognition using machine learning. It seems plausible that Google combined a technique to agnostically extract phonemes with a method to transcribe the same into a particular language. – Sycorax Mar 23 '18 at 20:32
  • [Linguistics.SE](https://linguistics.stackexchange.com/) may be more helpful here, specifically their [speech-recognition tag](https://linguistics.stackexchange.com/questions/tagged/speech-recognition). – Stephan Kolassa Mar 23 '18 at 20:40
  • (+1) There is a potentially interesting *sociological* question here, too, insofar as in my anecdotal experience, Turkish engineers/researchers are quite overrepresented in many of the leading machine-learning speech-recognition teams in industry. – cardinal Mar 23 '18 at 21:20
  • If you want to compare the word error rate of different APIs for speech recognition: https://github.com/Franck-Dernoncourt/ASR_benchmark – Franck Dernoncourt Apr 06 '18 at 22:33

1 Answer


What is used in production is often not disclosed. I'm not aware of Google disclosing how the automated speech recognition (ASR) system they currently use in production works. One way to approximate it would be to scan the ICASSP/Interspeech/etc. proceedings for Google publications.

Anyway, putting Google aside: the question can be generalized as "How do you perform ASR in languages with large or open-ended vocabularies?".

One way to do so is to use sub-word language modeling, e.g. from {1}:

Abstract: In this study, some solutions for out of vocabulary (OOV) word problem of automatic speech recognition (ASR) systems which are developed for agglutinative languages like Turkish, are examined and an improvement to this problem is proposed. It has been shown that using sub-word language models outperforms word based models by reducing the OOV word ratio in languages with complex morphology.
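As a rough illustration of what sub-word modeling buys you, here is a minimal byte-pair-encoding (BPE) sketch, one common way to build sub-word units; this is an assumption for illustration, not a claim about what these papers or Google actually use. An unseen agglutinated word gets split into sub-words the model has seen, instead of becoming a single OOV token:

```python
import re
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a word -> frequency dict (toy version)."""
    # Represent each word as space-separated symbols, initially characters.
    vocab = {" ".join(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append((a, b))
        # Merge the pair everywhere, respecting symbol boundaries.
        pat = re.compile(r"(?<!\S)" + re.escape(a + " " + b) + r"(?!\S)")
        vocab = {pat.sub(a + b, w): f for w, f in vocab.items()}
    return merges

def segment(word, merges):
    """Apply learned merges in order to split an unseen word into sub-words."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Tiny Turkish-flavoured toy corpus (frequencies are made up).
freqs = {"evler": 10, "evlerim": 5, "evde": 7, "gözler": 6, "yol": 4}
merges = learn_bpe(freqs, 8)
print(segment("evlerimde", merges))  # an unseen word split into known sub-words
```

The language model is then trained over these sub-word units, so the effective vocabulary stays closed even though the word-level vocabulary is open-ended.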

or from {2}:

Abstract: Turkish speech recognition studies have been accelerated recently. With these efforts, not only available speech and text corpus which can be used in recognition experiments but also proposed new methods to improve accuracy has increased. Agglutinative nature of Turkish causes out of vocabulary (OOV) problem in Large Vocabulary Continuous Speech Recognition (LVCSR) tasks. In order to overcome OOV problem, usage of sub-word units has been proposed. In addition to LVCSR experiments, there have been some efforts to implement a speech recognizer in limited domains such as radiology. In this paper, we will present Turkish speech recognition software, which has been developed by utilizing recent studies. Both interface of software and recognition accuracies in two different test sets will be summarized. The performance of software has been evaluated using radiology and large vocabulary test sets. In order to solve OOV problem practically, we propose to adapt language models using frequent words or sentences. In recognition experiments, 90% and 44% word accuracies have been achieved in radiology and large vocabulary test sets respectively.


References:

Franck Dernoncourt