Questions tagged [speech-recognition]

Automatic speech recognition (ASR) aims to identify words and phrases in spoken language and convert them to a machine-readable format.

53 questions
7
votes
1 answer

Turkish speech recognition (speech->text) in Google Speech API?

Google's Speech API offers audio speech-to-text in multiple languages, and it supports Turkish too. That language is very interesting: it is so-called agglutinative, meaning you stick word parts one after another instead of using prepositions and other parts…
Aksakal
3
votes
0 answers

The state-of-the-art methods for speech recognition?

Recently I noticed that Google and Apple have really high quality speech-recognition services. I was wondering about the state-of-the-art methods and techniques they are/might be using to achieve such quality. I already know that Hidden Markov…
3
votes
0 answers

Validation loss is less than training loss by 5 units. How is this result interpreted?

I am training a Keras model for end-to-end speech recognition. I have my own dataset of about 400 speech wave files. Text transcriptions are also given as input. Model summary is: Layer (type) Output Shape Param # the_input (InputLayer)…
3
votes
0 answers

Program to evaluate the output of a speech recognition system

I am looking for a library, script or program that can evaluate the output of a speech recognition system. The output of the speech recognition system is a simple text file, and I have the gold output in the same format. I have crossposted the…
2
votes
0 answers

Confusion about the derivative in CTC

I was going through the original CTC paper by Graves et al., and I am still not getting how, after taking the derivative of equation 14, we get equation 15 as shown below. I understand the part that we are considering only those paths that involve the label…
2
votes
1 answer

Mismatching dimensions of input/output in the WaveNet model for text-to-speech generation?

I have been trying to understand how speech generation works, particularly in the WaveNet model by Google. I was referring to the original WaveNet paper and this implementation: I find the model very confusing in the input it takes and the…
2
votes
1 answer

Method for detecting previously unseen class

Is there any common practice for detecting a new class, or data associated with a previously unseen event? I'm doing some research into speech recognition, and I'm trying to detect when a speech recognizer encounters a speaker it hasn't seen…
Cerin
2
votes
1 answer

When use CTC-loss for speech recognition?

I'm trying to understand and implement CTC loss for speech recognition (here on SO). I'd like to have more information about the use cases of this technique. From what I understood, it is more dedicated to understanding sentences (e.g. "Please close…
2
votes
1 answer

Streaming audio to neural network

I am trying to create a neural network that performs speaker recognition. I would like to be able to serve it such that it takes streaming audio - i.e. I want to perform partial recognition on 100ms frames and then calculate an average at the end. I…
2
votes
1 answer

How to use GMMs for acoustic signal classification?

There are a number of applications of Gaussian Mixture Models (GMMs) to acoustics/audio data for the purposes of classification; e.g. paper1 and paper2. GMMs for clustering and source position generation can be understood. What is…
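The classification use of GMMs asked about here is usually likelihood scoring: fit one mixture per class, then assign a test vector to the class whose model gives it the highest log-likelihood. A minimal numpy sketch, using single-component diagonal-covariance Gaussians as a stand-in for full mixtures; the synthetic data and class names are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D "acoustic feature" data for two classes (synthetic, for
# illustration only -- real systems would use MFCC vectors).
speech = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
noise = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(200, 2))

def fit_gaussian(X):
    """Fit a single diagonal-covariance Gaussian (a 1-component GMM)."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def log_likelihood(X, mean, var):
    """Per-sample log density under the diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var,
                         axis=1)

models = {"speech": fit_gaussian(speech), "noise": fit_gaussian(noise)}

# Classify a test point by the class whose model assigns it the
# highest log-likelihood (equal class priors assumed).
x = np.array([[0.2, -0.1]])
scores = {c: log_likelihood(x, m, v)[0] for c, (m, v) in models.items()}
print(max(scores, key=scores.get))  # prints "speech"
```

With real mixtures the per-class model would simply sum weighted component densities; the decision rule stays the same.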
2
votes
1 answer

Verifying Time Warp

Time warping is widely assumed in the domain of speech processing. If $X_w(t)$ represents a time-warped version of $X(t)$, then $X_w(t) = X(t - w(t))$, where $w(t)$ is an arbitrary function with a bounded derivative. I think it has a direct relationship…
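The warped signal in this excerpt can be checked numerically by interpolating the sampled signal at the shifted times. A minimal numpy sketch, assuming a hypothetical warp $w(t) = 0.1\sin(2\pi t)$, whose derivative is bounded by $0.2\pi$:

```python
import numpy as np

# Sample X(t) = sin(2*pi*t) on a uniform grid.
t = np.linspace(0.0, 1.0, 1001)
x = np.sin(2 * np.pi * t)

# Hypothetical warp function with a bounded derivative (|w'| <= 0.2*pi).
w = 0.1 * np.sin(2 * np.pi * t)

# X_w(t) = X(t - w(t)), evaluated by linear interpolation of the samples.
x_warped = np.interp(t - w, t, x)

# Sanity check: with w(t) = 0 the warp is the identity, X_w(t) = X(t).
x_identity = np.interp(t - np.zeros_like(t), t, x)
print(np.allclose(x_identity, x))  # prints True
```

A nonzero warp visibly stretches and compresses the waveform locally, which is exactly the variability time-warp models are meant to capture in speech.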
2
votes
1 answer

Word Error Rate over Data Set

In speech to text, one common metric is the word error rate (WER). WER is the word-level Levenshtein distance, i.e. the minimum number of substitutions ($S$), deletions ($D$), and insertions ($I$) needed to transform the prediction into the ground truth…
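The metric described in this excerpt can be computed directly with dynamic programming over words. A minimal sketch (the function name is my own):

```python
from typing import List

def wer(reference: List[str], hypothesis: List[str]) -> float:
    """Word error rate: word-level Levenshtein distance (S + D + I)
    divided by the number of reference words."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dp[i][0] = i          # i deletions
    for j in range(cols):
        dp[0][j] = j          # j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            sub = dp[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dp[i][j] = min(sub,                # substitution (or match)
                           dp[i - 1][j] + 1,   # deletion
                           dp[i][j - 1] + 1)   # insertion
    return dp[-1][-1] / len(reference)

print(wer("the cat sat on the mat".split(),
          "the cat sit on mat".split()))  # 2 errors / 6 words = 0.333...
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, since the denominator is only the reference length.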
2
votes
1 answer

MFCCs and MoG-HMMs for speech recognition

BACKGROUND MFCCs are coefficients which represent the most important parts of speech; about 12 of them are used to model a single 512-sample frame of speech. Along with them you would use delta coefficients, which track the change of the MFCCs…
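The delta coefficients mentioned in this excerpt are commonly computed with the standard regression formula $d_t = \sum_{n=1}^{N} n (c_{t+n} - c_{t-n}) / (2 \sum_{n=1}^{N} n^2)$. A numpy sketch assuming a window of $N = 2$ and a made-up MFCC matrix:

```python
import numpy as np

def delta(coeffs: np.ndarray, N: int = 2) -> np.ndarray:
    """Delta (differential) features via the standard regression formula,
    computed per frame with edge padding at the boundaries.

    coeffs: (num_frames, num_mfcc) matrix of static MFCCs."""
    padded = np.pad(coeffs, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    deltas = np.zeros_like(coeffs, dtype=float)
    for t in range(coeffs.shape[0]):
        # padded[t + N] corresponds to frame t of the original matrix.
        deltas[t] = sum(n * (padded[t + N + n] - padded[t + N - n])
                        for n in range(1, N + 1)) / denom
    return deltas

# Hypothetical 5-frame, 3-coefficient MFCC matrix increasing by 3 per frame:
# on this linear ramp, interior deltas recover the slope (3.0).
mfcc = np.arange(15, dtype=float).reshape(5, 3)
print(delta(mfcc))
```

Delta-deltas (acceleration features) are obtained by applying the same function to the delta matrix.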
2
votes
2 answers

Why has deep learning only shown decent results in the fields of computer vision and speech recognition?

We all know about the success of ImageNet, AlphaGo etc which used deep neural networks in computer vision, or the use of RNNs in Google Translate. But why are we not seeing similar advances in other fields like finance?
1
vote
0 answers

Hidden Markov models in Speech Recognition

My first question here. I am trying to build a sign language translator (from signs to text) and noticed that the problem itself is quite similar to speech recognition, so I started to research that. Right now one thing I can't figure out…