I'm trying to figure out how to detect the number of syllables in a corpus of audio recordings. I think a good proxy might be peaks in the wave file.
Here's what I tried with a file of me speaking in English (my actual use case is in Kiswahili). The transcript of this example recording is: "This is me trying to use the timer function. I'm looking at pauses, vocalizations." There are a total of 22 syllables in this passage.
wav file: https://www.dropbox.com/s/koqyfeaqge8t9iw/test.wav?dl=0
The seewave
package in R is great, and there are several potential functions. First things first, import the wave file.
library(seewave)
library(tuneR)
w <- readWave("YOURPATHHERE/test.wav")
w
# Wave Object
# Number of Samples: 278528
# Duration (seconds): 6.32
# Samplingrate (Hertz): 44100
# Channels (Mono/Stereo): Stereo
# PCM (integer format): TRUE
# Bit (8/16/24/32/64): 16
The first thing I tried was the timer()
function. One of the things it returns is the duration of each vocalization. This function identifies 7 vocalizations, which is far short of 22 syllables. A quick look at the plot suggests that vocalizations do not equal syllables.
t <- timer(w, threshold=2, msmooth=c(400,90), dmin=0.1)
length(t$s)
# [1] 7
I also tried the fpeaks function without setting a threshold. It returned 54 peaks.
ms <- meanspec(w)
peaks <- fpeaks(ms)
This plots amplitude by frequency rather than time. Adding a threshold parameter equal to 0.005 filters out noise and reduces the count to 23 peaks, which is pretty close to the actual number of syllables (22).
I'm not sure this is the best approach. The result will be sensitive to the value of the threshold parameter, and I have to process a big batch of files. Any better ideas about how to code this to detect peaks that represent syllables?