detect number of peaks in audio recording

Question

I'm trying to figure out how to detect the number of syllables in a corpus of audio recordings. I think a good proxy might be peaks in the wave file.

Here's what I tried with a file of me speaking in English (my actual use case is in Kiswahili). The transcript of this example recording is: "This is me trying to use the timer function. I'm looking at pauses, vocalizations." There are a total of 22 syllables in this passage.

wav file: https://www.dropbox.com/s/koqyfeaqge8t9iw/test.wav?dl=0

The seewave package in R is great, and there are several potential functions. First things first, import the wave file.

library(seewave)
library(tuneR)
w <- readWave("YOURPATHHERE/test.wav")  
w
# Wave Object
# Number of Samples:      278528
# Duration (seconds):     6.32
# Samplingrate (Hertz):   44100
# Channels (Mono/Stereo): Stereo
# PCM (integer format):   TRUE
# Bit (8/16/24/32/64):    16

The first thing I tried was the timer() function. One of the things it returns is the duration of each vocalization. This function identifies 7 vocalizations, which is far short of 22 syllables. A quick look at the plot suggests that vocalizations do not equal syllables.

t <- timer(w, threshold=2, msmooth=c(400,90), dmin=0.1)
length(t$s)
# [1] 7
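To make clear what a threshold-based detector of this kind actually counts, here is a base R sketch run on a synthetic signal. It is not seewave's implementation of timer(); the helper name `count_bursts`, the moving-average envelope, and the test signal are all assumptions made for illustration.

```r
# Hypothetical helper (not part of seewave): count stretches where a smoothed
# amplitude envelope stays above a threshold for at least dmin seconds.
count_bursts <- function(x, f, threshold = 10, win = 0.02, dmin = 0.1) {
  k <- max(1, round(win * f))                             # smoothing window in samples
  env <- stats::filter(abs(x), rep(1 / k, k), sides = 2)  # moving-average envelope
  env[is.na(env)] <- 0                                    # window edges are undefined
  above <- as.vector(env) > (threshold / 100) * max(env)  # threshold as % of the maximum
  runs <- rle(above)                                      # consecutive TRUE/FALSE runs
  sum(runs$values & runs$lengths >= dmin * f)             # keep only long-enough bursts
}

# Synthetic test signal: three 0.3 s tone bursts separated by silence.
f <- 8000
tt <- seq(0, 2, by = 1 / f)
gate <- (tt > 0.2 & tt < 0.5) | (tt > 0.8 & tt < 1.1) | (tt > 1.4 & tt < 1.7)
x <- sin(2 * pi * 200 * tt) * gate

n_bursts <- count_bursts(x, f)
n_bursts
```

A detector like this merges everything that is not separated by a clear drop in amplitude, which is consistent with it finding far fewer events than there are syllables in connected speech.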

I also tried the fpeaks function without setting a threshold. It returned 54 peaks.

ms <- meanspec(w)
peaks <- fpeaks(ms)

Note that fpeaks operates on the mean frequency spectrum, so these are peaks in amplitude by frequency rather than by time. Setting the threshold argument to 0.005 filters out noise and reduces the count to 23 peaks, which is pretty close to the actual number of syllables (22).

I'm not sure this is the best approach. The result will be sensitive to the value of the threshold parameter, and I have to process a big batch of files. Any better ideas about how to code this to detect peaks that represent syllables?

This is a very interesting question, but you might get better help on methods over at the [Stack Exchange Signal Processing Q&A site](http://dsp.stackexchange.com/). — eipi10, Apr 14 '16 at 19:23
ok, thanks. will check it out if no one responds. much appreciated. — Eric Green, Apr 14 '16 at 19:30
Just an idea, but would it be worthwhile to consider undertaking [change point analysis](http://www.variation.com/cpa/tech/changepoint.html)? The analysis can be undertaken easily in **R** with use of the [**`changepoint`**](https://cran.r-project.org/web/packages/changepoint/changepoint.pdf) package. Simply put, the change point analysis focuses on *detecting change,* the linked example is concerned with trade data but it could be interesting to apply this technique to sound data. — Konrad, Apr 29 '16 at 19:00
I'm going to accept the answer that has the most votes, which happens to be my attempt to implement another CV idea. I think the core question remains however: how to use features of the recordings to accurately detect a number of peaks that corresponds to the number of syllables spoken. Thank you for all of the ideas. I will post back here when I have a solution. — Eric Green, May 01 '16 at 18:21

score 6 · Accepted Answer · edited Apr 13 '17 at 12:44

I don't think what follows is the best solution, but @eipi10 had a good suggestion to check out this answer on CrossValidated. So I did.

A general approach is to smooth the data and then find peaks by comparing a local maximum filter to the smooth.

The first step is to create the argmax function:

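argmax <- function(x, y, w=1, ...) {
  require(zoo)  # provides rollapply()
  n <- length(y)
  # Smooth the series; loess arguments such as span are passed via ...
  y.smooth <- loess(y ~ x, ...)$fitted
  # Rolling maximum of the smooth over windows of width 2*w+1
  y.max <- rollapply(zoo(y.smooth), 2*w+1, max, align="center")
  # Compare each window maximum with the smooth at the window centre
  delta <- y.max - y.smooth[-c(1:w, n+1-1:w)]
  # Peaks are the points where the smooth equals its local maximum
  i.max <- which(delta <= 0) + w
  list(x=x[i.max], i=i.max, y.hat=y.smooth)
}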

Its return value includes the arguments of the local maxima (x)--which answers the question--and the indexes into the x- and y-arrays where those local maxima occur (i).
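A quick way to sanity-check a peak finder of this kind is to run it on synthetic data where the number of peaks is known. The sketch below is a base R variant of the same idea (it swaps zoo's rollapply for embed() so that it runs without extra packages). The function name `argmax_base`, the simulated curve, and the parameter values are assumptions for illustration.

```r
# Base R variant of the smooth-then-local-maximum idea (hypothetical name).
argmax_base <- function(x, y, w = 1, ...) {
  n <- length(y)
  y.smooth <- loess(y ~ x, ...)$fitted            # smooth the series
  # Rolling maximum over centred windows of width 2*w + 1
  y.max <- apply(embed(y.smooth, 2 * w + 1), 1, max)
  # A point is a peak when it equals the maximum of its own window
  delta <- y.max - y.smooth[(w + 1):(n - w)]
  i.max <- which(delta <= 0) + w
  list(x = x[i.max], i = i.max, y.hat = y.smooth)
}

# Synthetic curve with three known bumps (centred at 2, 5 and 8) plus noise
set.seed(1)
x <- seq(0, 10, length.out = 500)
y <- dnorm(x, 2, 0.5) + dnorm(x, 5, 0.5) + dnorm(x, 8, 0.5) + rnorm(500, sd = 0.02)

peaks <- argmax_base(x, y, w = 25, span = 0.1)
length(peaks$x)    # number of peaks found
round(peaks$x, 1)  # their locations
```

With a wide window (here w = 25, roughly one unit of x) small wiggles in the smooth cannot qualify as peaks, which is why the choice of w matters as much as the span.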

I made minor modifications to the test plotting function: (a) to explicitly define x and y and (b) to show the number of peaks:

test <- function(x, y, w, span) {
  # Find peaks on the loess-smoothed series
  peaks <- argmax(x, y, w=w, span=span)

  # Raw data in gray, with the parameters and peak count in the title
  plot(x, y, cex=0.75, col="Gray", main=paste("w = ", w, ", span = ",
                                              span, ", peaks = ",
                                              length(peaks$x), sep=""))
  lines(x, peaks$y.hat,  lwd=2)  # smoothed curve
  y.min <- min(y)
  # Dashed red line from the baseline up to each detected peak
  sapply(peaks$i, function(i) lines(c(x[i],x[i]), c(y.min, peaks$y.hat[i]),
                                    col="Red", lty=2))
  points(x[peaks$i], peaks$y.hat[peaks$i], col="Red", pch=19, cex=1.25)
}

Like the fpeaks approach I mentioned in my original question, this approach also requires a good deal of tuning. I won't know the "right" answer (i.e., the number of syllables/peaks) going into this, so I'm not sure how to define a decision rule.

par(mfrow=c(3,1))
test(ms[,1], ms[,2], 2, 0.01)
test(ms[,1], ms[,2], 2, 0.045)
test(ms[,1], ms[,2], 2, 0.05)

At this point fpeaks seems a little less complicated to me, but still not satisfying.
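One possible way to frame the decision rule, assuming a small subset of files with hand-counted syllables is available: treat the smoothing and threshold parameters as things to be tuned, and pick the combination that minimizes the counting error on that subset. The sketch below uses simulated series in place of real recordings. The helper names `count_peaks` and `make_series`, the grid values, and the simulated data are all assumptions, not a tested recipe for speech.

```r
# Hypothetical peak counter: local maxima of a loess smooth that rise above
# a fraction `thr` of the smooth's maximum.
count_peaks <- function(y, span, thr) {
  x <- seq_along(y)
  s <- loess(y ~ x, span = span)$fitted
  i <- which(diff(sign(diff(s))) == -2) + 1   # slope changes from + to -
  sum(s[i] > thr * max(s))
}

# Simulated "labelled" set: each series has a known number of bumps.
set.seed(42)
make_series <- function(k, n = 400) {
  centres <- seq(0, 1, length.out = k + 2)[-c(1, k + 2)]
  x <- seq(0, 1, length.out = n)
  bumps <- rowSums(sapply(centres, function(m) dnorm(x, m, 0.25 / (k + 1))))
  bumps / max(bumps) + rnorm(n, sd = 0.05)
}
truth <- c(3, 5, 8)                 # stands in for hand-counted syllables
train <- lapply(truth, make_series)

# Grid search: keep the (span, thr) pair with the lowest mean absolute error.
grid <- expand.grid(span = c(0.05, 0.1, 0.2, 0.4), thr = c(0.1, 0.3, 0.5))
grid$mae <- apply(grid, 1, function(p) {
  est <- sapply(train, count_peaks, span = p[["span"]], thr = p[["thr"]])
  mean(abs(est - truth))
})
best <- grid[which.min(grid$mae), ]
best
```

The selected parameters would then be applied unchanged to the unlabelled files, and the error on a held-out set of labelled files would indicate whether constant parameters are good enough.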

answered Apr 14 '16 at 21:01 by Eric Green · edited Apr 13 '17 at 12:44 by Community

It might be unsatisfying because your loess parameters do not do enough smoothing. The choice of smoother needs to be guided by the nature of the data and the objectives; it is not something to be left to whatever is offered by the computing platform and the default values it supplies. – whuber Apr 15 '16 at 16:27
These are not defaults. Just examples. I'm puzzled by the larger challenge of unsupervised learning in this case. I don't know the number of syllables in the recordings, so I'm not sure how to tune a batch of files. Constant parameters probably don't make sense, but I'm not sure how to set up some other decision rules (e.g., other metrics of the wave that could be used to determine optimal values for these parameters). I'm thinking I need to create a training set that helps some algorithm set these parameters. Not sure though. – Eric Green Apr 15 '16 at 16:39
In your command to `loess`, I see no arguments explicitly given for the degree of smoothing. Actually, there's little point to running loess over a moving window: it already does that internally. – whuber Apr 15 '16 at 16:41
I see your point. I assumed that `w` was an argument in the smoothing. This is [how the author of the original solution](http://stats.stackexchange.com/questions/36309/how-do-i-find-peaks-in-a-dataset) described the function: "There are two parameters to be tuned to the circumstances: w is the half-width of the window used to compute the local maximum...Another--not explicit in this code--is the span argument of the loess smoother." – Eric Green Apr 15 '16 at 16:47
That author included `w` as one of the parameters because he had in mind a very general approach in which the smoother might not be loess but perhaps would be a windowed median, or Hanning, or anything else deemed appropriate for the statistical behavior of the data and the objectives of the analyst. The properties of many of those smoothers would depend on the width of the window. – whuber Apr 15 '16 at 17:03
That's helpful. Thanks. I need to better understand the parameters. That said, I think my earlier comment on the unsupervised nature of the task is still my main source of confusion. – Eric Green Apr 15 '16 at 17:06
Although the task may be unsupervised, it is likely that much can be learned from a few representative sets of data about common statistical features of all future circumstances, which you can exploit in your choice of smoother. "Unsupervised" does not have to mean "mindless" (although I'm sure you didn't intend to imply that)! – whuber Apr 15 '16 at 18:21

score 1 · Answer 2 · answered Apr 19 '16 at 19:15

I had similar problems when analysing protein electrophoresis profiles. I solved them by applying some of the functions of the msprocess R package to the second derivatives of the profiles (see https://fr.wikipedia.org/wiki/D%C3%A9pouillement_d'une_courbe#Position_et_hauteur_du_pic). This has been published here: http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12389/abstract;jsessionid=8EE0B64238728C0979FF71C576884771.f02t03
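To illustrate why second derivatives can help, here is a base R sketch on a synthetic curve in which a small peak sits on the shoulder of a larger one. It does not use the msprocess API; the curve and the helper `local_max` are assumptions for illustration.

```r
# Two overlapping Gaussian bumps: the smaller one is only a shoulder, so the
# curve itself has a single local maximum.
x <- seq(0, 10, by = 0.05)
y <- dnorm(x, 4, 0.6) + 0.4 * dnorm(x, 5.3, 0.6)

# Local maxima of a vector: the sign of the first difference flips from + to -
local_max <- function(v) which(diff(sign(diff(v))) == -2) + 1

# Discrete second derivative; peaks of y show up as minima of d2
d2 <- diff(y, differences = 2) / 0.05^2
i_direct <- local_max(y)     # peaks found on the raw curve
i_d2 <- local_max(-d2) + 1   # minima of d2; +1 realigns d2 indices with x

length(i_direct)  # how many peaks the raw curve shows
x[i_d2]           # approximate peak positions from the second derivative
```

Differencing amplifies noise, so on real recordings the signal would normally need smoothing before this step.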

I have no idea whether a similar solution can work for you. Good luck.

thanks, @user17493.bis. kudos to you for publishing with supplemental material. will make it so much easier for me to give this idea a try! — Eric Green, Apr 19 '16 at 19:37

score 1 · Answer 3 · Konrad · 2016-04-30T09:28:26.347

I would like to suggest a solution utilising the changepoint package. The simplistic example below attempts to identify peaks, defined here as change points, by looking at one channel of the available data.

Example

Data sourcing

# Libs
library(seewave)
library(tuneR)

# Download
tmpWav <- tempfile(fileext = ".wav")
download.file(url = "https://www.dropbox.com/s/koqyfeaqge8t9iw/test.wav?dl=0",
              destfile = tmpWav)

# Read
w <- readWave(filename = tmpWav)

Data preparation

# Libs
require(changepoint)

# Create time series data for one channel as an example
leftTS <- ts(data = w@left)

## Preview
plot.ts(leftTS)

[Figure: chart generated via the plot.ts call]

Change-point analysis

The changepoint package provides a number of options for identifying changes/peaks in the data. The code below provides only a simple example of finding up to 3 change points in variance using the BinSeg method:

# BinSeg method (example)
leftTSpelt <- cpt.var(data = leftTS, method = "BinSeg", penalty = "BIC", Q = 3)
## Preview
plot(leftTSpelt, cpt.width = 3)

[Figure: obtained chart showing the detected change points]

It is also possible to get the change point locations:

cpts(leftTSpelt)
# [1]  89582 165572 181053
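Because the series was built directly from the samples of the left channel, these values are sample indices. Dividing by the sampling rate (44100 Hz for this recording, per the readWave output in the question) converts them to seconds. A minimal sketch using the numbers above (the variable names are illustrative):

```r
samp_rate <- 44100                    # Hz, from the Wave object summary
cpt_idx <- c(89582, 165572, 181053)   # change points reported above
cpt_sec <- cpt_idx / samp_rate        # positions in seconds
round(cpt_sec, 2)
# [1] 2.03 3.75 4.11
```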

Side notes

The provided example is mostly concerned with illustrating how change point analysis can be applied to the provided data; caution should be exercised with respect to the parameters passed to the cpt.var function. A detailed explanation of the package and the available functionality is given in the following paper:

Killick, Rebecca and Eckley, Idris (2014) changepoint: An R package for changepoint analysis. Journal of Statistical Software, 58 (3). pp. 1-19.

`ecp`

ecp is another R package worth mentioning. It facilitates non-parametric multivariate change point analysis, which may be useful if one would like to identify change points occurring across multiple channels.

answered Apr 30 '16 at 09:17 by Konrad · edited Apr 30 '16 at 09:28

Thanks, @konrad. I did not know about either package, so thanks for taking the time to demo. I think the fundamental challenge I have with all of these packages is that I do not know how many peaks to look for, so I'm not sure how to tune the parameters. This still seems like a situation where I have to use some algorithm to determine how to set the parameters to accurately identify the correct number of peaks (i.e., syllables). – Eric Green Apr 30 '16 at 18:18
@EricGreen In principle, change point analysis would enable you to identify your peaks just by looking at the distribution. It would be a matter of applying a suitable method, penalties and so on. I would suggest that you have a look at the website linked in my previous comment, as it outlines the process in detail. – Konrad Apr 30 '16 at 18:24
I'm not sure if you literally mean eyeballing the distribution. I have 2000 files and need a way to automate this. Even if I could examine each file, I find it hard to see the number of syllables as peaks. Maybe I am being dense and I'll come to see the merits of this approach. I'm still stuck on needing a way to auto-tune the parameters for each file so that the resulting number of peaks detected is an accurate proxy for the number of syllables. – Eric Green Apr 30 '16 at 18:41
@EricGreen No, not literally of course. If you figure out the appropriate parameters to pass to one of the *cpt* functions, you will be able to run it across any number of objects. As I've no expertise in linguistics, I don't know whether syllables would correspond to the usual peaks observed in time series data. – Konrad Apr 30 '16 at 19:19
gotcha. I think I'm stumbling on the "figure out the appropriate parameters" step for this particular use case. But I've appreciated all of the ideas and learned about a few new packages that could be good alternatives to the ones I tried. – Eric Green Apr 30 '16 at 19:38
@EricGreen Ideally, if you have a file for which you know how many peaks to expect, you could attempt to find a configuration that yields the desired results and then apply it across the remaining files. – Konrad Apr 30 '16 at 19:40
I think you're right. I'm planning to create a test set of files and count the syllables. I'm just not certain yet how to train an algorithm to set the optimal parameters for detecting a number of peaks that corresponds to the syllable count. – Eric Green Apr 30 '16 at 19:44

score 0 · Answer 4 · answered Apr 28 '16 at 19:06

Here is a library in Python I used earlier while trying to estimate periodicity by finding peaks in the autocorrelation function.

It uses first-order differences/discrete derivatives for peak detection and supports tuning by threshold and minimum distance (between consecutive peaks) parameters. One can also enhance the peak resolution using Gaussian density estimation and interpolation (see link).
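For readers working in R, the general idea (first-order differences plus a height threshold and a minimum distance between peaks) can be sketched in base R as follows. This is not a port of the Python library; the function name `find_peaks`, its arguments, and the toy data are assumptions for illustration.

```r
# Hypothetical helper: peaks are points where the first difference changes
# from positive to negative; `thres` is relative to the data range and
# `min_dist` is the minimum spacing (in samples) between retained peaks.
find_peaks <- function(y, thres = 0.3, min_dist = 1) {
  i <- which(diff(sign(diff(y))) == -2) + 1       # + to - slope changes
  cutoff <- min(y) + thres * (max(y) - min(y))    # absolute height cutoff
  i <- i[y[i] > cutoff]
  keep <- integer(0)
  for (j in i[order(y[i], decreasing = TRUE)]) {  # tallest peaks first
    if (all(abs(j - keep) >= min_dist)) keep <- c(keep, j)
  }
  sort(keep)
}

y <- c(0, 1, 0, 5, 4, 4.5, 0, 2, 0, 6, 0)
p_all  <- find_peaks(y, thres = 0.1, min_dist = 1)
p_far  <- find_peaks(y, thres = 0.1, min_dist = 3)
p_tall <- find_peaks(y, thres = 0.5, min_dist = 1)
p_all
# [1]  2  4  6  8 10
p_far
# [1]  4 10
p_tall
# [1]  4  6 10
```

The minimum-distance rule keeps the tallest peak in each neighbourhood, which is one way to stop a single noisy event from being counted several times.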

It worked quite well out of the box for me without much tweaking, even for noisy data. Give it a try.

Thanks, @tool.ish. It looks like a good alternative to the R methods I cited. I think I'd still have the tuning challenge, however. — Eric Green, Apr 29 '16 at 10:46