Are MFCCs the optimal method of representing music to a retrieval system?

Question

A signal processing technique, the mel-frequency cepstrum, is often used to extract information from a musical piece for use in a machine learning task. This method gives a compact representation of the short-term power spectrum, and the resulting mel-frequency cepstral coefficients (MFCCs) are used as input features.
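For concreteness, here is a minimal sketch of how MFCCs could be computed for a single frame, using only the Python standard library. The function names, the Hamming window, the 20 triangular filters, the 13 coefficients, and the 440 Hz test tone are all assumptions made for illustration; real systems typically use an FFT and a tuned library implementation rather than the naive DFT shown here.

```python
import cmath
import math


def hz_to_mel(hz):
    # One common mel-scale formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power spectrum of one Hamming-windowed frame via a naive O(n^2) DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    # Fractional bin positions keep every triangle's width strictly positive.
    bins = [hz * n_fft / sample_rate for hz in edges_hz]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13):
    """Power spectrum -> mel filterbank energies -> log -> DCT-II."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Floor the energies so the logarithm is always defined.
    log_energy = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10))
                  for row in bank]
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_energy))
            for c in range(n_coeffs)]


# Illustrative input: one 256-sample frame of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(256)]
coeffs = mfcc(frame, sample_rate)
```

A full piece would be cut into many overlapping frames, giving one coefficient vector per frame; how those vectors are pooled or sequenced is where the question about time-varying characteristics comes in.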

In designing music retrieval systems, such coefficients are considered characteristic of a piece (obviously not necessarily unique, but distinguishing). Are there any characteristics that would better suit learning with a network? Would time-varying characteristics, like the bass progression of the piece, fed into something like an Elman network work more effectively?

Which characteristics would form an extensive enough set upon which classification could take place?

Are you working on retrieval, where you're looking for unique qualities of a particular audio clip? Or do you want to identify similar music? – Andrew Rosenberg, Feb 14 '12 at 13:30
@AndrewRosenberg More along the lines of identifying similar music. – jonsca, Feb 14 '12 at 13:49
(Years later) There are many, many ways to tinker with MFCCs; Kinnunen et al., [Frequency Warping and Robust Speaker Verification: A Comparison of Alternative Mel-Scale Representations](http://cs.joensuu.fi/pages/tkinnu/webpage/pdf/melwarp_IS2013.pdf), 2013, 5 pp., use 60 coefficients. And optimize what? On what non-open database? So I'd say (as a non-expert) that the question is too broad to be answerable. – denis, Jan 16 '14 at 17:39
@denis Thanks for the information. This came over from the ill-fated Machine Learning Beta (the first time around). I appreciate that it is a bit vague. — jonsca, Jan 16 '14 at 23:57

score 8 · Accepted Answer · answered Feb 13 '12 at 10:53

We did a bit of work on this at one point. The set of features we extracted is given in this NIPS workshop paper. I have to admit we couldn't replicate the results of some other authors in the field, although there were some doubts about the datasets used in those studies (note that the datasets used by authors in this field tend to be hand-picked and not released to the public for copyright reasons, although this is not always the case). Essentially, our features were all short-term spectral features, with autoregression (AR) coefficients thrown in too. We were looking at classification of genre, which we know humans can do (although not with wonderful accuracy, and not with consistent agreement) in very short timespans (<1s), which validates the use of short-term features. If you're interested in doing more complicated things than the typical genre/artist/album/producer classification, then you might need more long-range features; otherwise these short-term spectral features tend to perform best.
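As an illustration of the AR coefficients mentioned above, the sketch below fits an autoregressive model to one window of samples using the autocorrelation method and the Levinson-Durbin recursion. This is an assumed, textbook way of getting such coefficients and not necessarily what was done in the workshop paper; the function names, the model order, and the synthetic test signal are all made up for the example.

```python
import random


def autocorrelation(frame, max_lag):
    """Unnormalised autocorrelation r[0..max_lag] of one window of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def ar_coefficients(frame, order):
    """Fit x[t] ~ sum_j a[j] * x[t - 1 - j] with the Levinson-Durbin recursion.

    Returns (a, err), where err is the remaining prediction-error energy.
    """
    r = autocorrelation(frame, order)
    if r[0] == 0.0:
        # Silent window: nothing to model.
        return [0.0] * order, 0.0
    a = []
    err = r[0]
    for i in range(order):
        # Reflection coefficient for extending the model from order i to i + 1.
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / err
        a = [a[j] - k * a[i - 1 - j] for j in range(i)] + [k]
        err *= 1.0 - k * k
    return a, err


# Synthetic check (assumed data): a known AR(2) process driven by seeded noise.
random.seed(0)
signal = [0.0, 0.0]
for _ in range(2000):
    signal.append(1.2 * signal[-1] - 0.5 * signal[-2] + random.gauss(0.0, 1.0))

ar_coeffs, residual = ar_coefficients(signal, order=2)
# ar_coeffs should land close to the generating values 1.2 and -0.5.
```

In a feature-extraction setting, the coefficients from each short window would simply be appended to that window's spectral features.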

tdc


What was the purpose of throwing in the AR coefficients? – jonsca Feb 13 '12 at 11:20
@jonsca Since we were using boosting methods, which work by combining many "weak" learners, we decided to use any features that could be easily calculated and that could provide some benefit. All that is required of a weak learner for it to be useful is that it can classify at greater than chance levels. The AR coefficients are equivalent to a compression of the spectral envelope, which gives some notion of the short-term information complexity of the music within that window, although only very loosely. – tdc Feb 13 '12 at 11:27
@tdc, "datasets tend to be not released to the public ...": would you know of any free online datasets of speech, with phonemes labelled ? – denis Jan 15 '14 at 15:39
@denis the only one I know of is this one: http://orange.biolab.si/datasets/phoneme.htm – tdc Jan 15 '14 at 15:42
@tdc, thanks, but that's only 11 vowels from Elements of stat learning, ~ 1000 x 11 features (ancient LPC). – denis Jan 16 '14 at 16:27
