The following excerpt is from Schwager's Hedge Fund Market Wizards (May 2012), an interview with the consistently successful hedge fund manager Jaffray Woodriff:
To the question "What are some of the worst errors people make in data mining?" he answers:
A lot of people think they are okay because they use in-sample data for training and out-of-sample data for testing. Then they sort the models based on how they performed on the in-sample data and choose the best ones to test on the out-of-sample data. The human tendency is to take the models that continue to do well in the out-of-sample data and choose those models for trading. That type of process simply turns the out-of-sample data into part of the training data because it cherry-picks the models that did best in the out-of-sample period. It is one of the most common errors people make and one of the reasons why data mining as it is typically applied yields terrible results.
The interviewer then asks: "What should you be doing instead?"
You can look for patterns where, on average, all the models out-of-sample continue to do well. You know you are doing well if the average for the out-of-sample models is a significant percentage of the in-sample score. Generally speaking, you are really getting somewhere if the out-of-sample results are more than 50 percent of the in-sample. QIM's business model would never have worked if SAS and IBM were building great predictive modeling software.
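To check my reading of the first answer, here is a toy simulation of the cherry-picking error he describes. It is entirely my own illustration (pure-noise "models", made-up sizes, nothing from the book or from QIM): none of the candidate models has any real edge, yet selecting the ones that happened to do well out-of-sample makes them look skilled.

```python
# Toy simulation of the selection bias Woodriff describes: every "model" here
# is pure noise with zero true edge, yet cherry-picking on out-of-sample
# performance manufactures the appearance of one.
import numpy as np

rng = np.random.default_rng(0)

n_models = 10_000                      # candidate models from a data-mining run
n_in, n_out, n_fresh = 250, 250, 250   # in-sample / out-of-sample / truly unseen days

# Each model's daily "returns" are i.i.d. noise: no skill by construction.
returns = rng.normal(loc=0.0, scale=0.01, size=(n_models, n_in + n_out + n_fresh))
in_sample  = returns[:, :n_in].mean(axis=1)
out_sample = returns[:, n_in:n_in + n_out].mean(axis=1)
fresh      = returns[:, n_in + n_out:].mean(axis=1)

# Step 1: keep the 10% of models that look best in-sample (legitimate training).
survivors = np.argsort(in_sample)[-n_models // 10:]

# Step 2 (the error): among the survivors, keep only those that ALSO did well
# out-of-sample.  This quietly turns the out-of-sample data into training data.
cherry_picked = survivors[out_sample[survivors] > np.median(out_sample[survivors])]

print("mean OOS return of cherry-picked models:", out_sample[cherry_picked].mean())
print("mean return of the same models on fresh data:", fresh[cherry_picked].mean())
# The first number looks impressive; the second is ~0, because the models never
# had an edge -- the selection step created the illusion of one.
```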
My questions
Does this make any sense? What exactly does he mean? Is there a name for the method he proposes, and are there references for it? Or did this guy find a holy grail nobody else understands? He even says in this interview that his method could potentially revolutionize science...
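For what it's worth, here is how I currently read his suggested alternative: score the model-generating process by how much of its average in-sample performance the whole population of models retains out-of-sample, rather than promoting individual out-of-sample survivors. This is only my interpretation; the 50 percent figure is his stated rule of thumb, and the function name, score definition, and numbers below are invented for illustration.

```python
# Minimal sketch of the population-level diagnostic I think Woodriff describes:
# average out-of-sample score across ALL candidate models, expressed as a
# fraction of the average in-sample score, with no cherry-picking of survivors.
import numpy as np

def oos_retention(in_sample_scores, out_sample_scores):
    """Average out-of-sample score as a fraction of the average in-sample score,
    computed over every candidate model the search produced."""
    return out_sample_scores.mean() / in_sample_scores.mean()

rng = np.random.default_rng(1)
n_models = 5_000

# Hypothetical scores (e.g., Sharpe ratios) for every model from the search.
in_sample_scores  = rng.normal(1.0, 0.3, n_models)   # what the search optimized
out_sample_scores = 0.4 * in_sample_scores + rng.normal(0.0, 0.3, n_models)

ratio = oos_retention(in_sample_scores, out_sample_scores)
print(f"out-of-sample retention: {ratio:.0%}")
# If this ratio is well below ~50%, the in-sample "edge" is mostly overfitting;
# if the population as a whole holds up out-of-sample, the process -- not any
# single lucky model -- may be finding something real.
```

Is that a fair reading, and does this diagnostic have an established name in the model-selection literature?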