
I am working with random forests on financial data (predicting whether a stock rises or falls).

I found that I get better performance if I build one model for "rising" and one for "falling".

After building the models with the best feature sets, I then just manipulate the cut-off values of my probabilities (e.g. > 0.85 for "rising").

Still, is there a way to train random forests (or other models) a priori for positive predictive value, so that the model only predicts "rise" with a low false positive rate? I don't care about accuracy.

user670186

1 Answer


In a portfolio-optimizing strategy, it is an advantage to predict both the magnitude and the probability of rises and falls.

Regression solution: Convert your time-series prices into a stationary measure such as the log difference or relative change. The RF-predicted price change then incorporates both magnitude and pseudo-probability, and is therefore a reasonable basis for ranking stocks.
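
A minimal sketch of this regression route, assuming Python with scikit-learn (the thread itself references R's randomForest, but the idea is the same). The price series is made up, and the `lags` look-back window and all parameter values are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical price series; in practice, load your own data.
rng = np.random.default_rng(0)
prices = 100 * np.cumprod(1 + 0.01 * rng.standard_normal(1000))

# Stationary target: one-step log returns instead of raw prices.
log_ret = np.diff(np.log(prices))

# Naive features: the previous `lags` returns. (Feature engineering
# should go well beyond this, as noted further down.)
lags = 5
X = np.column_stack([log_ret[i:len(log_ret) - lags + i] for i in range(lags)])
y = log_ret[lags:]

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X[:-200], y[:-200])  # fit on the older part of the series

# The predicted change carries both sign and magnitude, so it can be
# used directly to rank stocks (or time points) from best to worst.
pred = rf.predict(X[-200:])
ranking = np.argsort(pred)[::-1]  # most promising first
```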

Binary solution: You could convert your time series into rise and fall events and classify these. Train e.g. 500 trees and use the vote ratios to rank your most and least promising stocks. In a realistic situation your best vote ratio would be something like 270-230, as it is really difficult to predict well. You do not have to retrain to manipulate cut-offs: just extract the vote ratio and implement your own rule (cut-off, ranking, etc.), as in the sketch below.
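
A sketch of the binary route, reusing `X` and `y` from the regression sketch above. In scikit-learn, `predict_proba` is the closest analogue of randomForest's vote ratio, and the 0.85 cut-off is the one from the question:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Binary target derived from the log returns: 1 = rise, 0 = fall.
y_cls = (y > 0).astype(int)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X[:-200], y_cls[:-200])

# predict_proba exposes the (averaged) tree votes; changing the
# decision rule afterwards needs no retraining.
p_rise = clf.predict_proba(X[-200:])[:, 1]

strict_buy = p_rise > 0.85             # the aggressive cut-off from the question
top10 = np.argsort(p_rise)[::-1][:10]  # or simply rank and take the top
```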

You can plot an ROC curve of your predictions vs. outcomes to learn what a good cutoff would be. Stock-price prediction is imperfect modelling: you will be happy to predict even a small component of the total volatility, and you will not find a subset of predictions with a truly low false positive rate. Accept that you can only know very little about many events. Active trading on predictions should be hedged over thousands of investments to ensure a high probability of a positive net income.
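
A sketch of reading a cutoff off the ROC curve, continuing the classification sketch above; the 5% false-positive-rate target is an arbitrary example:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Score the held-out predictions against the held-out outcomes.
fpr, tpr, thresholds = roc_curve(y_cls[-200:], p_rise)

plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC on held-out predictions")
plt.show()

# One possible rule: the last threshold that keeps FPR at or below 5%
# (assumes the curve actually crosses 5% FPR somewhere).
cutoff = thresholds[np.argmax(fpr > 0.05) - 1]
```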

You need to give feature engineering some attention. Only inputting prior days directly will give you a mediocre model; you need to build an outstanding model to consistently outperform the market. If you could just take an off-the-shelf model and set it to make safe predictions, someone would have done that before you, and it would already be incorporated in the market pricing.
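
For concreteness, a few aggregate features of the kind meant here, sketched with pandas on the price series from above; the specific features and window lengths (20, 50) are illustrative examples, not a prescription:

```python
import numpy as np
import pandas as pd

# Summaries over longer, less-correlated windows rather than raw
# prior-day values.
s = pd.Series(prices)
log_s = np.log(s)
feats = pd.DataFrame({
    "ret_1":  log_s.diff(),                    # last-step return
    "mom_20": log_s.diff(20),                  # 20-step momentum
    "vol_20": log_s.diff().rolling(20).std(),  # realized volatility
    "ma_gap": s / s.rolling(50).mean() - 1,    # distance from moving average
}).dropna()
```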

  • Thanks for all the comments. So your answer to the question is basically that there is no way to optimize for positive or negative predictive value besides choosing the right cut-off. – user670186 Aug 05 '15 at 20:13
  • Let me clarify a bit more what I am actually doing: I am just working on one stock at the moment to play around with classification. I did indeed convert the time series into rise and fall events, then created random forests to classify different time gaps (e.g. 1 hour later, 30 minutes later). It turns out that random forests train extremely well on the data, with around 2% training error from just a few predictors. It also turns out, however, that the model's performance on new data has high accuracy (70-80%) but low positive or negative predictive value, as I am training one model for low and another for high... – user670186 Aug 05 '15 at 20:17
  • The direct training error for RF classification models will be close to 0%: http://stats.stackexchange.com/questions/162353/what-measure-of-training-error-to-report-for-random-forests – Soren Havelund Welling Aug 05 '15 at 21:27
  • 70-80% accuracy: is that correctly classified test samples? What is the balance of the training set? If it is 70 rise / 30 fall, then an "only rise" model would do equally well. AUC ROC could be an overall measure of predictive power to optimize on. – Soren Havelund Welling Aug 05 '15 at 21:46
  • ...besides the cut-off, you can also play with class weights and downsampling/class stratification (see the sketch after these comments). It could be an advantage, but it is mainly used to compensate for unbalanced data. Here's a link on AUC ROC and downsampling, where you will also find further links on class weights: http://stats.stackexchange.com/questions/157714/r-package-for-weighted-random-forest-classwt-option/158030#158030 – Soren Havelund Welling Aug 05 '15 at 21:52
  • To improve the model: try lowering the bootstrap sample size to ~60% of N to de-correlate the trees further (also shown in the sketch below). http://stats.stackexchange.com/questions/144397/r-randomforest-r-replace-true-pros-and-cons/147709#147709 Try to devise better features, e.g. aggregates of distant, less correlated events. – Soren Havelund Welling Aug 05 '15 at 21:58
  • Good luck :) Maybe show some code and data. – Soren Havelund Welling Aug 05 '15 at 22:01
  • Thanks for the comments. (a) Regarding weights: from the readings I understood that "classwt" is only useful if my test-set balance differs from the training-set balance, which is not the case here; for the "high" classifier it's about 30% high, 70% low in both testing and training. Or did I get this wrong, and it applies whenever there is imbalance away from 50:50? (b) The article said that lowering the bootstrap sample size is only useful if the OOB error is < 50%, which is also not my case. (c) On features I definitely agree. – user670186 Aug 06 '15 at 13:08
  • Btw, I found out that optimizing for AUC is actually not a good way to go. I did some tests, and models with lower AUC got higher positive/negative predictive value resp. lower false positives. That is because I am aiming for a small number of hits with low false positives, and the overall AUC doesn't optimize for the lower or upper end of the curve :) – user670186 Aug 08 '15 at 13:49
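
The comments above mention class weights (randomForest's classwt) and a smaller bootstrap sample size (sampsize). A rough scikit-learn analogue of those two knobs, assuming the `X` and `y_cls` objects from the earlier sketches; the class-weight values and the 60% fraction are illustrative, and `max_samples` requires scikit-learn >= 0.22:

```python
from sklearn.ensemble import RandomForestClassifier

# class_weight ~ randomForest's classwt: up-weight "fall" so that a
# false "rise" call costs more, pushing the model toward higher PPV.
# max_samples ~ a smaller sampsize: bootstrap ~60% of N per tree to
# de-correlate the trees further.
clf = RandomForestClassifier(
    n_estimators=500,
    class_weight={0: 2.0, 1: 1.0},  # assumed weights; tune for your PPV target
    max_samples=0.6,                # fraction of N per bootstrap (bootstrap=True)
    random_state=0,
)
clf.fit(X[:-200], y_cls[:-200])
```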