7

I'm currently trying to predict the probability of low-probability events (~1%). I have a large DB with ~200,000 vectors (~2,000 positive examples) with ~200 features. I'm trying to find the best features for my problem. What are the recommended methods? (Preferably in Python or R, but not necessarily.)

Ferdi
user5497
  • possible duplicate: http://stats.stackexchange.com/questions/10608/questions-about-variable-selection-for-classification-and-different-classificati. The OP of that question has 150k examples, 100 features and class ratio of 95%/5% – mlwida Jul 21 '11 at 12:47
  • Thanks, this is a very similar case; however, our problem is finding the exact probability for each vector, not classifying it – user5497 Jul 21 '11 at 16:05

2 Answers

7

My first advice would be that unless identifying the informative features is a goal of the analysis, don't bother with feature selection and just use a regularised model, such as penalised logistic regression, ridge regression or an SVM, and let the regularisation handle the over-fitting. It is often said that feature selection improves classifier performance, but it isn't always true.
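For example, a minimal sketch of this approach in Python with scikit-learn (the random data stand in for the ~200 features and ~1% positive rate; the L2 penalty and log-loss as the cross-validation criterion are illustrative assumptions, not part of the answer itself):

```python
# Regularised (L2-penalised) logistic regression with no explicit feature
# selection; the regularisation strength C is chosen by cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 200))            # placeholder for the ~200 features
y = (rng.random(5000) < 0.01).astype(int)   # placeholder for the ~1% positives

model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=10, cv=5, penalty="l2",
                         scoring="neg_log_loss", max_iter=5000),
)
model.fit(X, y)

# Per-vector probability estimates rather than hard class labels.
probabilities = model.predict_proba(X)[:, 1]
```

The same pattern works with any classifier that exposes a complexity parameter; the point is that the penalty, rather than a separate feature-selection step, controls the over-fitting.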

To deal with the class imbalance problem, give different weights to the patterns from each class when calculating the loss function used to fit the model. Choose the ratio of weights by cross-validation (for a probabilistic classifier you can work out the asymptotically optimal weights, but this generally won't give optimal results on a finite sample). If you are using a classifier that can't give different weights to each class, then sub-sample the majority class instead, where again the ratio of positive and negative patterns is determined by cross-validation (make sure the test partition in each fold of the cross-validation procedure has the same relative class frequencies you expect to see in operation).
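As an illustration of tuning the weight ratio by cross-validation (a sketch with scikit-learn; the weight grid, the log-loss criterion and the placeholder data are my assumptions):

```python
# Choose the positive-class weight by cross-validation. Stratified folds keep
# each test partition at the class frequencies expected in operation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 200))
y = (rng.random(5000) < 0.01).astype(int)

param_grid = {"class_weight": [{0: 1, 1: w} for w in (1, 2, 5, 10, 20, 50)]}

search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid,
    scoring="neg_log_loss",   # or whatever criterion matches the application
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print("chosen class weights:", search.best_params_["class_weight"])
```

Using `StratifiedKFold` here reflects the advice above that each test partition should have the relative class frequencies you expect to see in operation.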

Lastly, it is often the case in practical applications with a class imbalance that false positives and false negatives are not equally serious, so incorporate this into the construction of the classifier.

Dikran Marsupial
  • Hi, thanks for your fast and interesting answer. We're familiar with those classifiers; however, our problem is a bit different. We try to estimate the exact probability for each vector (or sample). This can be done (as far as we know) by using regular logistic regression or naive Bayes. Both of those classifiers yield a probability for each vector (and not just some score). In addition, they both suffer from over-fitting when using too many variables. So we feel that we need a good feature selection scheme to find the best features for the classifier. – user5497 Jul 21 '11 at 16:06
  • In addition, we can't weight our samples and still get the right probability. We haven't tried PCA yet, but it's on our TODO :) Do you have any suggestions? Thanks a lot! – user5497 Jul 21 '11 at 16:12
  • @user5497 1. Naive Bayes does not deliver the exact probability, but a heavily skewed score due to the independence assumption; logistic regression, however, does. 2. The output of an SVM can be calibrated to approximate a probability (e.g. see http://stats.stackexchange.com/questions/10387/what-do-real-values-refer-to-in-supervised-classification ) – mlwida Jul 21 '11 at 16:18
  • mmm... very interesting. In this case, do you think we should apply weighting (or "zero" reduction) before the classification? I will also check this. Thanks! – user5497 Jul 21 '11 at 16:55
  • You mentioned several times that feature selection is not useful for modern classifiers. I'm wondering what this is based on. Is there something like a survey paper that tested K classifiers and N feature selection methods and concluded that they are useless? In general, I can easily believe that feature selection doesn't help SVMs, but there are many other classifiers which are widely used and for which feature selection seems useful. – SheldonCooper Jul 21 '11 at 19:08
  • @user5497 I use regularised logistic regression regularly (which also gives a probability vector); I expect there is an R implementation (I use my own MATLAB code), so it should be fine for your application. It is also used for tasks like microarray classification (there are several papers on this via Google Scholar; also search for "penalised logistic regression"). It is the regularisation that gives the robustness to over-fitting, and it can be used with a wide variety of classifiers. It is easy to add pattern weights for logistic regression, but you may need to write your own code. – Dikran Marsupial Jul 22 '11 at 10:37
  • @Sheldon I don't know of a survey paper on this, possibly because journals are not very receptive to papers with negative findings. It is fairly rare to find papers presenting feature selection methods that use a regularised classifier as a baseline (perhaps it ought to be a requirement!). IIRC the NIPS feature selection challenge was won by a team that essentially did no feature selection. Regularisation is the key (there is nothing really special about the SVM in that respect); any classifier with good control of model complexity ought to be O.K. with large numbers of features. – Dikran Marsupial Jul 22 '11 at 10:41
  • @sheldon As I mentioned in my answer to the related question, I think the idea that feature selection is necessary comes from traditional regression, where model complexity is based pretty much solely on the number of parameters. Regularisation gives a different view of model complexity, as the value of the regularisation parameter imposes a nested structure of models of increasing complexity, in which the number of parameters is much less relevant. This means the old wisdom about regression doesn't apply to e.g. ridge regression; but hey, RR has only been around since 1970 ;o) – Dikran Marsupial Jul 22 '11 at 10:46
2

The problem of estimating probabilities falls under the category of "regression," since the probability is a conditional mean. Classical feature selection (AKA "subset selection" or "model selection") methods for regression include best-subset, forward- and backward-stepwise selection, and forward stagewise regression, all described in Chapter 3 of The Elements of Statistical Learning. However, such methods are generally costly, and given the number of features in your dataset, my choice would be to use glmpath, which implements L1-regularized regression using a modification of the fantastically efficient LARS algorithm.

EDIT: More details on L1 regularization. The LARS algorithm produces the entire "lasso" path as $\lambda$ (the regularization constant) ranges from 0 to $\infty$. At $\lambda=0$, all features are used; at $\lambda=\infty$, none of the features have nonzero coefficients. In between there are values of $\lambda$ for which anywhere from 1 to 199 features are used.

Using the results from LARS, one can select the value of $\lambda$ with the best performance (according to whatever criterion is appropriate). Then, using only the features with nonzero coefficients at that $\lambda$, one can fit an unregularized logistic regression model for the final prediction.
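A sketch of this two-stage recipe in Python (glmpath itself is an R package; the code below approximates the path by sweeping a grid of $C = 1/\lambda$ values with scikit-learn's L1-penalised logistic regression rather than running the exact LARS algorithm, and the data and grid are illustrative placeholders):

```python
# Stage 1: trace an L1 (lasso-style) path over the regularisation constant and
# pick the value with the best cross-validated performance.
# Stage 2: refit an (effectively) unregularised logistic regression using only
# the features with nonzero coefficients at that value.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 200))
# Placeholder outcome: a rare event driven by a handful of the 200 features.
logits = -4.5 + X[:, 0] + 0.8 * X[:, 1] - 0.6 * X[:, 2]
y = (rng.random(5000) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

Cs = np.logspace(-3, 1, 20)        # C = 1/lambda; small C = strong penalty
best_score, best_support = -np.inf, None
for C in Cs:
    l1_model = LogisticRegression(penalty="l1", solver="liblinear",
                                  C=C, max_iter=5000)
    score = cross_val_score(l1_model, X, y, cv=5,
                            scoring="neg_log_loss").mean()
    support = np.flatnonzero(l1_model.fit(X, y).coef_[0])
    if score > best_score and support.size > 0:
        best_score, best_support = score, support

# Refit on the selected features only; a very large C makes the penalty
# negligible, i.e. an essentially unregularised logistic regression.
final = LogisticRegression(C=1e6, max_iter=5000)
final.fit(X[:, best_support], y)
probabilities = final.predict_proba(X[:, best_support])[:, 1]
```

Refitting without the penalty in the second stage addresses the concern raised in the comments below that the L1 shrinkage itself biases the estimated probabilities.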

charles.y.zheng
  • I agree; however, the problem with using L1/L2-regularized regression (LARS or logistic regression) is that, as far as I understand (and maybe I'm wrong), you don't get the right probability because of the regularization (is that correct?). In addition, we found that logistic regression gives us better results for probability estimation than linear regression when running on the entire dataset. – user5497 Jul 22 '11 at 09:13
  • One more thing: I had some problems running "advanced" classifiers (RF, L1, ...) on my whole dataset (because of its size...), so I'm thinking of using weighting / data reduction. But again, I have a problem with finding the right probability (I found that recovering the right probability after weighting is not straightforward..., though I haven't yet tried the methods that were mentioned above). Thanks! – user5497 Jul 22 '11 at 09:14
  • Charles, any comments on Donoho and Jin, [Higher criticism thresholding: Optimal feature selection when useful features are rare and weak](http://www.pnas.org/content/105/39/14790.full) (2008, 6p), or should I ask a separate question ? – denis Jul 22 '11 at 12:59
  • Unless there is a direct connection to the OP's problem, it would be best to ask a separate question. Feature selection is a big topic. – charles.y.zheng Jul 22 '11 at 13:03
  • @user5497 The error of an estimate can be broken down into bias and variance components. Ordinary logistic regression gives asymptotically unbiased estimates of probability, i.e. in the limit of an infinite dataset, the bias component is zero (assuming the model is "correct"). However the variability of the model due to the sampling of the training data can be high for small datasets. Regularised models trade-off bias for reduced variability, and hence can achieve better estimates of probability for small datasets (or equivalently datasets with many features) than the standard model. – Dikran Marsupial Jul 22 '11 at 13:42