Questions tagged [model-evaluation]

On evaluating models, either in-sample or out-of-sample.

In-sample model evaluation techniques can be based on measures of fit or likelihood, but note that in-sample fit will typically increase spuriously as the model becomes more complex, which is called overfitting. For this reason, in-sample fit is typically penalized based on model complexity, as with adjusted $R^2$, AIC or BIC. AIC and BIC are examples of information criteria, which can also be used in-sample.
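For reference, the usual forms of these penalized criteria are (writing $\hat L$ for the maximized likelihood, $k$ for the number of estimated parameters, $n$ for the sample size, and $p$ for the number of regressors):

$$\mathrm{AIC} = 2k - 2\ln\hat L, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat L, \qquad R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}.$$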

Out-of-sample model evaluation usually relies on predictive accuracy, again measured with suitable accuracy or error metrics on held-out data. Distributional predictions can be evaluated using proper scoring rules.
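As a minimal sketch of that out-of-sample workflow (the simulated dataset and the logistic-regression model are placeholders, not a recommendation): hold out a test set, then compare a hard-label metric (accuracy) with a proper scoring rule (the Brier score) on it.

```python
# Minimal sketch: out-of-sample evaluation on a held-out test set.
# The simulated data and the logistic model are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, brier_score_loss

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hard-label accuracy (0.5 threshold) vs. Brier score on predicted probabilities.
acc = accuracy_score(y_test, model.predict(X_test))
brier = brier_score_loss(y_test, model.predict_proba(X_test)[:, 1])
print(f"accuracy={acc:.3f}  Brier score={brier:.3f}")
```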

922 questions
190
votes
10 answers

Why is accuracy not the best measure for assessing classification models?

This is a general question that was asked indirectly multiple times on here, but it lacks a single authoritative answer. It would be great to have a detailed answer to this for reference. Accuracy, the proportion of correct classifications among…
Tim
  • 108,699
  • 20
  • 212
  • 390
55
votes
7 answers

Best PCA algorithm for huge number of features (>10K)?

I previously asked this on StackOverflow, but it seems like it might be more appropriate here, given that it didn't get any answers on SO. It's kind of at the intersection between statistics and programming. I need to write some code to do PCA…
dsimcha
  • 7,375
  • 7
  • 32
  • 29
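One commonly used approach for data with many more features than samples (a sketch only, not necessarily what the answers to this question recommend) is randomized PCA, which estimates just the top components without forming the full covariance matrix; the shapes below are placeholders.

```python
# Sketch: randomized PCA on a wide matrix (placeholder random data).
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(500, 10_000))  # n_samples x n_features

pca = PCA(n_components=50, svd_solver="randomized", random_state=0)
scores = pca.fit_transform(X)          # (500, 50) projection onto top components
print(scores.shape, pca.explained_variance_ratio_[:3])
```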
55
votes
3 answers

How to select a clustering method? How to validate a cluster solution (to warrant the method choice)?

One of the biggest issues with cluster analysis is that we may end up drawing different conclusions depending on the clustering method used (including different linkage methods in hierarchical clustering). I would like to know your…
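One common internal-validation approach (a sketch assuming toy blob data and an arbitrary choice of methods and number of clusters, not a general recommendation) is to compare candidate clusterings with an index such as the silhouette score:

```python
# Sketch: compare a few clusterings of placeholder data via silhouette score.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data

candidates = [
    ("k-means", KMeans(n_clusters=4, n_init=10, random_state=0)),
    ("ward linkage", AgglomerativeClustering(n_clusters=4, linkage="ward")),
    ("average linkage", AgglomerativeClustering(n_clusters=4, linkage="average")),
]
for name, model in candidates:
    labels = model.fit_predict(X)
    print(name, silhouette_score(X, labels))
```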
47
votes
5 answers

Optimized implementations of the Random Forest algorithm

I have noticed that there are a few implementations of random forest such as ALGLIB, Waffles and some R packages like randomForest. Can anybody tell me whether these libraries are highly optimized? Are they basically equivalent to the random…
Henry B.
  • 1,479
  • 1
  • 14
  • 19
36
votes
1 answer

Cross-validation misuse (reporting performance for the best hyperparameter value)

Recently I have come across a paper that proposes using a k-NN classifier on a specific dataset. The authors used all the data samples available to perform k-fold cross-validation for different k values and report the cross-validation results of the…
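The standard remedy for this optimistic bias is nested cross-validation: tune the hyperparameter in an inner loop and estimate the performance of the whole tuning procedure in an outer loop. A minimal sketch, assuming a built-in scikit-learn dataset and an arbitrary grid of neighbour counts:

```python
# Sketch: nested cross-validation (inner loop tunes k, outer loop estimates
# the performance of the tuned procedure). Dataset and grid are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

inner = GridSearchCV(KNeighborsClassifier(),
                     param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                     cv=5)                         # inner loop: pick k
outer_scores = cross_val_score(inner, X, y, cv=5)  # outer loop: honest estimate
print(outer_scores.mean())
```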
35
votes
3 answers

Classification/evaluation metrics for highly imbalanced data

I deal with a fraud detection (credit-scoring-like) problem. As such there is a highly imbalanced relation between fraudulent and non-fraudulent observations. http://blog.revolutionanalytics.com/2016/03/com_class_eval_metrics_r.html provides a great…
30
votes
3 answers

Can AUC-ROC be between 0-0.5?

Can AUC-ROC values be between 0 and 0.5? Does the model ever output values between 0 and 0.5?
Aman
  • 533
  • 1
  • 6
  • 10
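A quick numeric illustration (with made-up labels and scores, not a substitute for the answers): AUC falls below 0.5 whenever the scores rank the classes the wrong way round, and flipping the scores reflects the AUC about 0.5.

```python
# Illustration: anti-correlated scores give AUC below 0.5 (made-up data).
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.3, 0.4])  # rank the classes backwards

print(roc_auc_score(y, scores))      # 0.0 -- worse than random guessing
print(roc_auc_score(y, 1 - scores))  # 1.0 -- same ranking, reversed
```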
27
votes
3 answers

Evaluating logistic regression and interpretation of Hosmer-Lemeshow Goodness of Fit

As we all know, there are two ways to evaluate a logistic regression model, and they test very different things. Predictive power: get a statistic that measures how well you can predict the dependent variable based on the independent…
22
votes
2 answers

Proper scoring rule when there is a decision to make (e.g. spam vs ham email)

Among others on here, Frank Harrell is adamant about using proper scoring rules to assess classifiers. This makes sense. If we have 500 $0$s with $P(1)\in[0.45, 0.49]$ and 500 $1$s with $P(1)\in[0.51, 0.55]$, we can get a perfect classifier by…
Dave
  • 28,473
  • 4
  • 52
  • 104
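A minimal numeric sketch of the scenario in the excerpt above (the probability values are simulated as described there): thresholding at 0.5 gives perfect accuracy, yet a proper scoring rule such as the Brier score still reveals that the probabilities are barely informative.

```python
# Sketch of the 500/500 example: perfect thresholded accuracy, poor Brier score.
import numpy as np
from sklearn.metrics import accuracy_score, brier_score_loss

rng = np.random.default_rng(0)
p0 = rng.uniform(0.45, 0.49, 500)   # predicted P(1) for the 500 true 0s
p1 = rng.uniform(0.51, 0.55, 500)   # predicted P(1) for the 500 true 1s
y = np.r_[np.zeros(500, dtype=int), np.ones(500, dtype=int)]
p = np.r_[p0, p1]

print(accuracy_score(y, (p > 0.5).astype(int)))  # 1.0: a "perfect" classifier
print(brier_score_loss(y, p))                    # ~0.22, near the 0.25 of always saying 0.5
```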
20
votes
3 answers

AUC and class imbalance in training/test dataset

I have just started to learn about the area under the ROC curve (AUC). I am told that AUC is not affected by data imbalance. I think this means that AUC is insensitive to imbalance in the test data, rather than imbalance in the training data. In other words, only…
Munichong
  • 1,645
  • 3
  • 15
  • 26
17
votes
2 answers

Why use Normalized Gini Score instead of AUC as evaluation?

Kaggle's competition Porto Seguro's Safe Driver Prediction uses Normalized Gini Score as evaluation metric and this got me curious about the reasons for this choice. What are the advantages of using normalized gini score instead of the most usual…
xboard
  • 1,008
  • 11
  • 17
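A brief numeric check of the relationship usually cited here (assuming binary labels and tie-free predictions; the `gini` helper below is an illustrative re-implementation, not the competition's official code): the normalized Gini score is simply $2 \cdot \mathrm{AUC} - 1$, i.e. a rescaling of AUC.

```python
# Check: competition-style normalized Gini vs. 2*AUC - 1 on made-up data.
import numpy as np
from sklearn.metrics import roc_auc_score

def gini(actual, pred):
    # Sort by descending prediction, then accumulate the actual outcomes.
    order = np.argsort(-pred, kind="mergesort")
    a = np.asarray(actual, dtype=float)[order]
    cum = np.cumsum(a) / a.sum()
    return cum.sum() / len(a) - (len(a) + 1) / (2 * len(a))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
scores = y + rng.normal(scale=2.0, size=1000)      # noisy, tie-free scores

print(gini(y, scores) / gini(y, y.astype(float)))  # normalized Gini
print(2 * roc_auc_score(y, scores) - 1)            # essentially the same number
```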
16
votes
2 answers

What is the difference between $R^2$ and variance score in Scikit-learn?

I was reading about regression metrics in the Python scikit-learn manual, and even though each one of them has its own formula, I cannot tell intuitively what the difference is between $R^2$ and the variance score, and therefore when to use one or the other…
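A small numeric illustration of the practical difference (with made-up numbers): scikit-learn's explained variance score looks only at the variance of the residuals, so a constant bias in the predictions is invisible to it, whereas $R^2$ penalizes that bias.

```python
# Illustration: a constant prediction bias lowers R^2 but not explained variance.
import numpy as np
from sklearn.metrics import r2_score, explained_variance_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = y_true + 1.0                     # systematically off by a constant

print(r2_score(y_true, y_pred))                  # 0.5 -- the bias hurts R^2
print(explained_variance_score(y_true, y_pred))  # 1.0 -- the bias is ignored
```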
16
votes
4 answers

Why isn't the holdout method (splitting data into training and testing) used in classical statistics?

In my classroom exposure to data mining, the holdout method was introduced as a way of assessing model performance. However, when I took my first class on linear models, this was not introduced as a means of model validation or assessment. My online…
15
votes
3 answers

Relationship between the phi, Matthews and Pearson correlation coefficients

Are the phi and Matthews correlation coefficients the same concept? How are they related or equivalent to the Pearson correlation coefficient for two binary variables? I assume the binary values are 0 and 1. The Pearson correlation between two…
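A quick numeric check of that equivalence (on arbitrary simulated 0/1 data): for two binary variables coded as 0 and 1, the Matthews correlation coefficient is the same number as the phi coefficient, i.e. the Pearson correlation computed on the 0/1 values.

```python
# Check: MCC equals the Pearson correlation of two 0/1 variables (made-up data).
import numpy as np
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
y_pred = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)  # noisy copy of y_true

print(matthews_corrcoef(y_true, y_pred))
print(np.corrcoef(y_true, y_pred)[0, 1])   # same value up to floating-point error
```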
15
votes
1 answer

How to Compute the Brier Score for more than Two Classes

tl;dr How do I correctly compute the Brier score for more than two classes? I got confusing results with different approaches. Details below. As suggested to me in a comment to this question, I would like to evaluate the quality of a set of…
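A minimal sketch of the most common convention (Brier's original multi-class definition: average over cases of the summed squared differences between the one-hot outcome and the forecast probabilities). Note that for two classes this gives twice the single-probability version used by, e.g., scikit-learn's `brier_score_loss`, which is one source of confusing discrepancies between approaches. The helper name and the example forecasts below are made up.

```python
# Sketch: multi-class Brier score, following Brier's original definition.
import numpy as np

def multiclass_brier(y_true, prob):
    """Mean over cases of sum_k (p_k - o_k)^2, with o the one-hot outcome."""
    prob = np.asarray(prob, dtype=float)
    onehot = np.zeros_like(prob)
    onehot[np.arange(len(y_true)), np.asarray(y_true)] = 1.0
    return np.mean(np.sum((prob - onehot) ** 2, axis=1))

# Hypothetical forecasts for a 3-class problem.
y_true = [0, 1, 2, 1]
prob = [[0.7, 0.2, 0.1],
        [0.2, 0.5, 0.3],
        [0.1, 0.3, 0.6],
        [0.6, 0.3, 0.1]]
print(multiclass_brier(y_true, prob))
```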