Most Popular

1500 questions
90
votes
1 answer

When to use an offset in a Poisson regression?

Does anybody know why an offset is used in a Poisson regression? What does it achieve?
MarkDollar
89
votes
1 answer

What correlation makes a matrix singular and what are implications of singularity or near-singularity?

I am doing some calculations on different matrices (mainly in logistic regression) and I commonly get the error "Matrix is singular", where I have to go back and remove the correlated variables. My question here is what would you consider a "highly"…
Error404
89
votes
4 answers

How to produce a pretty plot of the results of k-means cluster analysis?

I'm using R to do K-means clustering. I'm using 14 variables to run K-means. What is a pretty way to plot the results of K-means? Are there any existing implementations? Does having 14 variables complicate plotting the results? I found something…
89
votes
5 answers

How to plot ROC curves in multiclass classification?

In other words, instead of having a two class problem I am dealing with 4 classes and still would like to assess performance using AUC.
CLOCK
89
votes
5 answers

On the importance of the i.i.d. assumption in statistical learning

In statistical learning, implicitly or explicitly, one always assumes that the training set $\mathcal{D} = \{ \bf {X}, \bf{y} \}$ is composed of $N$ input/response tuples $({\bf{X}}_i,y_i)$ that are independently drawn from the same joint…
Quantuple
89
votes
5 answers

Relationship between Poisson and exponential distribution

The waiting times for a Poisson process follow an exponential distribution with parameter lambda. But I don't understand it. The Poisson distribution models the number of arrivals per unit of time, for example. How is this related to the exponential distribution? Let's say…
user862
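As a short sketch of the link this question asks about, assume arrivals form a Poisson process with rate $\lambda$, let $N(t)$ be the number of arrivals in $[0, t]$ and $T$ the waiting time until the first arrival (the symbols $N(t)$ and $T$ are introduced here for illustration and are not from the question). Waiting longer than $t$ is the same event as seeing no arrivals up to time $t$:

$$
P(T > t) = P\big(N(t) = 0\big) = \frac{(\lambda t)^0 e^{-\lambda t}}{0!} = e^{-\lambda t},
\qquad
f_T(t) = \frac{d}{dt}\left(1 - e^{-\lambda t}\right) = \lambda e^{-\lambda t}, \quad t \ge 0 .
$$

This is the exponential density with rate $\lambda$, so counting arrivals and timing the gaps between them are two views of the same process.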
89
votes
10 answers

How should outliers be dealt with in linear regression analysis?

Oftentimes a statistical analyst is handed a dataset and asked to fit a model using a technique such as linear regression. Very frequently the dataset is accompanied by a disclaimer similar to "Oh yeah, we messed up collecting some of these…
Sharpie
89
votes
10 answers

What is a complete list of the usual assumptions for linear regression?

What are the usual assumptions for linear regression? Do they include: a linear relationship between the independent and dependent variable; independent errors; normal distribution of errors; homoscedasticity? Are there any others?
tony
89
votes
2 answers

Resampling / simulation methods: Monte Carlo, bootstrapping, jackknifing, cross-validation, randomization tests, and permutation tests

I am trying to understand the difference between different resampling methods (Monte Carlo simulation, parametric bootstrapping, non-parametric bootstrapping, jackknifing, cross-validation, randomization tests, and permutation tests) and their…
Ram Sharma
88
votes
8 answers

When is unbalanced data really a problem in Machine Learning?

We already had multiple questions about unbalanced data when using logistic regression, SVM, decision trees, bagging and a number of other similar questions, which makes it a very popular topic! Unfortunately, each of the questions seems to be…
Tim
88
votes
24 answers

Rules of thumb for "modern" statistics

I like G van Belle's book on Statistical Rules of Thumb, and to a lesser extent Common Errors in Statistics (and How to Avoid Them) from Phillip I Good and James W. Hardin. They address common pitfalls when interpreting results from experimental and…
chl
88
votes
3 answers

What is the lasso in regression analysis?

I'm looking for a non-technical definition of the lasso and what it is used for.
Paul Vogt
88
votes
7 answers

Calculating the parameters of a Beta distribution using the mean and variance

How can I calculate the $\alpha$ and $\beta$ parameters for a Beta distribution if I know the mean and variance that I want the distribution to have? Examples of an R command to do this would be most helpful.
Dave Kincaid
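One way to approach this is the method of moments: solve the Beta mean $\mu = \alpha/(\alpha+\beta)$ and variance $\sigma^2 = \alpha\beta/\big((\alpha+\beta)^2(\alpha+\beta+1)\big)$ for $\alpha$ and $\beta$. The sketch below is in Python rather than the R the question asks for, and the function name `beta_params_from_moments` is made up for illustration. It assumes $0 < \mu < 1$ and $0 < \sigma^2 < \mu(1-\mu)$, since no Beta distribution exists outside that range.

```python
def beta_params_from_moments(mean, var):
    """Return (alpha, beta) of a Beta distribution with the given mean and variance.

    Hypothetical helper for illustration: a method-of-moments inversion,
    not taken from any particular library.
    """
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("variance must lie strictly between 0 and mean * (1 - mean)")

    # alpha + beta equals mean * (1 - mean) / var - 1
    total = mean * (1.0 - mean) / var - 1.0
    alpha = mean * total
    beta = (1.0 - mean) * total
    return alpha, beta


if __name__ == "__main__":
    # Beta(2, 3) has mean 0.4 and variance 0.04, so this should recover roughly (2, 3).
    print(beta_params_from_moments(0.4, 0.04))
```

The same two lines of arithmetic translate directly to R or any other language; the range check matters because a variance at or above $\mu(1-\mu)$ would give non-positive parameters.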
88
votes
6 answers

How to tell if data is "clustered" enough for clustering algorithms to produce meaningful results?

How would you know if your (high dimensional) data exhibits enough clustering so that results from k-means or another clustering algorithm are actually meaningful? For the k-means algorithm in particular, how much of a reduction in within-cluster variance…
xuexue
87
votes
3 answers

Shape of confidence interval for predicted values in linear regression

I have noticed that the confidence interval for predicted values in a linear regression tends to be narrow around the mean of the predictor and wide around the minimum and maximum values of the predictor. This can be seen in plots of these 4 linear…
luciano