Most Popular

1500 questions
34 votes
5 answers

How to split dataset for time-series prediction?

I have historic sales data from a bakery (daily, over 3 years). Now I want to build a model to predict future sales (using features like weekday, weather variables, etc.). How should I split the dataset for fitting and evaluating the models? Does…
tobip
34 votes
10 answers

How to represent an unbounded variable as number between 0 and 1

I want to represent a variable as a number between 0 and 1. The variable is a non-negative integer with no inherent bound. I map 0 to 0 but what can I map to 1 or numbers between 0 and 1? I could use the history of that variable to provide the…
Russell Gallop
34 votes
2 answers

Satterthwaite vs. Kenward-Roger approximations for the degrees of freedom in mixed models

The lmerTest package provides an anova() function for linear mixed models with either Satterthwaite's (default) or Kenward-Roger's approximation of the degrees of freedom (df). What is the difference between these two approaches? When to choose…
doko
34 votes
1 answer

Relation between variational Bayes and EM

I read somewhere that the Variational Bayes method is a generalization of the EM algorithm. Indeed, the iterative parts of the algorithms are very similar. In order to test whether the EM algorithm is a special version of Variational Bayes, I tried…
Ufuk Can Bicici
34 votes
6 answers

Data mining: How should I go about finding the functional form?

I'm curious about repeatable procedures that can be used to discover the functional form of the function y = f(A, B, C) + error_term where my only input is a set of observations (y, A, B and C). Please note that the functional form of f is…
33 votes
5 answers

What are the relative merits of Winsorizing vs. Trimming data?

Winsorizing data means to replace the extreme values of a data set with a certain percentile value from each end, while Trimming or Truncating involves removing those extreme values. I always see both methods discussed as a viable option to lessen…
Brian
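The distinction drawn in the excerpt above (Winsorizing replaces extreme values, trimming removes them) can be illustrated with a minimal Python sketch. It is not taken from the question or its answers. It assumes a count-based rule, where the `k` smallest and `k` largest values are treated as extreme, rather than a percentile-based one; the function names `winsorize` and `trim` and the sample data are hypothetical.

```python
def winsorize(values, k):
    """Replace the k smallest and k largest values with the nearest retained value.

    Assumption: "extreme" means the k lowest and k highest observations.
    """
    s = sorted(values)
    lo, hi = s[k], s[-k - 1]          # boundary values that are kept
    return [min(max(v, lo), hi) for v in values]


def trim(values, k):
    """Drop the k smallest and k largest values entirely."""
    s = sorted(values)
    return s[k:len(s) - k]


# Hypothetical data with one obvious outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

winsorized = winsorize(data, 1)   # same length as data, extremes pulled inward
trimmed = trim(data, 1)           # two observations shorter than data
```

Winsorizing keeps the sample size unchanged, while trimming shrinks it, which is one of the practical differences the question asks about.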
33 votes
5 answers

Why do political polls have such large sample sizes?

When I watch the news I've noticed that the Gallup polls for things like presidential elections have [I assume random] sample sizes of well over 1,000. What I remember from college statistics is that a sample size of 30 was a "significantly…
samplesize999
33 votes
3 answers

How to interpret the dendrogram of a hierarchical cluster analysis

Consider the R example below: plot( hclust(dist(USArrests), "ave") ) What exactly does the y-axis "Height" mean? Looking at North Carolina and California (rather on the left). Is California "closer" to North Carolina than Arizona? Can I make this…
Richi W
33 votes
6 answers

What would a robust Bayesian model for estimating the scale of a roughly normal distribution be?

There exist a number of robust estimators of scale. A notable example is the median absolute deviation which relates to the standard deviation as $\sigma = \mathrm{MAD}\cdot1.4826$. In a Bayesian framework there exist a number of ways to robustly…
Rasmus Bååth
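The relation $\sigma = \mathrm{MAD}\cdot1.4826$ quoted in the excerpt above can be sketched in a few lines of Python. This only illustrates the classical (non-Bayesian) estimator mentioned in the question, not the Bayesian model being asked for. The function name `mad_sigma` and the sample data are hypothetical, and the constant 1.4826 makes the estimate consistent with the standard deviation only when the bulk of the data is roughly normal.

```python
import statistics


def mad_sigma(values, scale=1.4826):
    """Estimate the standard deviation as scale * median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return scale * mad


# Hypothetical data with one obvious outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

robust_scale = mad_sigma(data)             # barely affected by the outlier
classical_scale = statistics.stdev(data)   # dominated by the outlier
```

On data like this, the MAD-based estimate stays on the order of the spread of the bulk of the values, while the sample standard deviation is inflated by the single extreme point.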
33 votes
8 answers

Replacing outliers with mean

This question was asked by my friend who is not internet savvy. I've no statistics background and I've been searching around the internet for this question. The question is: is it possible to replace outliers with the mean value? If it's possible, is…
Alun
33 votes
5 answers

How to change data between wide and long formats in R?

You can have data in wide format or in long format. This is quite an important thing, as the usable methods are different, depending on the format. I know you have to work with melt() and cast() from the reshape package, but there seem to be some things…
Mien
33 votes
3 answers

Why not report the mean of a bootstrap distribution?

When one bootstraps a parameter to get the standard error we get a distribution of the parameter. Why don't we use the mean of that distribution as a result or estimate for the parameter we are trying to get? Shouldn't the distribution approximate…
33 votes
4 answers

How do I fit a multilevel model for over-dispersed Poisson outcomes?

I want to fit a multilevel GLMM with a Poisson distribution (with over-dispersion) using R. At the moment I am using lme4 but I noticed that recently the quasipoisson family was removed. I've seen elsewhere that you can model additive…
33 votes
2 answers

Drawing from Dirichlet distribution

Let's say we have a Dirichlet distribution with $K$-dimensional vector parameter $\vec\alpha = [\alpha_1, \alpha_2,...,\alpha_K]$. How can I draw a sample (a $K$-dimensional vector) from this distribution? I need a (possibly) simple explanation.
user1315305
33 votes
8 answers

What math subjects would you suggest to prepare for data mining and machine learning?

I'm trying to put together a self-directed math curriculum to prepare for learning data mining and machine learning. This is motivated by starting Andrew Ng's machine learning class on Coursera and feeling that before proceeding I needed to improve…