Most Popular

1500 questions
33 votes, 4 answers

How to measure smoothness of a time series in R?

Is there a good way to measure the smoothness of a time series in R? For example, -1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1.0 is much smoother than -1, 0.8, -0.6, 0.4, -0.2, 0, 0.2, -0.4, 0.6, -0.8, 1.0, although they have the same mean and…
agmao
33 votes, 4 answers

Independent variable = Random variable?

I'm slightly confused about whether an independent variable (also called a predictor or feature) in a statistical model, for example the $X$ in the linear regression $Y=\beta_0+\beta_1 X$, is a random variable.
l7ll7
33 votes, 1 answer

What are the properties of a half Cauchy distribution?

I am currently working on a problem where I need to develop a Markov chain Monte Carlo (MCMC) algorithm for a state space model. To be able to solve the problem, I have been given the following probability of $\tau$: p($\tau$) =…
33 votes, 3 answers

Is there a Project Euler-alike for machine learning?

I found Project Euler http://projecteuler.net/ to be incredibly useful in learning programming languages. Is there a similar site for Machine Learning? I did see http://www.kaggle.com/, but it is not nearly as accessible to beginners as Project…
B Seven
33 votes, 3 answers

Is whitening always good?

A common pre-processing step for machine learning algorithms is whitening of data. It seems like it is always good to do whitening since it de-correlates the data, making it simpler to model. When is whitening not recommended? Note: I'm referring to…
Ran
33 votes, 3 answers

In boosting, why are the learners "weak"?

See also a similar question on stats.SE. In boosting algorithms such as AdaBoost and LPBoost it is known that the "weak" learners to be combined only have to perform better than chance to be useful, from Wikipedia: The classifiers it uses can be…
tdc
33 votes, 3 answers

Do we need a test set when using k-fold cross-validation?

I've been reading about k-fold validation, and I want to make sure I understand how it works. I know that for the holdout method, the data is split into three sets, and the test set is only used at the very end to assess the performance of the…
b_pcakes
33 votes, 1 answer

How to train and validate a neural network model in R?

I am new to modeling with neural networks, but I managed to establish a neural network with all available data points that fits the observed data well. The neural network was done in R with the nnet package: require(nnet) ##33.8 is the highest…
Strohmi
33 votes, 4 answers

Optimising for Precision-Recall curves under class imbalance

I have a classification task where I have a number of predictors (one of which is the most informative), and I am using the MARS model to construct my classifier (I am interested in any simple model, and using glms for illustrative purposes would be…
33 votes, 4 answers

How to create an arbitrary covariance matrix

For example, in R, the MASS::mvrnorm() function is useful for generating data to demonstrate various things in statistics. It takes a mandatory Sigma argument which is a symmetric matrix specifying the covariance matrix of the variables. How would…
rsl
33 votes, 2 answers

Understanding bias-variance tradeoff derivation

I am reading the chapter on the bias-variance tradeoff in The Elements of Statistical Learning and I don't understand the formula on page 29. Let the data arise from a model such that $$ Y = f(x)+\varepsilon$$ where $\varepsilon$ is a random number…
33 votes, 8 answers

Is there a plateau-shaped distribution?

I am looking for a distribution where the probability density decreases quickly after some point away from the mean, or in my own words a "plateau-shaped distribution". Something in between the Gaussian and the uniform.
dontloo
33 votes, 2 answers

How to model non-negative zero-inflated continuous data?

I'm currently trying to apply a linear model (family = gaussian) to an indicator of biodiversity that cannot take values lower than zero, is zero-inflated and is continuous. Values range from 0 to a little over 0.25. As a consequence, there is quite…
33 votes, 3 answers

When to use fixed effects vs using cluster SEs?

Suppose you have a single cross-section of data where individuals are located within groups (e.g. students within schools) and you wish to estimate a model of the form Y_i = a + B*X_i, where X is a vector of individual-level characteristics and a a…
33 votes, 3 answers

Why is variable selection necessary?

Common data-based variable selection procedures (for example, forward, backward, stepwise, all subsets) tend to yield models with undesirable properties, including: Coefficients biased away from zero. Standard errors that are too small and…
user7322