Most Popular
1500 questions
53
votes
5 answers
Find expected value using CDF
I'm going to start out by saying this is a homework problem straight out of the book. I have spent a couple hours looking up how to find expected values, and have determined I understand nothing.
Let $X$ have the CDF $F(x) = 1 - x^{-\alpha},…

styfle
- 737
- 1
- 8
- 12
52
votes
6 answers
Rule of thumb for number of bootstrap samples
I wonder if someone knows any general rules of thumb regarding the number of bootstrap samples one should use, based on characteristics of the data (number of observations, etc.) and/or the variables included?

hoyem
- 871
- 1
- 7
- 10
52
votes
5 answers
How are propensity scores different from adding covariates in a regression, and when are they preferred to the latter?
I admit I'm relatively new to propensity scores and causal analysis.
One thing that's not obvious to me as a newcomer is how the "balancing" using propensity scores is mathematically different from what happens when we add covariates in a…

Frank Barry
- 671
- 1
- 7
- 5
52
votes
4 answers
How to do logistic regression subset selection?
I am fitting a binomial family glm in R, and I have a whole troupe of explanatory variables, and I need to find the best subset (R-squared as a measure is fine). Short of writing a script to loop through random different combinations of the explanatory…

Leendert
- 621
- 1
- 6
- 4
52
votes
4 answers
Is there a test to determine whether GLM overdispersion is significant?
I'm creating Poisson GLMs in R. To check for overdispersion I'm looking at the ratio of residual deviance to degrees of freedom provided by summary(model.name).
Is there a cutoff value or test for this ratio to be considered "significant?" I know…

kto
- 645
- 1
- 7
- 8
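The ratio mentioned in the overdispersion question above can be sketched by hand for the simplest case. This is a minimal illustration, not the asker's model: it assumes an intercept-only Poisson model (so the fitted mean is just the sample mean and no GLM library is needed), and the function name `poisson_deviance_ratio` is hypothetical. Values well above 1 suggest overdispersion; the sketch does not supply a cutoff or a significance test.

```python
import math


def poisson_deviance_ratio(counts):
    """Residual deviance / residual df for an intercept-only Poisson model.

    Assumption: no covariates, so the maximum-likelihood fitted mean for
    every observation is the sample mean and the residual df is n - 1.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two observations")
    mu = sum(counts) / n
    if mu <= 0:
        raise ValueError("mean count must be positive")

    deviance = 0.0
    for y in counts:
        # By convention y * log(y / mu) is taken as 0 when y == 0.
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        deviance += 2.0 * (log_term - (y - mu))
    return deviance / (n - 1)


# Constant counts fit the mean exactly; a single large count inflates the ratio.
flat_ratio = poisson_deviance_ratio([3, 3, 3, 3])
spiky_ratio = poisson_deviance_ratio([0, 0, 0, 0, 20])
```

With covariates in the model, the same ratio would use the fitted means and the residual degrees of freedom reported by the fitting software instead.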
52
votes
3 answers
What is Deviance? (specifically in CART/rpart)
What is "Deviance," how is it calculated, and what are its uses in different fields in statistics?
In particular, I'm personally interested in its uses in CART (and its implementation in rpart in R).
I'm asking this since the wiki-article seems…

Tal Galili
- 19,935
- 32
- 133
- 195
52
votes
2 answers
Using lmer for repeated-measures linear mixed-effect model
EDIT 2: I originally thought I needed to run a two-factor ANOVA with repeated measures on one factor, but I now think a linear mixed-effect model will work better for my data. I think I nearly know what needs to happen, but am still confused by a few…

phosphorelated
- 743
- 2
- 7
- 9
52
votes
5 answers
R - QQPlot: how to see whether data are normally distributed
I plotted this after running a Shapiro-Wilk normality test. The test showed that it is likely that the population is normally distributed. However, how can I see this "behaviour" on the plot?
UPDATE
A simple histogram of the data:
UPDATE
The…

Le Max
- 3,559
- 9
- 26
- 26
52
votes
4 answers
Why does the correlation coefficient between X and X-Y random variables tend to be 0.7?
Taken from Practical Statistics for Medical Research, where Douglas Altman writes on page 285:
...for any two quantities X and Y, X will be correlated with X-Y.
Indeed, even if X and Y are samples of random numbers we would expect
the…

nostock
- 1,337
- 4
- 15
- 22
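The 0.7 figure in the question above can be checked with a short simulation. This is a sketch under stated assumptions, not Altman's argument: it assumes X and Y are independent with equal variance, in which case Cov(X, X-Y) = Var(X) and Var(X-Y) = 2 Var(X), so the correlation is 1/sqrt(2), roughly 0.707. The function names are hypothetical and only the standard library is used.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_corr_x_vs_diff(n=100_000, seed=0):
    """Correlation between X and X - Y for independent standard normals."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [rng.gauss(0, 1) for _ in range(n)]
    diffs = [x - y for x, y in zip(xs, ys)]
    return pearson(xs, diffs)


r = simulate_corr_x_vs_diff()
theoretical = 1 / math.sqrt(2)
```

If X and Y have unequal variances or are themselves correlated, the expected value moves away from 1/sqrt(2).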
52
votes
1 answer
Do we have to tune the number of trees in a random forest?
Software implementations of random forest classifiers have a number of parameters to allow users to fine-tune the algorithm's behavior, including the number of trees $T$ in the forest. Is this a parameter that needs to be tuned, in the same way as…

Sycorax
- 76,417
- 20
- 189
- 313
52
votes
8 answers
Excel as a statistics workbench
It seems that lots of people (including me) like to do exploratory data analysis in Excel. Some limitations, such as the number of rows allowed in a spreadsheet, are a pain but in most cases don't make it impossible to use Excel to play around with…

Carlos Accioly
- 4,715
- 4
- 25
- 28
52
votes
1 answer
CNN architectures for regression?
I've been working on a regression problem where the input is an image, and the label is a continuous value between 80 and 350. The images are of some chemicals after a reaction takes place. The color that turns out indicates the concentration of…

rodrigo-silveira
- 1,138
- 3
- 12
- 16
52
votes
6 answers
How to perform a test using R to see if data follows a normal distribution
I have a data set with the following structure:
a word | number of occurrences of a word in a document | a document id
How can I perform a test for normal distribution in R? It is probably an easy question, but I am an R newbie.

Skarab
- 987
- 4
- 11
- 14
52
votes
2 answers
Why would R return NA as a lm() coefficient?
I am fitting an lm() model to a data set that includes indicators for the financial quarter (Q1, Q2, Q3, making Q4 a default). Using lm(Y~., data = data) I get an NA as the coefficient for Q3, and a warning that one variable was excluded because of…

Fraijo
- 1,018
- 1
- 7
- 10
52
votes
1 answer
Obtaining predicted values (Y=1 or 0) from a logistic regression model fit
Let's say that I have an object of class glm (corresponding to a logistic regression model) and I'd like to turn the predicted probabilities given by predict.glm using the argument type="response" into binary responses, i.e. $Y=1$ or $Y=0$. What's…

tetragrammaton
- 1,336
- 2
- 12
- 13
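The conversion asked about in the question above amounts to thresholding the predicted probabilities. This is a minimal sketch, assuming the probabilities have already been extracted into a plain list; the 0.5 cutoff and the function name `classify` are illustrative assumptions, and a different cutoff may be more appropriate depending on the costs of each kind of error.

```python
def classify(probabilities, cutoff=0.5):
    """Map predicted probabilities to 0/1 labels using a fixed cutoff.

    Assumption: a probability equal to the cutoff is labelled 1.
    """
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must lie in [0, 1]")
    return [1 if p >= cutoff else 0 for p in probabilities]


labels = classify([0.10, 0.50, 0.93])
strict_labels = classify([0.10, 0.50, 0.93], cutoff=0.9)
```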