Most Popular

1500 questions
55
votes
6 answers

Why on average does each bootstrap sample contain roughly two thirds of observations?

I have run across the assertion that each bootstrap sample (or bagged tree) will contain on average approximately $2/3$ of the observations. I understand that the chance of not being selected in any of $n$ draws from $n$ samples with replacement is…
xyzzy
  • 823
  • 2
  • 8
  • 7
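
A minimal sketch of the arithmetic behind the question above, not taken from the thread itself: the chance that a given observation is missed in all $n$ draws is $(1 - 1/n)^n$, which approaches $e^{-1} \approx 0.368$, so a bootstrap sample is expected to contain about $63.2\%$ of the distinct observations (the "roughly two thirds" in the title). The function names, trial count, and seed below are illustrative assumptions.

```python
import math
import random


def expected_unique_fraction(n: int) -> float:
    """Exact expected share of distinct observations in one bootstrap sample of size n."""
    # Each observation is missed by all n draws with probability (1 - 1/n)^n.
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (seed is an arbitrary choice)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Draw n indices with replacement, then count how many distinct ones appear.
        sample = [rng.randrange(n) for _ in range(n)]
        total += len(set(sample)) / n
    return total / trials


if __name__ == "__main__":
    n = 1000
    print("exact:    ", expected_unique_fraction(n))
    print("limit:    ", 1.0 - math.exp(-1.0))
    print("simulated:", simulated_unique_fraction(n, trials=200))
```

The exact value and the simulated one should both sit close to $1 - e^{-1}$ for moderately large $n$.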
55
votes
4 answers

How to visualize a fitted multiple regression model?

I am currently writing a paper with several multiple regression analyses. While visualizing univariate linear regression is easy via scatter plots, I was wondering whether there is any good way to visualize multiple linear regressions? I am…
Shawn Wang
  • 1,245
  • 3
  • 12
  • 12
55
votes
5 answers

Using deep learning for time series prediction

I'm new to the area of deep learning, and my first step was to read interesting articles from the deeplearning.net site. In papers about deep learning, Hinton and others mostly talk about applying it to image problems. Can someone tell me whether it…
55
votes
4 answers

How to identify a bimodal distribution?

I understand that once we plot the values as a chart, we can identify a bimodal distribution by observing the twin-peaks, but how does one find it programmatically? (I am looking for an algorithm.)
venkasub
  • 683
  • 1
  • 6
  • 7
55
votes
11 answers

Is there a 1 in 20 or 1 in 400 chance of guessing the outcome of a d20 roll before it happens?

My friends are in a bit of an argument over Dungeons & Dragons. My player managed to guess the outcome of a D20 roll before it happened, and my friend said that his chance of guessing the number was 1 in 20. Another friend argues that his chance of…
Theguy Whatguys
  • 633
  • 2
  • 7
55
votes
4 answers

How do we decide when a small sample is statistically significant or not?

Sorry if the title isn't clear, I'm not a statistician, and am not sure how to phrase this. I was looking at the global coronavirus statistics on worldometers, and sorted the table by cases per million population to get an idea of how different…
Avrohom Yisroel
  • 673
  • 5
  • 7
55
votes
19 answers

Mathematical Statistics Videos

A question previously sought recommendations for textbooks on mathematical statistics. Does anyone know of any good online video lectures on mathematical statistics? The closest that I've found are: Machine Learning; Econometrics. UPDATE: A number…
Jeromy Anglim
  • 42,044
  • 23
  • 146
  • 250
55
votes
12 answers

Is the COVID-19 pandemic curve a Gaussian curve?

We've all heard a lot about "flattening the curve". I was wondering if these curves – that look like bells – can be qualified as Gaussian despite the fact that there is a temporal dimension.
Samos
  • 804
  • 1
  • 8
  • 17
55
votes
8 answers

Is sampling relevant in the time of 'big data'?

Or more so "will it be"? Big Data makes statistics and relevant knowledge all the more important but seems to underplay Sampling Theory. I've seen this hype around 'Big Data' and can't help but wonder why I would want to analyze everything?…
PhD
  • 13,429
  • 19
  • 45
  • 47
55
votes
4 answers

Normalization vs. scaling

What is the difference between data 'Normalization' and data 'Scaling'? Until now I thought both terms refer to the same process, but now I realize there is something more that I don't know/understand. Also if there is a difference between Normalization…
55
votes
3 answers

Where does the misconception that Y must be normally distributed come from?

Seemingly reputable sources claim that the dependent variable must be normally distributed: Model assumptions: $Y$ is normally distributed, errors are normally distributed, $e_i \sim N(0,\sigma^2)$, and independent, and $X$ is fixed, and …
colorlace
  • 1,010
  • 11
  • 25
55
votes
6 answers

How to determine best cutoff point and its confidence interval using ROC curve in R?

I have the data of a test that could be used to distinguish normal and tumor cells. According to the ROC curve, it looks good for this purpose (area under the curve is 0.9). My questions are: How to determine the cutoff point for this test and its confidence…
Yuriy Petrovskiy
  • 4,081
  • 7
  • 25
  • 30
55
votes
7 answers

Best PCA algorithm for huge number of features (>10K)?

I previously asked this on StackOverflow, but it seems like it might be more appropriate here, given that it didn't get any answers on SO. It's kind of at the intersection between statistics and programming. I need to write some code to do PCA…
dsimcha
  • 7,375
  • 7
  • 32
  • 29
55
votes
5 answers

Statistical inference when the sample "is" the population

Imagine you have to do reporting on the numbers of candidates who yearly take a given test. It seems rather difficult to infer the observed % of success, for instance, on a wider population due to the specificity of the target population. So you may…
pbneau
  • 1,161
  • 4
  • 13
  • 17
55
votes
3 answers

How to select a clustering method? How to validate a cluster solution (to warrant the method choice)?

One of the biggest issues with cluster analysis is that we may have to derive different conclusions depending on the clustering method used (including different linkage methods in hierarchical clustering). I would like to know your…