Most Popular
1500 questions
43
votes
5 answers
Fake uniform random numbers: More evenly distributed than true uniform data
I'm looking for a way to generate random numbers that appear to be uniformly distributed (every test will show them to be uniform), except that they are more evenly distributed than true uniform data.
The problem I have with the "true" uniform…
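
One way to get points that are spread more evenly than independent uniform draws is stratified (jittered) sampling: split [0, 1) into n equal bins, draw one point inside each bin, then shuffle. The sketch below is only an illustration of that idea, not a method taken from the question or its answers; the function name `stratified_uniform` is hypothetical, and such points may well look "too uniform" to some statistical tests.

```python
import random


def stratified_uniform(n, rng=None):
    """Return n points in [0, 1), exactly one per bin of width 1/n, in random order."""
    rng = rng or random.Random()
    # Bin i covers [i/n, (i+1)/n); add a uniform jitter inside each bin.
    points = [(i + rng.random()) / n for i in range(n)]
    # Shuffle so the order of the points carries no information about the bins.
    rng.shuffle(points)
    return points


if __name__ == "__main__":
    sample = stratified_uniform(10, random.Random(0))
    print(sorted(sample))
```

Sorting the output shows one value per tenth of the unit interval, which plain `random.random()` draws would not guarantee.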

Has QUIT--Anony-Mousse
- 39,639
- 7
- 61
- 96
43
votes
4 answers
Is it possible to give variable sized images as input to a convolutional neural network?
Can we give images with variable size as input to a convolutional neural network for object detection? If possible, how can we do that?
But if we try to crop the image, we will be losing some portion of the image, and if we try to resize, then the…

Ashna Eldho
- 531
- 1
- 4
- 4
43
votes
8 answers
How do I get people to take better care of data?
My workplace has employees from a very wide range of disciplines, so we generate data in lots of different forms. Consequently, each team has developed its own system for storing data. Some use Access or SQL databases; some teams (to my horror)…

Richie Cotton
- 644
- 9
- 15
43
votes
6 answers
How to quasi match two vectors of strings (in R)?
I am not sure how this should be termed, so please correct me if you know a better term.
I've got two lists: one of 55 items (e.g., a vector of strings), the other of 92. The item names are similar but not identical.
I wish to find the best…

Tal Galili
- 19,935
- 32
- 133
- 195
43
votes
6 answers
Why do I get a 100% accuracy decision tree?
I'm getting 100% accuracy for my decision tree. What am I doing wrong?
This is my code:
import pandas as pd
import json
import numpy as np
import sklearn
import matplotlib.pyplot as plt
data =…

Nadjla
- 441
- 1
- 4
- 4
43
votes
2 answers
If only prediction is of interest, why use lasso over ridge?
On page 223 in An Introduction to Statistical Learning, the authors summarise the differences between ridge regression and lasso. They provide an example (Figure 6.9) of when "lasso tends to outperform ridge regression in terms of bias, variance,…

Oliver Angelil
- 1,129
- 1
- 11
- 24
43
votes
2 answers
Who invented stochastic gradient descent?
I'm trying to understand the history of gradient descent and stochastic gradient descent. Gradient descent was invented by Cauchy in 1847: Méthode générale pour la résolution des systèmes d'équations simultanées, pp. 536–538. For more information…

DaL
- 4,462
- 3
- 16
- 27
43
votes
5 answers
How to perform two-sample t-tests in R by inputting sample statistics rather than the raw data?
Let's say we have the statistics given below
gender  mean      sd         n
f       1.666667  0.5773503  3
m       4.500000  0.5773503  4
How do you perform a two-sample t-test (to see if there is a significant difference between the means of men and women in some variable)…
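
As a worked illustration using only the statistics in the table above, the Welch (unequal-variance) form of the test statistic can be computed by hand. Choosing Welch's test is an assumption here; the excerpt does not say which variant the asker wants. With $s_f^2 = s_m^2 = 0.5773503^2 \approx 0.3333$:

$$t = \frac{\bar{x}_f - \bar{x}_m}{\sqrt{s_f^2/n_f + s_m^2/n_m}} = \frac{1.666667 - 4.5}{\sqrt{0.3333/3 + 0.3333/4}} \approx \frac{-2.8333}{0.4410} \approx -6.43$$

$$\nu = \frac{\left(s_f^2/n_f + s_m^2/n_m\right)^2}{\dfrac{(s_f^2/n_f)^2}{n_f - 1} + \dfrac{(s_m^2/n_m)^2}{n_m - 1}} = \frac{(0.1944)^2}{\dfrac{(0.1111)^2}{2} + \dfrac{(0.0833)^2}{3}} \approx 4.45$$

Because the two standard deviations are equal, the pooled-variance test gives the same $t \approx -6.43$, with $n_f + n_m - 2 = 5$ degrees of freedom.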

Alby
- 2,103
- 3
- 19
- 22
43
votes
9 answers
Correlation does not imply causation; but what about when one of the variables is time?
I know this question has been asked a billion times, and after looking online, I am fully convinced that correlation between two variables does not imply causation. In one of my stats lectures today, we had a guest lecture from a physicist, on the…

Thomas Moore
- 1,375
- 10
- 17
43
votes
3 answers
Variance of $K$-fold cross-validation estimates as $f(K)$: what is the role of "stability"?
TL;DR: It appears that, contrary to oft-repeated advice, leave-one-out cross validation (LOO-CV) -- that is, $K$-fold CV with $K$ (the number of folds) equal to $N$ (the number of training observations) -- yields estimates of the generalization…

Jake Westfall
- 11,539
- 2
- 48
- 96
43
votes
10 answers
How to efficiently generate random positive-semidefinite correlation matrices?
I would like to be able to efficiently generate positive-semidefinite (PSD) correlation matrices. My method slows down dramatically as I increase the size of matrices to be generated.
Could you suggest any efficient solutions? If you are aware of…
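
One simple construction, sketched below under stated assumptions, is to draw a random unit vector for each variable and take the matrix of their pairwise dot products (a Gram matrix). Such a matrix is positive semidefinite with ones on the diagonal by construction. This is not the asker's method, and it is not claimed to sample from any particular distribution over correlation matrices; the names `random_correlation_matrix` and `n_factors` are hypothetical.

```python
import math
import random


def random_correlation_matrix(dim, n_factors, rng=None):
    """Return a dim x dim PSD matrix with unit diagonal, built as a Gram matrix."""
    rng = rng or random.Random()
    rows = []
    for _ in range(dim):
        # Random direction in n_factors dimensions, scaled to unit length.
        row = [rng.gauss(0.0, 1.0) for _ in range(n_factors)]
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm for x in row])
    # Entry (i, j) is the dot product of unit vectors i and j, so it lies in [-1, 1].
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in rows] for ri in rows]


if __name__ == "__main__":
    for line in random_correlation_matrix(3, 3, random.Random(0)):
        print([round(v, 3) for v in line])
```

The cost is one pass of dot products, so it scales with `dim * dim * n_factors`; choosing `n_factors` smaller than `dim` yields a rank-deficient (but still PSD) matrix.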

Eduardas
- 2,239
- 4
- 23
- 22
43
votes
4 answers
When should I balance classes in a training data set?
I took an online course where I learned that unbalanced classes in the training data might lead to problems, because classification algorithms go for the majority rule, as it gives good results if the imbalance is too great. In an assignment one had…

Zelphir Kaltstahl
- 613
- 1
- 7
- 10
43
votes
6 answers
Neural network references (textbooks, online courses) for beginners
I want to learn Neural Networks. I am a Computational Linguist. I know statistical machine learning approaches and can code in Python.
I am looking to start with its concepts, and know one or two popular models which may be useful from a…

HIGGINS
- 479
- 8
- 12
43
votes
2 answers
Area under Precision-Recall Curve (AUC of PR-curve) and Average Precision (AP)
Is Average Precision (AP) the Area under the Precision-Recall Curve (AUC of PR-curve)?
EDIT:
Here is a comment about the difference between PR AUC and AP:
The AUC is obtained by trapezoidal interpolation of the precision. An
alternative and usually…
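
To make the distinction concrete, the sketch below computes both quantities for a small ranked list of labels. It assumes AP is the step-wise sum of precision weighted by the increase in recall, and that the trapezoidal curve starts at recall 0 with the first observed precision; both conventions are assumptions, since the quoted comment is cut off. All function names are hypothetical.

```python
def precision_recall_points(labels_ranked):
    """(recall, precision) after each item, for 0/1 labels sorted by descending score."""
    total_pos = sum(labels_ranked)
    tp = 0
    points = []
    for k, label in enumerate(labels_ranked, start=1):
        tp += label
        points.append((tp / total_pos, tp / k))
    return points


def average_precision(points):
    """Step-wise sum: each recall increase is weighted by the precision at that point."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def trapezoidal_pr_auc(points):
    """Trapezoidal area, linearly interpolating precision between consecutive points."""
    area = 0.0
    # Assumption: the curve starts at recall 0 with the first observed precision.
    prev_recall, prev_precision = 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area


if __name__ == "__main__":
    pts = precision_recall_points([1, 0, 1, 1, 0])
    print(average_precision(pts), trapezoidal_pr_auc(pts))
```

On this toy ranking the two numbers differ, which is the point of the question: the two summaries agree only when the interpolation between points does not matter.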

mrgloom
- 1,687
- 4
- 25
- 33
43
votes
4 answers
How do you use the 'test' dataset after cross-validation?
In some lectures and tutorials I've seen, they suggest splitting your data into three parts: training, validation and test. But it is not clear how the test dataset should be used, nor how this approach is better than cross-validation over the whole…

Serhiy
- 959
- 1
- 8
- 11