Most Popular

1500 questions
35 votes · 11 answers

Why does generating 8 random bits give a uniform distribution on [0, 255]?

I am generating 8 random bits (either a 0 or a 1) and concatenating them together to form an 8-bit number. A simple Python simulation yields a uniform distribution on the discrete set [0, 255]. I am trying to justify why this makes sense in my…
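A minimal sketch of the simulation described above (pure Python; the `random_byte` helper is hypothetical): each of the 256 possible bit patterns occurs with probability (1/2)^8 = 1/256, which is exactly the uniform distribution on {0, …, 255}.

```python
import random

# Concatenate 8 independent fair bits into one integer in [0, 255].
def random_byte(rng=random):
    value = 0
    for _ in range(8):
        value = (value << 1) | rng.randint(0, 1)
    return value

# Quick simulation: with independent fair bits, every 8-bit pattern is
# equally likely, so all 256 bins should be hit roughly n/256 times.
random.seed(0)
n = 256_000
counts = [0] * 256
for _ in range(n):
    counts[random_byte()] += 1
```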
35 votes · 3 answers

Which algorithms need feature scaling, besides SVM?

I am working with many algorithms: RandomForest, DecisionTrees, NaiveBayes, SVM (kernel=linear and rbf), KNN, LDA and XGBoost. All of them were pretty fast except for SVM. That is when I learned that SVM needs feature scaling to run faster. Then…
Aizzaac (989)
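A numpy-only sketch of what scaling does (hypothetical data; this is the standardization scikit-learn's StandardScaler performs): distance- and margin-based methods (SVM, KNN) and gradient-descent solvers are sensitive to feature scale, while tree-based models (DecisionTrees, RandomForest, XGBoost) split on per-feature thresholds and are scale-invariant.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on wildly different scales: the second would dominate any
# euclidean distance or dot product before scaling.
X = rng.normal(size=(200, 2)) * np.array([1.0, 1000.0])

# Standardize: zero mean, unit variance per column.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```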
35 votes · 3 answers

How to interpret root mean squared error (RMSE) vs standard deviation?

Let's say I have a model that gives me projected values. I calculate the RMSE of those values, and then the standard deviation of the actual values. Does it make any sense to compare those two values? What I think is, if RMSE and…
jkim19 (451)
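A small numpy sketch (made-up numbers): the standard deviation of the actuals equals the RMSE of a baseline model that always predicts the mean, so an RMSE below that figure means the model beats the constant-mean predictor.

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.5, 10.0, 11.0])
predicted = np.array([2.5, 5.0, 7.3, 10.4, 10.2])

rmse = np.sqrt(np.mean((actual - predicted) ** 2))

# Std of the actuals = RMSE of the naive model that always predicts the
# mean of the actuals, so it serves as a natural baseline.
baseline_rmse = np.std(actual)
```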
35 votes · 3 answers

What is the most accurate way of determining an object's color?

I have written a computer program that can detect coins in a static image (.jpeg, .png, etc.) using some standard techniques for computer vision (Gaussian Blur, thresholding, Hough-Transform etc.). Using the ratios of the coins picked up from a…
MoonKnight (707)
35 votes · 2 answers

Dropping one of the columns when using one-hot encoding

My understanding is that in machine learning it can be a problem if your dataset has highly correlated features, as they effectively encode the same information. Recently someone pointed out that when you do one-hot encoding on a categorical…
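A minimal pandas sketch of the collinearity in question: with all k dummy columns, every row sums to 1, so any one column is perfectly predictable from the rest (the "dummy variable trap"); `drop_first=True` removes the redundancy, and the dropped level becomes the implicit baseline.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full one-hot: k columns whose rows always sum to 1 -- exact collinearity.
full = pd.get_dummies(df["color"])

# Dropping one level breaks the linear dependence.
reduced = pd.get_dummies(df["color"], drop_first=True)
```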
35 votes · 5 answers

Think like a Bayesian, check like a frequentist: what does that mean?

I am looking at some lecture slides on a data science course which can be found here: https://github.com/cs109/2015/blob/master/Lectures/01-Introduction.pdf I, unfortunately, cannot see the video for this lecture and at one point on the slide, the…
Luca (4,410)
35 votes · 5 answers

What are the dangers of violating the homoscedasticity assumption for linear regression?

As an example, consider the ChickWeight data set in R. The variance obviously grows over time, so if I use a simple linear regression like: m <- lm(weight ~ Time*Diet, data=ChickWeight) My questions: Which aspects of the model will be…
Dan M. (830)
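A simulated numpy sketch (hypothetical data mimicking ChickWeight's fan-out over time): under heteroscedasticity the OLS slope remains unbiased, but the usual constant-variance standard errors, and hence confidence intervals and t-tests, are unreliable; weighted least squares is one standard remedy.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(1, 10, 500)
# Noise standard deviation grows with x, like weights spreading over time.
y = 2.0 * x + rng.normal(scale=0.5 * x)

# OLS: the slope estimate is still unbiased under heteroscedasticity...
slope, intercept = np.polyfit(x, y, 1)

# ...but inference based on constant-variance SEs is not. Weighted least
# squares downweights the noisy points; np.polyfit's w multiplies the
# residuals, so pass 1/sigma_i (here sigma_i is proportional to x).
slope_wls, intercept_wls = np.polyfit(x, y, 1, w=1.0 / x)
```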
35 votes · 5 answers

Why do some people use -999 or -9999 to replace missing values?

I have a dataset with lots of missing values. In some columns the missing value was replaced with -999, but in other columns it was marked as 'NA'. Why would we use -999 to represent a missing value?
qqqwww (493)
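A short pandas sketch (made-up file contents): legacy numeric formats often had no way to store a true NA, so an impossible sentinel like -999 was used instead. The danger is that the sentinel silently poisons means and models unless the parser is told about it.

```python
import pandas as pd
from io import StringIO

csv = StringIO("age,income\n34,52000\n-999,48000\n29,-999\n")

# Declare the sentinels so they become real NaN instead of fake numbers.
df = pd.read_csv(csv, na_values=[-999, "NA"])
```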
35 votes · 3 answers

Classification/evaluation metrics for highly imbalanced data

I deal with a fraud detection (credit-scoring-like) problem. As such there is a highly imbalanced relation between fraudulent and non-fraudulent observations. http://blog.revolutionanalytics.com/2016/03/com_class_eval_metrics_r.html provides a great…
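A tiny numpy sketch of why plain accuracy misleads on this kind of data (made-up 2% fraud rate): the classifier that never flags fraud looks excellent by accuracy while catching nothing, which is why recall, precision, F1, or PR-AUC are preferred for imbalanced problems.

```python
import numpy as np

# 1,000 transactions, 2% fraudulent.
y_true = np.zeros(1000, dtype=int)
y_true[:20] = 1

# The useless majority-class classifier: always predict "not fraud".
y_pred = np.zeros(1000, dtype=int)

accuracy = (y_pred == y_true).mean()   # looks great
recall = y_pred[y_true == 1].mean()    # catches no fraud at all
```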
35 votes · 2 answers

Interpretation of biplots in principal components analysis

I came across this nice tutorial: A Handbook of Statistical Analyses Using R. Chapter 13. Principal Component Analysis: The Olympic Heptathlon on how to do PCA in R language. I don't understand the interpretation of Figure 13.3: So I am plotting…
user862 (2,339)
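A numpy-only sketch of what the biplot arrows encode (hypothetical heptathlon-like data): the points are the PC scores, and the arrows are the variable loadings (rows of V^T from the SVD of the centered data); correlated variables point in similar directions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two correlated events (like hurdles and the 200m) plus one unrelated one.
hurdles = rng.normal(size=100)
run200m = hurdles * 0.9 + rng.normal(scale=0.3, size=100)
javelin = rng.normal(size=100)
X = np.column_stack([hurdles, run200m, javelin])

# PCA via SVD of the centered data; Vt's rows are the loading vectors a
# biplot draws as arrows.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]  # dominated by the correlated pair, not by javelin
```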
35 votes · 1 answer

XGBoost Loss function Approximation With Taylor Expansion

As an example, take the objective function of the XGBoost model at the $t$-th iteration: $$\mathcal{L}^{(t)}=\sum_{i=1}^n\ell(y_i,\hat{y}_i^{(t-1)}+f_t(\mathbf{x}_i))+\Omega(f_t)$$ where $\ell$ is the loss function, $f_t$ is the $t$-th tree output…
Alex R. (13,097)
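For reference, the approximation in question replaces $\ell$ with its second-order Taylor expansion around the previous prediction $\hat{y}_i^{(t-1)}$ (this is the form given in the XGBoost paper):

$$\mathcal{L}^{(t)}\simeq\sum_{i=1}^n\left[\ell(y_i,\hat{y}_i^{(t-1)})+g_i\,f_t(\mathbf{x}_i)+\tfrac{1}{2}h_i\,f_t^2(\mathbf{x}_i)\right]+\Omega(f_t)$$

where $g_i=\partial_{\hat{y}^{(t-1)}}\ell(y_i,\hat{y}^{(t-1)})$ and $h_i=\partial^2_{\hat{y}^{(t-1)}}\ell(y_i,\hat{y}^{(t-1)})$ are the first and second derivatives of the loss. The first term is constant in $f_t$, so each boosting step minimizes a simple per-leaf quadratic in the tree output.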
35 votes · 2 answers

10-fold Cross-validation vs leave-one-out cross-validation

I'm doing nested cross-validation. I have read that leave-one-out cross-validation can be biased (I don't remember why). Is it better to use 10-fold cross-validation or leave-one-out cross-validation, apart from the longer runtime of leave-one-out…
machinery (1,474)
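A minimal scikit-learn sketch of the bookkeeping difference (assuming sklearn is available; data is a placeholder): the usual argument is that LOOCV's n training sets are nearly identical, so its n per-point error estimates are highly correlated, inflating the variance of their average, while 10-fold trades a little extra bias for less variance and far fewer fits.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(100).reshape(-1, 1)  # placeholder dataset of 100 samples

# 10-fold: 10 fits, each trained on 90% of the data.
kf_splits = list(KFold(n_splits=10).split(X))

# LOOCV: n fits, each trained on n-1 points -- nearly identical training
# sets, hence highly correlated error estimates.
loo_splits = list(LeaveOneOut().split(X))
```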
35 votes · 2 answers

Is cosine similarity identical to L2-normalized Euclidean distance?

By identical I mean that it will produce identical results for a similarity ranking between a vector u and a set of vectors V. I have a vector space model which has distance measure (euclidean distance, cosine similarity) and normalization technique…
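A quick numpy check of the claimed equivalence (random vectors): for unit vectors, ||u − v||² = 2 − 2·cos(u, v), so euclidean distance on L2-normalized vectors is a monotone decreasing function of cosine similarity and therefore produces the same ranking.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=8)
V = rng.normal(size=(5, 8))

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

cos_sim = unit(V) @ unit(u)                       # cosine similarity to each row
eucl = np.linalg.norm(unit(V) - unit(u), axis=1)  # distance after L2-normalizing

# ||u - v||^2 = 2 - 2*cos(u, v) for unit vectors: same ranking either way.
```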
35 votes · 5 answers

What is the difference between "mean value" and "average"?

Wikipedia explains: For a data set, the mean is the sum of the values divided by the number of values. This definition however corresponds to what I call "average" (at least that's what I remember learning). Yet Wikipedia once more quotes: There…
neydroydrec (581)
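A short numpy illustration of the distinction (illustrative numbers): for a plain list of values the two terms coincide in the arithmetic mean, while "average" is the broader informal family — it can also refer to a weighted mean, or loosely to the median or mode.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# For a plain data set, "mean" and "average" are the same number.
mean_val = np.mean(values)
avg_val = np.average(values)

# np.average additionally supports weights, one sense in which "average"
# is the more general term.
weighted = np.average(values, weights=[4, 1, 1, 1])
```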
35 votes · 4 answers

Ensemble of different kinds of regressors using scikit-learn (or any other python framework)

I am trying to solve a regression task. I found that 3 models work nicely for different subsets of the data: LassoLARS, SVR and Gradient Tree Boosting. I noticed that when I make predictions using all these 3 models and then make a table of…
Maksim Khaitovich (658)
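A minimal stacking sketch with scikit-learn (StackingRegressor, available in sklearn ≥ 0.22; synthetic data and hyperparameters are placeholders): the three base models' predictions become the features of a final meta-learner, which is one standard way to blend regressors that excel on different subsets of the data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LassoLars, LinearRegression
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("lasso", LassoLars(alpha=0.1)),
        ("svr", SVR()),
        ("gbr", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=LinearRegression(),  # learns the blending weights
)
stack.fit(X, y)
```

An alternative is to hand-roll the blend: generate out-of-fold predictions from each base model and fit any regressor on that 3-column table, which is what StackingRegressor automates internally.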