Most Popular

1500 questions
35 votes · 11 answers

Why does generating 8 random bits give a uniform distribution on [0, 255]?

I am generating 8 random bits (either a 0 or a 1) and concatenating them together to form an 8-bit number. A simple Python simulation yields a uniform distribution on the discrete set [0, 255]. I am trying to justify why this makes sense in my…
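A minimal sketch of the simulation described above (pure Python; the `random_byte` helper is hypothetical): each of the 256 possible bit patterns occurs with probability (1/2)^8 = 1/256, which is exactly the uniform distribution on {0, …, 255}.

```python
import random

# Concatenate 8 independent fair bits into one integer in [0, 255].
def random_byte(rng=random):
    value = 0
    for _ in range(8):
        value = (value << 1) | rng.randint(0, 1)
    return value

# Quick simulation: with independent fair bits, every 8-bit pattern is
# equally likely, so all 256 bins should be hit roughly n/256 times.
random.seed(0)
n = 256_000
counts = [0] * 256
for _ in range(n):
    counts[random_byte()] += 1
```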
35 votes · 3 answers

Which algorithms need feature scaling, besides SVM?

I am working with many algorithms: RandomForest, DecisionTrees, NaiveBayes, SVM (kernel=linear and rbf), KNN, LDA and XGBoost. All of them were pretty fast except for SVM. That is when I learned that SVM needs feature scaling to run faster. Then…
Aizzaac (989)
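A numpy-only sketch of what scaling does (hypothetical data; this is the standardization scikit-learn's StandardScaler performs): distance- and margin-based methods (SVM, KNN) and gradient-descent solvers are sensitive to feature scale, while tree-based models (DecisionTrees, RandomForest, XGBoost) split on per-feature thresholds and are scale-invariant.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on wildly different scales: the second would dominate any
# euclidean distance or dot product before scaling.
X = rng.normal(size=(200, 2)) * np.array([1.0, 1000.0])

# Standardize: zero mean, unit variance per column.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```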
35 votes · 3 answers

How to interpret root mean squared error (RMSE) vs standard deviation?

Let's say I have a model that gives me projected values. I calculate the RMSE of those values, and then the standard deviation of the actual values. Does it make any sense to compare those two values? What I think is, if RMSE and…
jkim19 (451)
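A small numpy sketch (made-up numbers): the standard deviation of the actuals equals the RMSE of a baseline model that always predicts the mean, so an RMSE below that figure means the model beats the constant-mean predictor.

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.5, 10.0, 11.0])
predicted = np.array([2.5, 5.0, 7.3, 10.4, 10.2])

rmse = np.sqrt(np.mean((actual - predicted) ** 2))

# Std of the actuals = RMSE of the naive model that always predicts the
# mean of the actuals, so it serves as a natural baseline.
baseline_rmse = np.std(actual)
```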
35 votes · 3 answers

What is the most accurate way of determining an object's color?

I have written a computer program that can detect coins in a static image (.jpeg, .png, etc.) using some standard techniques for computer vision (Gaussian Blur, thresholding, Hough-Transform etc.). Using the ratios of the coins picked up from a…
MoonKnight (707)
35 votes · 2 answers

Dropping one of the columns when using one-hot encoding

My understanding is that in machine learning it can be a problem if your dataset has highly correlated features, as they effectively encode the same information. Recently someone pointed out that when you do one-hot encoding on a categorical…
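A minimal pandas sketch of the collinearity in question: with all k dummy columns, every row sums to 1, so any one column is perfectly predictable from the rest (the "dummy variable trap"); `drop_first=True` removes the redundancy, and the dropped level becomes the implicit baseline.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full one-hot: k columns whose rows always sum to 1 -- exact collinearity.
full = pd.get_dummies(df["color"])

# Dropping one level breaks the linear dependence.
reduced = pd.get_dummies(df["color"], drop_first=True)
```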
35 votes · 5 answers

Think like a Bayesian, check like a frequentist: what does that mean?

I am looking at some lecture slides on a data science course which can be found here: https://github.com/cs109/2015/blob/master/Lectures/01-Introduction.pdf I, unfortunately, cannot see the video for this lecture and at one point on the slide, the…
Luca (4,410)
35 votes · 5 answers

What are the dangers of violating the homoscedasticity assumption for linear regression?

As an example, consider the ChickWeight data set in R. The variance obviously grows over time, so if I use a simple linear regression like: m <- lm(weight ~ Time*Diet, data=ChickWeight) My questions: Which aspects of the model will be…
Dan M. (830)
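A simulated numpy sketch (hypothetical data mimicking ChickWeight's fan-out over time): under heteroscedasticity the OLS slope remains unbiased, but the usual constant-variance standard errors, and hence confidence intervals and t-tests, are unreliable; weighted least squares is one standard remedy.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(1, 10, 500)
# Noise standard deviation grows with x, like weights spreading over time.
y = 2.0 * x + rng.normal(scale=0.5 * x)

# OLS: the slope estimate is still unbiased under heteroscedasticity...
slope, intercept = np.polyfit(x, y, 1)

# ...but inference based on constant-variance SEs is not. Weighted least
# squares downweights the noisy points; np.polyfit's w multiplies the
# residuals, so pass 1/sigma_i (here sigma_i is proportional to x).
slope_wls, intercept_wls = np.polyfit(x, y, 1, w=1.0 / x)
```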
35 votes · 5 answers

Why do some people use -999 or -9999 to replace missing values?

I have a dataset with lots of missing values. In some columns the missing value was replaced with -999, but in other columns it was marked as 'NA'. Why would we use -999 to represent a missing value?
qqqwww (493)
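A short pandas sketch (made-up file contents): legacy numeric formats often had no way to store a true NA, so an impossible sentinel like -999 was used instead. The danger is that the sentinel silently poisons means and models unless the parser is told about it.

```python
import pandas as pd
from io import StringIO

csv = StringIO("age,income\n34,52000\n-999,48000\n29,-999\n")

# Declare the sentinels so they become real NaN instead of fake numbers.
df = pd.read_csv(csv, na_values=[-999, "NA"])
```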
35 votes · 3 answers

Classification/evaluation metrics for highly imbalanced data

I deal with a fraud detection (credit-scoring-like) problem. As such there is a highly imbalanced relation between fraudulent and non-fraudulent observations. http://blog.revolutionanalytics.com/2016/03/com_class_eval_metrics_r.html provides a great…
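A tiny numpy sketch of why plain accuracy misleads on this kind of data (made-up 2% fraud rate): the classifier that never flags fraud looks excellent by accuracy while catching nothing, which is why recall, precision, F1, or PR-AUC are preferred for imbalanced problems.

```python
import numpy as np

# 1,000 transactions, 2% fraudulent.
y_true = np.zeros(1000, dtype=int)
y_true[:20] = 1

# The useless majority-class classifier: always predict "not fraud".
y_pred = np.zeros(1000, dtype=int)

accuracy = (y_pred == y_true).mean()   # looks great
recall = y_pred[y_true == 1].mean()    # catches no fraud at all
```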
35 votes · 2 answers

Interpretation of biplots in principal components analysis

I came across this nice tutorial: A Handbook of Statistical Analyses Using R. Chapter 13. Principal Component Analysis: The Olympic Heptathlon on how to do PCA in R language. I don't understand the interpretation of Figure 13.3: So I am plotting…
user862 (2,339)
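A numpy-only sketch of what the biplot arrows encode (hypothetical heptathlon-like data): the points are the PC scores, and the arrows are the variable loadings (rows of V^T from the SVD of the centered data); correlated variables point in similar directions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two correlated events (like hurdles and the 200m) plus one unrelated one.
hurdles = rng.normal(size=100)
run200m = hurdles * 0.9 + rng.normal(scale=0.3, size=100)
javelin = rng.normal(size=100)
X = np.column_stack([hurdles, run200m, javelin])

# PCA via SVD of the centered data; Vt's rows are the loading vectors a
# biplot draws as arrows.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]  # dominated by the correlated pair, not by javelin
```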
35 votes · 1 answer

XGBoost Loss function Approximation With Taylor Expansion

As an example, take the objective function of the XGBoost model at the $t$-th iteration: $$\mathcal{L}^{(t)}=\sum_{i=1}^n\ell(y_i,\hat{y}_i^{(t-1)}+f_t(\mathbf{x}_i))+\Omega(f_t)$$ where $\ell$ is the loss function, $f_t$ is the $t$-th tree output…
Alex R. (13,097)
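For reference, the approximation in question replaces $\ell$ with its second-order Taylor expansion around the previous prediction $\hat{y}_i^{(t-1)}$ (this is the form given in the XGBoost paper):

$$\mathcal{L}^{(t)}\simeq\sum_{i=1}^n\left[\ell(y_i,\hat{y}_i^{(t-1)})+g_i\,f_t(\mathbf{x}_i)+\tfrac{1}{2}h_i\,f_t^2(\mathbf{x}_i)\right]+\Omega(f_t)$$

where $g_i=\partial_{\hat{y}^{(t-1)}}\ell(y_i,\hat{y}^{(t-1)})$ and $h_i=\partial^2_{\hat{y}^{(t-1)}}\ell(y_i,\hat{y}^{(t-1)})$ are the first and second derivatives of the loss. The first term is constant in $f_t$, so each boosting step minimizes a simple per-leaf quadratic in the tree output.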
35 votes · 2 answers

10-fold Cross-validation vs leave-one-out cross-validation

I'm doing nested cross-validation. I have read that leave-one-out cross-validation can be biased (I don't remember why). Is it better to use 10-fold cross-validation or leave-one-out cross-validation, apart from the longer runtime of leave-one-out…
machinery (1,474)
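A minimal scikit-learn sketch of the bookkeeping difference (assuming sklearn is available; data is a placeholder): the usual argument is that LOOCV's n training sets are nearly identical, so its n per-point error estimates are highly correlated, inflating the variance of their average, while 10-fold trades a little extra bias for less variance and far fewer fits.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(100).reshape(-1, 1)  # placeholder dataset of 100 samples

# 10-fold: 10 fits, each trained on 90% of the data.
kf_splits = list(KFold(n_splits=10).split(X))

# LOOCV: n fits, each trained on n-1 points -- nearly identical training
# sets, hence highly correlated error estimates.
loo_splits = list(LeaveOneOut().split(X))
```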
35 votes · 2 answers

Is cosine similarity identical to L2-normalized Euclidean distance?

By identical I mean that it will produce identical results for a similarity ranking between a vector u and a set of vectors V. I have a vector space model which has distance measure (euclidean distance, cosine similarity) and normalization technique…
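A quick numpy check of the claimed equivalence (random vectors): for unit vectors, ||u − v||² = 2 − 2·cos(u, v), so euclidean distance on L2-normalized vectors is a monotone decreasing function of cosine similarity and therefore produces the same ranking.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=8)
V = rng.normal(size=(5, 8))

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

cos_sim = unit(V) @ unit(u)                       # cosine similarity to each row
eucl = np.linalg.norm(unit(V) - unit(u), axis=1)  # distance after L2-normalizing

# ||u - v||^2 = 2 - 2*cos(u, v) for unit vectors: same ranking either way.
```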
35 votes · 5 answers

What is the difference between "mean value" and "average"?

Wikipedia explains: For a data set, the mean is the sum of the values divided by the number of values. This definition however corresponds to what I call "average" (at least that's what I remember learning). Yet Wikipedia once more quotes: There…
neydroydrec (581)
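A short numpy illustration of the distinction (illustrative numbers): for a plain list of values the two terms coincide in the arithmetic mean, while "average" is the broader informal family — it can also refer to a weighted mean, or loosely to the median or mode.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# For a plain data set, "mean" and "average" are the same number.
mean_val = np.mean(values)
avg_val = np.average(values)

# np.average additionally supports weights, one sense in which "average"
# is the more general term.
weighted = np.average(values, weights=[4, 1, 1, 1])
```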
35 votes · 4 answers

Ensemble of different kinds of regressors using scikit-learn (or any other python framework)

I am trying to solve a regression task. I found that 3 models work nicely for different subsets of the data: LassoLARS, SVR and Gradient Tree Boosting. I noticed that when I make predictions using all these 3 models and then make a table of…
Maksim Khaitovich (658)
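A minimal stacking sketch with scikit-learn (StackingRegressor, available in sklearn ≥ 0.22; synthetic data and hyperparameters are placeholders): the three base models' predictions become the features of a final meta-learner, which is one standard way to blend regressors that excel on different subsets of the data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LassoLars, LinearRegression
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("lasso", LassoLars(alpha=0.1)),
        ("svr", SVR()),
        ("gbr", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=LinearRegression(),  # learns the blending weights
)
stack.fit(X, y)
```

An alternative is to hand-roll the blend: generate out-of-fold predictions from each base model and fit any regressor on that 3-column table, which is what StackingRegressor automates internally.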