Questions tagged [data-preprocessing]

The step of cleaning and preparing raw data for analysis in data mining

Data preprocessing is a data mining technique that involves transforming raw data into a format suitable for further analysis. Common issues that arise are inconsistencies and missing values.


Data preprocessing is used in database-driven applications such as customer relationship management and in rule-based applications. It is particularly important when implementing an artificial neural network.

459 questions
73
votes
3 answers

One-hot vs dummy encoding in Scikit-learn

There are two different ways to encode categorical variables. Say one categorical variable has n values. One-hot encoding converts it into n variables, while dummy encoding converts it into n-1 variables. If we have k categorical variables, each…
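A minimal sketch of the distinction the question describes, using pandas (the column name is illustrative): `get_dummies` produces n indicator columns by default and n-1 with `drop_first=True`.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

one_hot = pd.get_dummies(df["color"])                   # one-hot: n = 3 columns
dummy = pd.get_dummies(df["color"], drop_first=True)    # dummy: n - 1 = 2 columns

print(one_hot.shape[1])  # 3
print(dummy.shape[1])    # 2
```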
21
votes
2 answers

Does random forest need input variables to be scaled or centered?

My input variables have different dimensions. Some variables are decimal while some are hundreds. Is it essential to center (subtract mean) or scale (divide by standard deviation) these input variables in order to make the data dimensionless when…
YQ.Wang
  • 409
  • 1
  • 4
  • 11
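A quick empirical check of the usual answer (dataset and parameters here are arbitrary): tree-based models split on the ordering of feature values, not their magnitude, so standardizing the inputs should leave a seeded random forest's predictions unchanged.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)  # strictly monotonic per-feature transform

preds_raw = RandomForestClassifier(random_state=0).fit(X, y).predict(X)
preds_scaled = RandomForestClassifier(random_state=0).fit(X_scaled, y).predict(X_scaled)

# Same seed, same split structure: scaling does not change the forest's output.
print((preds_raw == preds_scaled).all())
```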
16
votes
4 answers

Cleaning data of inconsistent format in R?

I often deal with messy survey data which requires a lot of cleaning up before any statistics can be done. I used to do this "manually" in Excel, sometimes using Excel formulas, and sometimes checking entries one-by-one. I started doing more and…
mark999
  • 3,180
  • 2
  • 22
  • 31
15
votes
4 answers

Difference between preprocessing train and test set before and after splitting

Is there a difference between doing preprocessing for a dataset in sklearn before and after splitting data into train_test_split? In other words, are both of these approaches equivalent? from sklearn.preprocessing import StandardScaler from…
W.R.
  • 253
  • 1
  • 3
  • 8
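A sketch of the usual recommendation behind this question (dataset is synthetic): fit the scaler on the training split only, then apply the same fitted transform to the test split, so no test-set statistics leak into preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)   # statistics come from the training set only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # test set reuses the training statistics
```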
14
votes
2 answers

Imputation of missing data before or after centering and scaling?

I want to impute missing values of a dataset for machine learning (knn imputation). Is it better to scale and center the data before the imputation or afterwards? Since the scaling and centering might rely on min and max values, in the first case…
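One way to keep the ordering explicit either way (a sketch; the scale-then-impute order shown here is one common choice, not a universal rule): chain both steps in a `Pipeline`, so whichever order you pick is applied identically to training and test data. `StandardScaler` disregards NaNs when fitting, so scaling before KNN imputation is feasible.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0], [5.0, np.nan]])

pipe = Pipeline([
    ("scale", StandardScaler()),      # NaNs are ignored in fit, kept in transform
    ("impute", KNNImputer(n_neighbors=2)),
])
X_out = pipe.fit_transform(X)
print(np.isnan(X_out).any())  # imputation fills the remaining gaps
```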
13
votes
1 answer

State-of-the-art in deduplication

What are the state-of-the-art methods in record deduplication? Deduplication is also sometimes called: record linkage, entity resolution, identity resolution, merge/purge. I know for example about CBLOCK [1]. I would appreciate if answers also…
Jakub Kotowski
  • 231
  • 2
  • 6
13
votes
3 answers

Why can't scikit-learn SVM solve two concentric circles?

Consider the following dataset (code for generating it is at the bottom of the post): Running the following code: from sklearn.svm import SVC model_2 = SVC(kernel='rbf', degree=2, gamma='auto', C=100) model_2.fit(X_train, y_train) print('accuracy…
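A reconstruction of the setup (the dataset parameters are assumptions, since the original code for generating it is truncated): an RBF-kernel SVC separates two concentric circles easily, and `degree` is simply ignored for any kernel other than `'poly'`.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = SVC(kernel="rbf", gamma="scale", C=100)  # degree has no effect with 'rbf'
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```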
13
votes
3 answers

What algorithms require one-hot encoding?

I'm never sure when to use one-hot encoding for non-ordered categorical variables and when not to. I use it whenever the algorithm uses a distance metric to compute similarity. Can anyone give a general rule of thumb as to what types of algorithms…
13
votes
2 answers

Neural Nets: One-hot variable overwhelming continuous?

I have raw data that has about 20 columns (20 features). Ten of them are continuous data and 10 of them are categorical. Some of the categorical data can have like 50 different values (U.S. States). After I pre-process the data the 10 continuous…
12
votes
1 answer

Question about subtracting mean on train/valid/test set

I'm doing data preprocessing and going to build a ConvNet on my data afterwards. My question is: say I have a total data set of 100 images. I was calculating the mean for each one of the 100 images and then subtracting it from each of the images, then…
Sam
  • 377
  • 2
  • 12
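A sketch contrasting the two conventions this question mixes up (the array shapes are stand-ins): subtracting a per-image mean versus computing one mean over the training set and reusing that same mean for validation/test images.

```python
import numpy as np

rng = np.random.default_rng(0)
train_images = rng.random((100, 8, 8))    # stand-in for 100 small grayscale images

# Option 1: per-image centering (each image gets its own mean subtracted)
per_image = train_images - train_images.mean(axis=(1, 2), keepdims=True)

# Option 2: one mean image from the training set, reused at validation/test time
dataset_mean = train_images.mean(axis=0)
centered = train_images - dataset_mean
```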
12
votes
3 answers

What is the best way to Reshape/Restructure Data?

I am a research assistant for a lab (volunteer). A small group of us has been tasked with the data analysis for a set of data pulled from a large study. Unfortunately the data were gathered with an online app of some sort, and it was not programmed…
Wilkoe
  • 139
  • 1
  • 5
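A common restructuring step for survey-style exports (a sketch with made-up columns): pandas `melt` reshapes one-column-per-measurement "wide" data into tidy "long" form, which most analysis tools prefer.

```python
import pandas as pd

wide = pd.DataFrame({"subject": [1, 2],
                     "score_pre": [10, 12],
                     "score_post": [15, 14]})

# One row per (subject, phase) measurement instead of one column per phase
long = pd.melt(wide, id_vars="subject", var_name="phase", value_name="score")
print(long.shape)  # (4, 3)
```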
11
votes
2 answers

What is bucketization?

I've been going around to find a clear explanation of "bucketization" in machine learning with no luck. What I understand so far is that bucketization is similar to quantization in digital signal processing, where a range of continuous values is…
MedAli
  • 257
  • 1
  • 4
  • 11
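A minimal illustration of bucketization, sometimes called binning (the boundaries here are arbitrary): continuous values are mapped to discrete bucket indices by a set of cut points.

```python
import numpy as np

ages = np.array([3, 17, 25, 40, 67, 82])
boundaries = [18, 35, 65]                 # buckets: <18, 18-34, 35-64, >=65

buckets = np.digitize(ages, boundaries)   # index of the bucket each value falls in
print(buckets)  # [0 0 1 2 3 3]
```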
10
votes
2 answers

Why do lots of people want to transform skewed data into normal distributed data for machine learning applications?

For image and tabular data, lots of people transform the skewed data into normally distributed data during preprocessing. What does the normal distribution mean in machine learning? Is it an essential assumption of machine learning algorithms?…
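A small illustration of the usual motivation (the numbers are synthetic): a log transform pulls in the long right tail of a skewed feature, which many models and tests handle better than raw heavy-tailed values.

```python
import numpy as np

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # heavy right tail
transformed = np.log1p(skewed)                            # roughly symmetric after log

# The maximum dwarfs the mean before the transform, far less so after it
print(skewed.max() / skewed.mean(), transformed.max() / transformed.mean())
```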
10
votes
3 answers

Automatic data cleansing

A common problem in ML is the poor quality of the data: errors in feature values, misclassified instances, etc. One way of addressing this problem is to manually go through the data and check, but are there other techniques? (I bet there are!)…
andreister
  • 3,257
  • 17
  • 29
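One semi-automated technique (an example choice, not the only one): flag likely outliers with an anomaly detector such as `IsolationForest`, then review only the flagged rows by hand instead of the whole dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),     # mostly clean data
               [[8.0, -9.0], [10.0, 10.0]]])   # two obvious entry errors

# -1 marks rows the detector considers anomalous
flags = IsolationForest(random_state=0, contamination=0.01).fit_predict(X)
print((flags == -1).sum())  # rows queued for manual review
```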
10
votes
3 answers

Why standardization of the testing set has to be performed with the mean and sd of the training set?

In pre-processing the data set before applying a machine learning algorithm the data can be centered by subtracting the mean of the variable, and scaled by dividing by the standard deviation. This is a straightforward process in the training set,…
Antoni Parellada
  • 23,430
  • 15
  • 100
  • 197
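A minimal numeric sketch of the point this question raises (toy values): the test set is standardized with the *training* mean and sd, so test features are not re-centered on themselves and shifts between the two sets stay visible to the model.

```python
import numpy as np

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0], [12.0]])

mu, sd = X_train.mean(axis=0), X_train.std(axis=0)  # training statistics only
X_test_std = (X_test - mu) / sd

# Large standardized values: the test points sit far from the training mean,
# which re-centering the test set on itself would have hidden.
print(X_test_std.ravel())
```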