Questions tagged [data-preprocessing]

The step of cleaning and preparing raw data for analysis in data mining

Data preprocessing is a data mining technique that involves transforming raw data into a format suitable for further analysis. Common issues that arise are inconsistencies and missing values.


Data preprocessing is used in database-driven applications such as customer relationship management and in rule-based applications. It is particularly important when implementing an artificial neural network.

459 questions
73
votes
3 answers

One-hot vs dummy encoding in Scikit-learn

There are two different ways to encode categorical variables. Say one categorical variable has n values. One-hot encoding converts it into n variables, while dummy encoding converts it into n-1 variables. If we have k categorical variables, each…
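A minimal sketch of the distinction the question describes, using pandas (the column name is illustrative): `get_dummies` produces n indicator columns by default and n-1 with `drop_first=True`.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

one_hot = pd.get_dummies(df["color"])                   # one-hot: n = 3 columns
dummy = pd.get_dummies(df["color"], drop_first=True)    # dummy: n - 1 = 2 columns

print(one_hot.shape[1])  # 3
print(dummy.shape[1])    # 2
```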
21
votes
2 answers

Does random forest need input variables to be scaled or centered?

My input variables have different dimensions. Some variables are decimal while some are hundreds. Is it essential to center (subtract mean) or scale (divide by standard deviation) these input variables in order to make the data dimensionless when…
YQ.Wang
  • 409
  • 1
  • 4
  • 11
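A quick empirical check of the usual answer (dataset and parameters here are arbitrary): tree-based models split on the ordering of feature values, not their magnitude, so standardizing the inputs should leave a seeded random forest's predictions unchanged.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)  # strictly monotonic per-feature transform

preds_raw = RandomForestClassifier(random_state=0).fit(X, y).predict(X)
preds_scaled = RandomForestClassifier(random_state=0).fit(X_scaled, y).predict(X_scaled)

# Same seed, same split structure: scaling does not change the forest's output.
print((preds_raw == preds_scaled).all())
```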
16
votes
4 answers

Cleaning data of inconsistent format in R?

I often deal with messy survey data which requires a lot of cleaning up before any statistics can be done. I used to do this "manually" in Excel, sometimes using Excel formulas, and sometimes checking entries one-by-one. I started doing more and…
mark999
  • 3,180
  • 2
  • 22
  • 31
15
votes
4 answers

Difference between preprocessing train and test set before and after splitting

Is there a difference between doing preprocessing for a dataset in sklearn before and after splitting data into train_test_split? In other words, are both of these approaches equivalent? from sklearn.preprocessing import StandardScaler from…
W.R.
  • 253
  • 1
  • 3
  • 8
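A sketch of the usual recommendation behind this question (dataset is synthetic): fit the scaler on the training split only, then apply the same fitted transform to the test split, so no test-set statistics leak into preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)   # statistics come from the training set only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # test set reuses the training statistics
```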
14
votes
2 answers

Imputation of missing data before or after centering and scaling?

I want to impute missing values of a dataset for machine learning (knn imputation). Is it better to scale and center the data before the imputation or afterwards? Since the scaling and centering might rely on min and max values, in the first case…
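One way to keep the ordering explicit either way (a sketch; the scale-then-impute order shown here is one common choice, not a universal rule): chain both steps in a `Pipeline`, so whichever order you pick is applied identically to training and test data. `StandardScaler` disregards NaNs when fitting, so scaling before KNN imputation is feasible.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0], [5.0, np.nan]])

pipe = Pipeline([
    ("scale", StandardScaler()),      # NaNs are ignored in fit, kept in transform
    ("impute", KNNImputer(n_neighbors=2)),
])
X_out = pipe.fit_transform(X)
print(np.isnan(X_out).any())  # imputation fills the remaining gaps
```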
13
votes
1 answer

State-of-the-art in deduplication

What are the state-of-the-art methods in record deduplication? Deduplication is also sometimes called: record linkage, entity resolution, identity resolution, merge/purge. I know for example about CBLOCK [1]. I would appreciate if answers also…
Jakub Kotowski
  • 231
  • 2
  • 6
13
votes
3 answers

Why can't scikit-learn SVM solve two concentric circles?

Consider the following dataset (code for generating it is at the bottom of the post): Running the following code: from sklearn.svm import SVC model_2 = SVC(kernel='rbf', degree=2, gamma='auto', C=100) model_2.fit(X_train, y_train) print('accuracy…
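A reconstruction of the setup (the dataset parameters are assumptions, since the original code for generating it is truncated): an RBF-kernel SVC separates two concentric circles easily, and `degree` is simply ignored for any kernel other than `'poly'`.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = SVC(kernel="rbf", gamma="scale", C=100)  # degree has no effect with 'rbf'
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```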
13
votes
3 answers

What algorithms require one-hot encoding?

I'm never sure when to use one-hot encoding for non-ordered categorical variables and when not to. I use it whenever the algorithm uses a distance metric to compute similarity. Can anyone give a general rule of thumb as to what types of algorithms…
13
votes
2 answers

Neural Nets: One-hot variable overwhelming continuous?

I have raw data that has about 20 columns (20 features). Ten of them are continuous data and 10 of them are categorical. Some of the categorical data can have like 50 different values (U.S. States). After I pre-process the data the 10 continuous…
12
votes
1 answer

Question about subtracting mean on train/valid/test set

I'm doing data preprocessing and going to build a ConvNet on my data afterwards. My question is: say I have a total data set of 100 images. I was calculating the mean for each one of the 100 images and then subtracting it from each of the images, then…
Sam
  • 377
  • 2
  • 12
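A sketch contrasting the two conventions this question mixes up (the array shapes are stand-ins): subtracting a per-image mean versus computing one mean over the training set and reusing that same mean for validation/test images.

```python
import numpy as np

rng = np.random.default_rng(0)
train_images = rng.random((100, 8, 8))    # stand-in for 100 small grayscale images

# Option 1: per-image centering (each image gets its own mean subtracted)
per_image = train_images - train_images.mean(axis=(1, 2), keepdims=True)

# Option 2: one mean image from the training set, reused at validation/test time
dataset_mean = train_images.mean(axis=0)
centered = train_images - dataset_mean
```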
12
votes
3 answers

What is the best way to Reshape/Restructure Data?

I am a research assistant for a lab (volunteer). A small group of us has been tasked with the data analysis for a set of data pulled from a large study. Unfortunately the data were gathered with an online app of some sort, and it was not programmed…
Wilkoe
  • 139
  • 1
  • 5
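A common restructuring step for survey-style exports (a sketch with made-up columns): pandas `melt` reshapes one-column-per-measurement "wide" data into tidy "long" form, which most analysis tools prefer.

```python
import pandas as pd

wide = pd.DataFrame({"subject": [1, 2],
                     "score_pre": [10, 12],
                     "score_post": [15, 14]})

# One row per (subject, phase) measurement instead of one column per phase
long = pd.melt(wide, id_vars="subject", var_name="phase", value_name="score")
print(long.shape)  # (4, 3)
```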
11
votes
2 answers

What is bucketization?

I've been going around to find a clear explanation of "bucketization" in machine learning with no luck. What I understand so far is that bucketization is similar to quantization in digital signal processing, where a range of continuous values is…
MedAli
  • 257
  • 1
  • 4
  • 11
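A minimal illustration of bucketization, sometimes called binning (the boundaries here are arbitrary): continuous values are mapped to discrete bucket indices by a set of cut points.

```python
import numpy as np

ages = np.array([3, 17, 25, 40, 67, 82])
boundaries = [18, 35, 65]                 # buckets: <18, 18-34, 35-64, >=65

buckets = np.digitize(ages, boundaries)   # index of the bucket each value falls in
print(buckets)  # [0 0 1 2 3 3]
```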
10
votes
2 answers

Why do lots of people want to transform skewed data into normal distributed data for machine learning applications?

For image and tabular data, lots of people transform the skewed data into normally distributed data during preprocessing. What does the normal distribution mean in machine learning? Is it an essential assumption of machine learning algorithms?…
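A small illustration of the usual motivation (the numbers are synthetic): a log transform pulls in the long right tail of a skewed feature, which many models and tests handle better than raw heavy-tailed values.

```python
import numpy as np

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # heavy right tail
transformed = np.log1p(skewed)                            # roughly symmetric after log

# The maximum dwarfs the mean before the transform, far less so after it
print(skewed.max() / skewed.mean(), transformed.max() / transformed.mean())
```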
10
votes
3 answers

Automatic data cleansing

A common problem in ML is the poor quality of the data: errors in feature values, misclassified instances, etc. One way of addressing this problem is to manually go through the data and check, but are there other techniques? (I bet there are!)…
andreister
  • 3,257
  • 17
  • 29
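One semi-automated technique (an example choice, not the only one): flag likely outliers with an anomaly detector such as `IsolationForest`, then review only the flagged rows by hand instead of the whole dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),     # mostly clean data
               [[8.0, -9.0], [10.0, 10.0]]])   # two obvious entry errors

# -1 marks rows the detector considers anomalous
flags = IsolationForest(random_state=0, contamination=0.01).fit_predict(X)
print((flags == -1).sum())  # rows queued for manual review
```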
10
votes
3 answers

Why standardization of the testing set has to be performed with the mean and sd of the training set?

In pre-processing the data set before applying a machine learning algorithm the data can be centered by subtracting the mean of the variable, and scaled by dividing by the standard deviation. This is a straightforward process in the training set,…
Antoni Parellada
  • 23,430
  • 15
  • 100
  • 197
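A minimal numeric sketch of the point this question raises (toy values): the test set is standardized with the *training* mean and sd, so test features are not re-centered on themselves and shifts between the two sets stay visible to the model.

```python
import numpy as np

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0], [12.0]])

mu, sd = X_train.mean(axis=0), X_train.std(axis=0)  # training statistics only
X_test_std = (X_test - mu) / sd

# Large standardized values: the test points sit far from the training mean,
# which re-centering the test set on itself would have hidden.
print(X_test_std.ravel())
```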