Feature engineering is the process of using domain knowledge of the data to create features for machine learning models. This tag is meant for both theoretical and practical questions about feature engineering, excluding questions that ask for code, which are off-topic on CrossValidated.
Questions tagged [feature-engineering]
698 questions
93
votes
6 answers
Principled way of collapsing categorical variables with many levels?
What techniques are available for collapsing (or pooling) many categories to a few, for the purpose of using them as an input (predictor) in a statistical model?
Consider a variable like college student major (discipline chosen by an undergraduate…

shadowtalker
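One simple way to pool many categories into a few, as the question above asks about, is to merge rarely seen levels into a single catch-all level. The sketch below is only an illustration of that idea, not a method taken from the answers; the function name `collapse_rare_levels`, the `min_count` threshold, the "Other" label, and the sample majors are all assumptions made up for the example.

```python
from collections import Counter


def collapse_rare_levels(values, min_count=2, other_label="Other"):
    """Pool levels seen fewer than min_count times into one catch-all level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Made-up sample of undergraduate majors (illustrative data only).
majors = ["Math", "Math", "Physics", "Math", "Classics", "Physics", "Dance"]
collapsed = collapse_rare_levels(majors, min_count=2)
print(collapsed)
```

Frequency-based pooling ignores the target variable, so it is a baseline rather than a principled answer; the threshold would normally be chosen on training data only.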
35
votes
8 answers
How to represent geography or zip code in a machine learning model or recommender system?
I am building a model and I think that geographic location is likely to be very good at predicting my target variable. I have the zip code of each of my users. I am not entirely sure about the best way to include zip code as a predictor feature in…

captain_ahab
33
votes
4 answers
Maximum Mean Discrepancy (distance distribution)
I have two data sets (source and target data) which follow different distributions. I am using MMD, a non-parametric distance between distributions, to compare the marginal distributions of the source and target data.
source data, Xs
target data,…

Mahsa
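For readers unfamiliar with the quantity in the question above, here is a minimal sketch of a biased (V-statistic) estimate of squared MMD between two samples. It assumes a Gaussian (RBF) kernel with a hand-picked bandwidth parameter `gamma`; the question excerpt does not say which kernel or estimator is used, so both are assumptions, as are the function names and the toy data.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, xt, gamma=1.0):
    """Biased estimate of squared MMD between samples xs (source) and xt (target)."""

    def mean_kernel(a, b):
        # Average kernel value over all pairs (p, q) with p in a and q in b.
        return sum(rbf_kernel(p, q, gamma) for p in a for q in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(xt, xt) - 2.0 * mean_kernel(xs, xt)


# Toy two-dimensional samples (illustrative data only).
Xs = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0)]
Xt = [(2.0, 2.1), (1.9, 2.2), (2.1, 1.8)]
print(mmd_squared(Xs, Xs))  # identical samples give zero
print(mmd_squared(Xs, Xt))  # well-separated samples give a clearly positive value
```

The double loop is quadratic in the sample sizes, which is fine for a sketch but would usually be replaced by vectorised kernel matrices for real data.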
31
votes
3 answers
Utility of feature engineering: Why create new features based on existing features?
I often see people create new features based on existing features in a machine learning problem.
For example, here: https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/ people have considered the size…

Matthieu Veron
29
votes
2 answers
When should we discretize/bin continuous independent variables/features and when should we not?
When should we discretize/bin independent variables/features and when should we not?
My attempts to answer the question:
In general, we should not bin, because binning will lose information.
Binning is actually increasing the degree of freedom of the…

Haitao Du
29
votes
2 answers
How to initialize the elements of the filter matrix?
I'm trying to better understand convolutional neural networks by writing Python code that doesn't depend on libraries (like Convnet or TensorFlow), and I'm getting stuck in the literature on how to choose values for the kernel matrix, when…

Kai Kuspa
25
votes
1 answer
What is "feature space"?
What is the definition of "feature space"?
For example,
When reading about SVMs, I read about "mapping to feature space".
When reading about CART, I read about "partitioning to feature space".
I understand what's going on, especially for CART, but I…

power
25
votes
2 answers
Autoencoders can't learn meaningful features
I have 50,000 images such as these two:
They depict graphs of data. I wanted to extract features from these images so I used autoencoder code provided by Theano (deeplearning.net).
The problem is, these autoencoders don't seem to learn any…

b93dh44
24
votes
1 answer
Optimal construction of day feature in neural networks
While working on a regression problem, I started to think about how to represent a "day of the week" feature. I wonder which approach would perform better:
one feature; value 1/7 for Monday; 2/7 for Tuesday...
7 features: (1, 0, 0, 0, 0, 0, 0) for Monday; (0,…

Oepas Dost
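The two encodings named in the question above can be sketched as follows: a single scalar feature (1/7 for Monday, 2/7 for Tuesday, and so on) and a seven-element one-hot vector. The function names and the `DAYS` list are assumptions introduced for illustration; the excerpt is truncated, so any further options it lists are not covered here.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def encode_scalar(day):
    """One feature: 1/7 for Monday, 2/7 for Tuesday, ..., 7/7 for Sunday."""
    return (DAYS.index(day) + 1) / 7


def encode_one_hot(day):
    """Seven features: a 1 in the position of the given day, 0 elsewhere."""
    vec = [0] * len(DAYS)
    vec[DAYS.index(day)] = 1
    return vec


print(encode_scalar("Tuesday"))
print(encode_one_hot("Monday"))
```

The scalar form imposes an ordering and places Sunday far from Monday, while the one-hot form treats every day as equally distant from every other; which behaves better depends on the model and the data.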
22
votes
3 answers
Why do neural networks need feature selection / engineering?
Particularly in the context of Kaggle competitions, I have noticed that a model's performance is all about feature selection / engineering. While I can fully understand why that is the case when dealing with the more conventional / old-school ML…

piotrwiercinski
22
votes
5 answers
Why does feature engineering work?
Recently I have learned that one way of finding better solutions to ML problems is to create new features. One can do that by, for example, summing two features.
For example, we possess two features "attack" and "defense" of some kind of hero.…

MrKadek750
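The summing idea mentioned in the question above might look like the sketch below, which derives one new feature from the "attack" and "defense" features. The feature name `total`, the helper name, and the hero values are hypothetical; the excerpt is truncated and does not say how the questioner actually builds or uses the feature.

```python
def add_sum_feature(rows):
    """Return copies of the rows with a derived 'total' = attack + defense."""
    return [{**row, "total": row["attack"] + row["defense"]} for row in rows]


# Made-up hero records (illustrative data only).
heroes = [
    {"attack": 7, "defense": 3},
    {"attack": 2, "defense": 9},
]
engineered = add_sum_feature(heroes)
print(engineered)
```

Whether such a feature helps depends on the model: a linear model can already represent a weighted sum of the two inputs, whereas a tree-based model has to approximate it with many axis-aligned splits.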
21
votes
2 answers
Tutorials for feature engineering
It is well known that feature engineering is extremely important to machine learning; however, I have found few materials on this area. I have participated in several Kaggle competitions and believe that good features may even be more important…

FindBoat
19
votes
5 answers
Is it better to do exploratory data analysis on the training dataset only?
I'm doing exploratory data analysis (EDA) on a dataset. Then I will select some features to predict a dependent variable.
The question is:
Should I do the EDA on my training dataset only? Or should I join the training and test datasets together…

Aboelnour
16
votes
2 answers
Mixing continuous and binary data with linear SVM?
So I've been playing around with SVMs and I wonder if this is a good thing to do:
I have a set of continuous features (0 to 1) and a set of categorical features that I converted to dummy variables. In this particular case, I encode the date of the…

user3010273
14
votes
1 answer
Feature construction and normalization in machine learning
Let's say I want to create a logistic classifier for a movie M.
My features would be something like age of the person, gender, occupation, location.
So training set would be something like:
Age Gender Occupation Location Like(1)/Dislike(0)
23 …

snow_leopard