
I have a data set of particle physics events; an event can be seen as a training instance. Each event contains various particles, and each particle has various characteristics (energy, momentum, etc.). An example particle is an electron. Not every event contains an electron, but when one does, the electron's characteristics are available: its characteristic values, such as momentum, are saved (e.g. 107425.323473). If no electron is in the event, its characteristic values are set to some code number (e.g. -999).
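For illustration, here is roughly how one event ends up being stored (the feature layout and the numbers below are made up):

```python
import numpy as np

# Hypothetical feature layout: [electron_energy, electron_momentum, muon_energy, muon_momentum]
event_with_electron = np.array([52314.7, 107425.323473, 48210.2, 91577.4])
event_without_electron = np.array([-999.0, -999.0, 48210.2, 91577.4])  # electron slots hold the code number

# Stacking events gives the training matrix fed to the model
events = np.stack([event_with_electron, event_without_electron])
print(events.shape)  # (2, 4)
```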

How should data like this be preprocessed (e.g. sklearn.preprocessing.MinMaxScaler(feature_range = (-1, 1)))? I am keen to use the data with a variety of deep learning algorithms in TensorFlow.

In a sense, I am asking how TensorFlow could be told that certain values of a tensor (or an image, or however the data is formulated) are inactive.

  • Events without photons are like unmarried people: variables describing the partner or the marriage are irrelevant: https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model/372258#372258 – kjetil b halvorsen Dec 15 '19 at 15:05

1 Answer


Dealing with missing values (inactive data points, NaNs, or NAs) is a major challenge in machine learning. There are two common strategies for dealing with it.

  1. Imputation by deletion: In this brute-force approach, you simply remove all rows and columns that contain any missing values. As is apparent, this approach may lead to a loss of information.

  2. Imputation by substitution: Here, missing values are substituted with values inferred from the existing data, e.g. the mean or median of a feature.

These strategies are summarised well in the scikit-learn documentation: http://scikit-learn.org/stable/modules/preprocessing.html#imputation-of-missing-values. Note that the choice of strategy depends on the type of data and the specific objectives of the machine learning task.
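As a minimal sketch of strategy 2, assuming the newer sklearn.impute API and the -999 code from your question (the data below is made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: -999 marks "no electron in this event"
X = np.array([
    [107425.3,  52314.7],
    [  -999.0,   -999.0],
    [ 98731.5,  61022.9],
])

# Replace the -999 sentinel with the per-feature mean of the observed values
imputer = SimpleImputer(missing_values=-999.0, strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

scikit-learn also offers `SimpleImputer(..., add_indicator=True)` (or `sklearn.impute.MissingIndicator`) to append a binary "was missing" column, which keeps the fact that no electron was present available to the model.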

  • Note that in the standard usage, it is **2.** that would be called [imputation](https://stats.stackexchange.com/questions/tagged/data-imputation)! – GeoMatt22 Apr 23 '17 at 08:18
  • Fixed! Thanks. However, I feel it's a kind of confusing misnomer. –  Apr 23 '17 at 08:41
  • I am no expert, but I would usually hear these as [deletion](https://en.wikipedia.org/wiki/Listwise_deletion) vs. [imputation](https://en.wikipedia.org/wiki/Imputation_(statistics)). So I have not heard of "imputation by deletion". The tag info here, Wikipedia, and your own scikit-learn link all seem to agree that "imputing" is *adding* new data (feature values), which is how I have heard it used. You can ask a [tag:terminology] question if you want the experts to weigh in, though :) – GeoMatt22 Apr 23 '17 at 08:48
  • @rraadd88 Thanks for that information! Is there some reasonable way to tell TensorFlow that some values are *inactive* for some training cases? Like, it would be flat out wrong to put in a momentum value (e.g. some mean value) for an electron that doesn't exist for a given training case. If I were to exclude the training case, I am excluding physics events that I need to be able to consider (i.e. the bias of excluding would be flat out unscientific). In a sense, it is meaningful that there are no values for electron momentum in some training cases. – BlandCorporation Apr 24 '17 at 18:10
  • As far as I understand, the missingness of the data (no electron) is actually meaningful in your case, so it would be better to keep it in the data. You can use alternative ways of preprocessing the data while retaining all the information (see the sketch after these comments). 1. Since you are rescaling between 0 and 1, if you substitute the missing data with the minimum of each feature (instead of the extreme value -999), it would become 0 and still be present in the data. 2. Standardize (z-score) the data instead of rescaling it. –  Apr 25 '17 at 04:12
  • See also https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model/372258#372258 – kjetil b halvorsen Dec 15 '19 at 15:30
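A rough sketch of the first suggestion from the comments: replace the -999 code with each feature's observed minimum, keep an explicit presence mask, and then rescale (the column layout and numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

SENTINEL = -999.0

# Toy events: -999 marks an absent electron
X = np.array([
    [107425.3, 52314.7],
    [SENTINEL, SENTINEL],
    [98731.5, 61022.9],
])

# Binary mask that keeps the "electron present / absent" information explicit
present = (X != SENTINEL).astype(float)

# Replace the sentinel with each feature's minimum over the observed values
X_filled = X.copy()
for j in range(X.shape[1]):
    observed = X[:, j] != SENTINEL
    X_filled[~observed, j] = X[observed, j].min()

# Rescale to [0, 1]; absent values now sit at the lower edge instead of at -999
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_filled)

# Feed both the scaled values and the presence mask to the network
X_model_input = np.hstack([X_scaled, present])
print(X_model_input)
```

Whether the presence mask helps depends on the model, but for a dense network it is a cheap way to encode "this particle was not in the event" without inventing a physical value for it.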