
I'm building a simple neural network that has worked pretty well when I pass it variables measured in time. Now I want to add other quantitative variables to the model that aren't time measures, and I was told that in order to do that I needed to normalize my data.

I have normalized my time variables as suggested in this post, and it was brought to my attention how small the ranges of my data have become:

Variable 1 goes from 0.0 to 9.727535e-01

Variable 2 goes from 0.092662 to 0.165131

EDIT:

Everything just seemed to be fine: the max and min values for each column were getting retrieved correctly and the arithmetic was fine. So I decided to print the min and max values of the transformed dataset, and they were 0.0 and 1.0 respectively for every column.

Everything was fine all along. I thought I had made a mistake because I was printing my dataframe in the console, and at first glance I said "whoa, this doesn't look like it has 0.0 and 1.0 as min and max values", but the truth is that the console doesn't print every row of the dataframe if the number of rows surpasses a certain threshold.
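
For reference, a check along these lines (just a sketch; the values below are made up and df stands in for my real data) shows the true per-column extremes without relying on the truncated console printout:

    import pandas as pd

    # made-up data standing in for the real columns
    df = pd.DataFrame({'var1': [0.0, 3.1, 9.7], 'var2': [0.09, 0.12, 0.16]})

    # per-column min-max normalization
    df_norm = (df - df.min()) / (df.max() - df.min())

    # .min()/.max() aggregate over every row, so the console truncation is irrelevant here
    print(df_norm.min())  # 0.0 for each column
    print(df_norm.max())  # 1.0 for each column

    # to actually print every row in the console instead:
    # pd.set_option('display.max_rows', None)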

  • If you are trying to get a range of 0-1 I think you must have a mistake in your algorithm. – mdewey May 17 '18 at 13:34
  • IMO, when you normalize time and use it for machine learners, it helps if you add two columns per measure of cyclicity: one for the time cosine, one for the time sine. Week, month, quarter, year... if you have the sines/cosines for those, you can let the learner decide which are the relevant ones. – EngrStudent May 17 '18 at 15:06
  • @EngrStudent I have to clarify that it's time measured in seconds. Not a date – Esteban Vargas May 17 '18 at 15:19
  • @EstebanVargas - they are the same thing. You are saying the equivalent of "but I'm using the metric system, not Imperial" and I am saying "think about expressing it in terms of the speed of light". The physical system is going the same speed regardless of whether it is measured in meters per second or furlongs per fortnight. – EngrStudent May 17 '18 at 16:13
  • I discourage you in the *strongest possible terms* from normalizing your data according to the post you linked (i.e., dividing by the range). That is a terrible idea because if you have outliers they will impact the scaled data dramatically, which will pretty much doom your model to non-generalizability. The proper way to do this is to subtract the mean and then divide by the standard deviation, which is much less sensitive to individual outliers. – Josh May 17 '18 at 16:58
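
A minimal sketch of the subtract-the-mean, divide-by-the-standard-deviation scaling described in the last comment (purely illustrative values, pandas assumed):

    import pandas as pd

    # purely illustrative values
    df = pd.DataFrame({'var1': [0.0, 3.1, 9.7], 'var2': [0.09, 0.12, 0.16]})

    # standardization (z-score): subtract the mean, divide by the standard deviation
    df_std = (df - df.mean()) / df.std()

    print(df_std.mean())  # approximately 0 for each column
    print(df_std.std())   # 1 for each column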

2 Answers


It is not necessarily incorrect; it depends on what your original values are (assuming you are computing the right formula). I would recommend you share the histograms of the unnormalized and normalized data (I write this recommendation here because I do not have enough reputation to comment yet).

On the other hand, I would recommend you post a dedicated question about whether your approach to mixing variables is the correct one.
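
Something as simple as the following would produce the histograms I have in mind (a sketch; df and the values are made up, so replace them with your actual data):

    import pandas as pd
    import matplotlib.pyplot as plt

    # made-up data standing in for the raw dataset
    df = pd.DataFrame({'var1': [0.0, 3.1, 9.7], 'var2': [0.09, 0.12, 0.16]})
    df_norm = (df - df.min()) / (df.max() - df.min())

    df.hist()        # histograms of the unnormalized columns
    df_norm.hist()   # histograms of the normalized columns
    plt.show()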

srcolinas

Normalizing the data means rescaling your inputs to the closed range [0,1]. As you may know, this is done to weight all your input features equitably. If you're getting your inputs rescaled into some other range, then your process is probably incorrect.

One reason why it could be failing is that you could be confusing normalization with standardization. Standardization is the process of removing the mean and scaling to unit variance. However, that would imply that the mean of your rescaled inputs is zero (0), which is not the case.

Another possibility is that you're normalizing all features together, when it should be done independently for each feature. If that were the case, at least one input value would be equal to zero (the global minimum) and another equal to one (the global maximum), which is not the case either.

Could you provide some code examples of what you're doing to normalize your data? I believe it is the fastest way we can help you.

I suggest you calculate the normalized values for only the minimum and the maximum of each input, using your dataset, and verify that they return 0 and 1 respectively.
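
For instance, the difference between normalizing each feature independently and normalizing everything together looks roughly like this (a sketch with made-up values, since I don't know how your data is stored):

    import pandas as pd

    # made-up values standing in for your dataset
    df = pd.DataFrame({'var1': [0.0, 3.1, 9.7], 'var2': [0.09, 0.12, 0.16]})

    # normalize each feature independently
    per_feature = (df - df.min()) / (df.max() - df.min())
    print(per_feature.min())  # 0.0 for every column
    print(per_feature.max())  # 1.0 for every column

    # normalizing with the global min/max of the whole table instead
    # (usually not what you want): only the column holding the global
    # extremes reaches 0 and 1
    global_scaled = (df - df.values.min()) / (df.values.max() - df.values.min())
    print(global_scaled.min())
    print(global_scaled.max())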

Diego