0

I have a scaling problem. Say my target variable is a net revenue column with a range of (-34624455, 298878399), so max − min = 333502854.

Now in the test set, I have a record whose revenue value is 2185, which normalizes to 0.1038.

For this record, suppose a simple linear regression predicts 0.1037 (unlikely, but let's assume it). Inverting the scaling gives -40209.0402, which is nowhere near the actual value of 2185. I understand this is because of the huge range, but how do I scale this sort of data? I tried removing outliers, thinking that might reduce the effect of the range, but even in the outlier-free subset the range is still huge and I see the same effect: the predicted value in its normalized/scaled form is close to the normalized/scaled actual, yet after converting back to the original scale it is not even close. What kind of scaling techniques should I use for this kind of data?

For now I used simple min-max scaling: (x - min) / (max - min).

Steps listed below:

2185 - (-34624455) = 34626640           # subtracting the min value
34626640 / 333502854 = 0.103827117      # dividing by the range

Assume the predicted value is 0.1037

0.1037 * 333502854 = 34584245.96        # multiplying by the range
34584245.96 + (-34624455) = -40209.0402 # adding back the min value

If I instead assume the predicted value is 0.103827116, which matches the actual scaled value to eight decimal places, then the inverse-scaled value is close to the actual.
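The amplification is easy to see numerically: an error of e in scaled space becomes e × range in the original units. A minimal Python sketch of the round trip above, using the values from the question:

```python
# Min-max scaling round trip with the question's values.
x_min, x_max = -34624455, 298878399
rng = x_max - x_min  # 333502854

actual = 2185
scaled = (actual - x_min) / rng        # ~0.103827117

predicted_scaled = 0.1037              # off by only ~0.000127 in scaled space
recovered = predicted_scaled * rng + x_min  # ~-40209.04

# The gap in original units is the scaled error times the range:
gap = (scaled - predicted_scaled) * rng     # ~42394, i.e. actual - recovered
```

So even a fourth-decimal error in scaled space translates to tens of thousands in revenue units; that is arithmetic, not a bug in the scaler.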

I hope this makes the problem clearer. I am looking for pointers to more appropriate scaling methods, as the min-max and standardization techniques are clearly not working for this dataset.

marshal
  • It appears that the de-scaling procedure is not the exact reverse of the scaling procedure. I suggest trying with a simpler test case to verify each step in your procedure, using a small amount of simpler data. – James Phillips Feb 09 '18 at 13:18
  • To put it flatly: you merely need to add and multiply correctly. – whuber Feb 09 '18 at 15:07
  • Please see the edit for updated details. – marshal Feb 09 '18 at 19:08
  • Your calculations are erroneous because you are working with nine-digit numbers and have lost the fourth significant digit: the correct normalized value is 0.1038271, not 0.1037. What, then, is the problem? – whuber Feb 09 '18 at 19:45
  • I thought I mentioned what I was looking for when I originally posted this question. I understand why the data is not being scaled properly; my issue is not with the calculation. I edited my question to add the calculation details to show that I am not doing anything "wrong", either in my multiplications and divisions or in my reversal of the scaling process. I also agree that those are indeed the expected values. – marshal Feb 09 '18 at 21:08
  • My question was more about how to deal with this sort of data. Would I have to arrive at predictions with nine digits of precision to get approximately correct actual values? Or is there some other scaling approach I can use? – marshal Feb 09 '18 at 21:08

1 Answer

0

I am not able to reproduce the problem you are describing. Testing it in Python with the following code:

from sklearn.preprocessing import MinMaxScaler

data = [[-34624455], [2185], [298878399]]
scaler = MinMaxScaler(feature_range=(0, 1))
print(scaler.fit(data))

data = scaler.transform(data)
print("transform: \n", data)
data = scaler.inverse_transform(data)
print("inverse: \n", data)

I get the following output:

MinMaxScaler(copy=True, feature_range=(0, 1))
transform: [[ 0. ] [ 0.10382712] [ 1. ]]
inverse: [[ -3.46244550e+07] [ 2.18500000e+03] [ 2.98878399e+08]]

which seems to be exactly the behavior we want from the scaler.
However, when I tried the scaling myself on a pocket calculator, I also got a different result; in that case I would assume it has something to do with the finite precision of floating-point arithmetic.
How did you implement the scaling, and what program did you use?

Another scaling technique, which you probably already know, is mean scaling, which shouldn't have the same problem. It is difficult to recommend scaling techniques, however, without knowing what you need the scaled variables for. Scaling input variables is quite common, but there are not that many uses for scaled target variables (see the discussion here).
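For reference, a minimal sketch of mean normalization (one common reading of "mean scaling"), which centers on the mean rather than the minimum; this is only an illustration using the question's three values:

```python
import numpy as np

x = np.array([-34624455.0, 2185.0, 298878399.0])

# Mean normalization: subtract the mean, divide by the range.
mean = x.mean()
rng = x.max() - x.min()
scaled = (x - mean) / rng

# The inverse transform recovers the originals up to float precision.
recovered = scaled * rng + mean
```

Note that the denominator is still the full range, so a given prediction error in scaled space maps back to the same error in original units as with min-max scaling.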

Bobipuegi
  • Try applying the inverse scaling to a value close to 0.10382712, such as 0.1037; you will see that the recovered values are not even close. That link was helpful, though. Maybe I will try without applying any scaling at all. – marshal Feb 09 '18 at 18:09