
I have been working on a small machine learning project and decided to use regression algorithms to solve the problem; however, I have run into some issues. Let me show some information about the data first.

Sample for data: https://i.imgur.com/dobPGjV.png

Skewness of input:

Total_mass    3.673021
CSWire        2.989812
CSBNR         6.708204
CSSheet       2.561045
LASGeneral    0.000000

Note: except for the first input feature (i.e. Total_mass), which is numeric, all other input features are binary.

Skewness of all the output variables:

RMS_E       2.900609
RMS_ECF     3.061536
PS_E        4.465471
PS_CF       4.461390
Total_CF    2.813939

Data shape: (45, 39)


I chose the LOOCV method for cross-validation, and the scoring used is 'neg_mean_squared_error'.
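For reference, here is a minimal sketch of that evaluation setup; the synthetic DataFrame and the ExtraTreesRegressor below are stand-ins for my actual data and models, which I haven't shown:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Stand-in for the real 45 x 39 dataset (34 inputs, 5 outputs).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((45, 34)))
y = pd.DataFrame(rng.random((45, 5)),
                 columns=['RMS_E', 'RMS_ECF', 'PS_E', 'PS_CF', 'Total_CF'])

model = ExtraTreesRegressor(random_state=0)

# LOOCV: one fold per sample; each fold's score is the negated
# squared error of the single held-out observation (averaged over
# the 5 outputs).
scores = cross_val_score(model, X, y,
                         cv=LeaveOneOut(),
                         scoring='neg_mean_squared_error')

mse = -scores  # undo the sign convention of 'neg_mean_squared_error'
print('LOOCV Mean:', mse.mean())
print('Standard deviation:', mse.std())
```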

Due to the high positive skewness of the numeric input and the outputs, I decided to apply a log transformation.

The following are some results I got before applying the log transformation (with log base 0.01):

Model to be observed: ExtraTreeReg
One-time train-test split RMSE: 0.1309910178784277
R2 score: 0.9921648674086633
LOOCV Mean: 0.24206091710968433
Standard deviation: 0.7444181240405969

Model to be observed: RandForestReg
One-time train-test split RMSE: 0.4138684511440412
R2 score: 0.9296646783476964
LOOCV Mean: 0.2763118063179837
Standard deviation: 0.6038208015159212

Model to be observed: DecisionTreeReg
One-time train-test split RMSE: 0.10133503115117631
R2 score: 0.97932598029861
LOOCV Mean: 0.17969090902445078
Standard deviation: 0.47558247881583926


I then applied the log transformation with log base 0.01 to 'Total_mass', 'RMS_E', 'RMS_ECF', 'PS_E', 'PS_CF' and 'Total_CF'.
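NumPy has no built-in logarithm with base 0.01, so I computed it via the change-of-base formula; a sketch with stand-in data:

```python
import numpy as np
import pandas as pd

# Stand-in data: strictly positive values, as a log requires.
rng = np.random.default_rng(0)
cols = ['Total_mass', 'RMS_E', 'RMS_ECF', 'PS_E', 'PS_CF', 'Total_CF']
df = pd.DataFrame(rng.lognormal(size=(45, len(cols))), columns=cols)

# Change of base: log_0.01(x) = ln(x) / ln(0.01).
# Since ln(0.01) < 0, the transform also flips the sign of the values
# (and of the skewness), on top of compressing large values.
df[cols] = np.log(df[cols]) / np.log(0.01)

print(df[cols].skew())
```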

Skewness of features:

Total_mass   -0.226483
RMS_E        -0.384771
RMS_ECF      -0.367019
PS_E          0.200566
PS_CF         0.150737
Total_CF     -0.201469

Model results:

Model to be observed: ExtraTreeReg
One-time train-test split RMSE: 0.10554400685877796
R2 score: 0.9263762125663704
LOOCV Mean: 0.07271451298683669
Standard deviation: 0.09802719638616668

Model to be observed: RandForestReg
One-time train-test split RMSE: 0.16691774416704228
R2 score: 0.8284156287973765
LOOCV Mean: 0.13346803792820663
Standard deviation: 0.11718363334681511

Model to be observed: DecisionTreeReg
One-time train-test split RMSE: 0.20610638889224905
R2 score: 0.7327556879115033
LOOCV Mean: 0.12482704851937168
Standard deviation: 0.1374598872952687

Question 1: There seems to be a huge improvement in the predictions according to these 3 models; however, I'm not sure whether applying a log transformation to the output is allowed, and whether the results are valid.

Question 2: My model has multiple outputs. If I apply the log transformation to all outputs, how should I interpret the LOOCV Mean (for example) and the SD values that I got?
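To make the question concrete, here is a sketch (again with stand-in data) of computing per-output errors and mapping predictions back to the original units via the inverse transform x = 0.01**z; I am not sure whether this is the right way to interpret the scores:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneOut, cross_val_predict

out_cols = ['RMS_E', 'RMS_ECF', 'PS_E', 'PS_CF', 'Total_CF']
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((45, 34)))  # stand-in inputs
y_log = pd.DataFrame(np.log(rng.lognormal(size=(45, 5))) / np.log(0.01),
                     columns=out_cols)  # already on the log-0.01 scale

# Each sample is predicted exactly once by the model trained on the
# other 44 samples.
pred_log = cross_val_predict(ExtraTreesRegressor(random_state=0),
                             X, y_log, cv=LeaveOneOut())

# Per-output MSE on the log scale, instead of one number averaged
# over all five outputs.
print(mean_squared_error(y_log, pred_log, multioutput='raw_values'))

# Inverse transform: z = log_0.01(x)  =>  x = 0.01 ** z.
# Errors computed here are in the original units, so they could be
# compared against the untransformed models.
print(mean_squared_error(0.01 ** y_log, 0.01 ** pred_log,
                         multioutput='raw_values'))
```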

I would appreciate it if either question could be answered, thank you very much.


UPDATE:

I've just read this thread: In linear regression, when is it appropriate to use the log of an independent variable instead of the actual values?

It seems it is reasonable to apply a log transformation to either an input or an output as long as the data is highly skewed. But another question comes to mind: does the model look like it is over-fitting?

If it doesn't, the next steps should be to select 2-3 algorithms with good performance, then apply different ensemble methods to enhance the selected models, and finally tune the hyperparameters, am I right?
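For concreteness, a sketch of what I imagine that pipeline could look like, i.e. a voting ensemble of two of the stronger models followed by a grid search; the parameter grid is only an example:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (ExtraTreesRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.model_selection import GridSearchCV, LeaveOneOut

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((45, 34)))           # stand-in inputs
y = pd.Series(rng.random(45), name='Total_CF')   # stand-in single output

# Simple averaging ensemble over two of the stronger base models.
ensemble = VotingRegressor([
    ('extra', ExtraTreesRegressor(random_state=0)),
    ('rf', RandomForestRegressor(random_state=0)),
])

# Illustrative grid only; the real ranges would need experimentation.
param_grid = {
    'extra__n_estimators': [100, 300],
    'rf__n_estimators': [100, 300],
    'rf__max_depth': [None, 3, 5],
}

search = GridSearchCV(ensemble, param_grid,
                      cv=LeaveOneOut(),
                      scoring='neg_mean_squared_error')
search.fit(X, y)
print(search.best_params_, 'LOOCV MSE:', -search.best_score_)
```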


Results of the model with untransformed variables and only 1 output feature 'Total_CF':

Model to be observed: ExtraTree
One-time train-test split RMSE: 0.032427326748594255
R2 score: 0.9868763161745056
LOOCV Mean: 0.06429231345199997
Standard deviation: 0.16596915107904692

Model to be observed: RandForest
One-time train-test split RMSE: 0.08333369356252708
R2 score: 0.9133288166659663
LOOCV Mean: 0.07519662305255939
Standard deviation: 0.1517216575168036

Model to be observed: DecisionTree
One-time train-test split RMSE: 0.020437017516079452
R2 score: 0.9947872307851834
LOOCV Mean: 0.06608551638333332
Standard deviation: 0.13634908323180886

Comments:

  • I must be missing something, because I don't understand the motivation behind transforming binary features. Moreover, the error metrics will of course change as a function of the transformation, so the metrics of the two approaches are not directly comparable; the predictions of one method or the other have to be back-transformed for comparability. Finally, the R-squares for the *before* models are much higher than those of the *after* models. – user332577 May 02 '20 at 16:28
  • @user332577 I'm sorry, I forgot to mention that the binary features were not transformed. Besides, what if the RMSE and LOOCV mean on the transformed data are close together and lower, even though the R2 score is lower than for the model on the untransformed data? Should I still go for the model with the lower R2 score? Anyway, thanks for your reply. – Lam May 02 '20 at 16:49
  • How do you determine there is "a huge improvement"? You cannot use $R^2$, SD, or RMSE to compare any of these models--they represent entirely different things when you have transformed the response. – whuber May 02 '20 at 17:07
  • @whuber I thought the original and log-transformed models could be directly compared using these metrics; thanks for your guidance. It seems comparing the actual performance of these two models will be a tough job. I'm still a newbie in the statistics domain. – Lam May 02 '20 at 17:59
  • @whuber If you compare nested models that have the same transformation of the response variable, then are $R^2$ and $RMSE$ as valid as usual? In that case, it’s as if we got handed the data already containing that transformation and were told to model it (it seems). – Dave May 02 '20 at 18:10
  • @Dave No. The reason is that both of these measure completely different things: one of them is a reduction of variance of the response and the other measures reduction of variance of the log response. – whuber May 02 '20 at 18:35
  • @whuber But if the models are nested with both predicting log response, then $R^2$ is measuring the proportion of explained variance of log response in both cases, isn’t it? – Dave May 02 '20 at 18:46
  • @whuber Well, my original project aim is only to predict the 'Total_CF' column, as it depends on the calculation of 'RMS_E', 'RMS_ECF', 'PS_E' and 'PS_CF'. I removed these 4 output features from the model and trained it again, and the results seem acceptable. I had thought that if I included those 4 features among the outputs, they would be positively correlated with the 5th output feature 'Total_CF' (since it is computed from them), so the model could predict more accurately; it seems I was wrong. The results of that model are appended at the end of the thread. – Lam May 03 '20 at 01:42
  • @whuber I also want to know whether there is any statistical or logical way to determine if an algorithm is really doing its job and giving good performance. I drew my conclusion from RMSE and MSE, since I saw a post that used a one-time train-test split to get RMSE and k-fold CV to get MSE. Is that reliable? Do we need to make any assumptions, especially for extremely small datasets? The post is here: https://www.kaggle.com/travelcodesleep/end-to-end-regression-pipeline-using-scikitlearn – Lam May 03 '20 at 03:18
  • @Dave It's unclear in what sense the models might be "nested" because they propose radically different formulations of the errors. – whuber May 03 '20 at 16:19
  • @whuber $\log(y)=\beta_0+\beta_1x_1+\epsilon$ is nested within $\log(y)=\beta_0+\beta_1x_1+\beta_2x_2+\epsilon$, isn't it? And if not, why not? If I wrote $z=$ instead of $\log(y)=$, then they'd be nested. – Dave May 03 '20 at 16:28
  • @Dave Yes, but AFAIK that's not the pair of models under discussion here. One of them is $y=\alpha_0+\alpha_1 x + \delta$ and the other is $\log(y)=\beta_0+\beta_1\log(x) + \varepsilon \ne \log(\alpha_0+\alpha_1 x + \delta).$ – whuber May 03 '20 at 20:46
