I have been working on a small machine learning project and decided to use regression algorithms to solve it; however, I have run into some problems. Let me first show some information about the data.
Sample of the data: https://i.imgur.com/dobPGjV.png
Skewness of the input features:
Total_mass 3.673021
CSWire 2.989812
CSBNR 6.708204
CSSheet 2.561045
LASGeneral 0.000000
Note: the first input feature (i.e. Total_mass) is numeric; all other input features are binary.
Skewness of all the output variables:
RMS_E 2.900609
RMS_ECF 3.061536
PS_E 4.465471
PS_CF 4.461390
Total_CF 2.813939
Data shape:
(45, 39)
I chose LOOCV (leave-one-out cross-validation), and the scoring used is 'neg_mean_squared_error'.
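For reference, the cross-validation setup looks roughly like this (a minimal sketch with scikit-learn; the data below is a synthetic stand-in, and the exact way I aggregate the fold scores may differ slightly):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
# Synthetic stand-in data (shapes illustrative only): 45 samples, multi-output target
X = rng.lognormal(size=(45, 10))
y = rng.lognormal(size=(45, 5))

# One score per left-out sample: negated MSE, averaged over the outputs
loo = LeaveOneOut()
model = RandomForestRegressor(random_state=42)
scores = cross_val_score(model, X, y, cv=loo, scoring='neg_mean_squared_error')

# The 'LOOCV Mean' and 'Standard deviation' reported below come from these fold scores
print("LOOCV Mean:", -scores.mean())
print("Standard deviation:", scores.std())
```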
Due to the high positive skewness of the numeric input and the outputs, I decided to apply a log transformation.
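The transformation itself is straightforward; here is a minimal sketch of how it could be applied with pandas/numpy (the DataFrame below is a synthetic stand-in, and the column list matches the skewed numeric columns in my data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the skewed numeric columns (real values come from my dataset)
skewed_cols = ['Total_mass', 'RMS_E', 'RMS_ECF', 'PS_E', 'PS_CF', 'Total_CF']
df = pd.DataFrame(rng.lognormal(mean=0.0, sigma=1.0, size=(45, len(skewed_cols))),
                  columns=skewed_cols)

print("Skewness before:\n", df[skewed_cols].skew())

# Log transformation with base 0.01: log_0.01(x) = ln(x) / ln(0.01)
df[skewed_cols] = np.log(df[skewed_cols]) / np.log(0.01)

print("Skewness after:\n", df[skewed_cols].skew())
```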
The following are some results I got before applying the log transformation (log base 0.01):
Model to be observed: ExtraTreeReg
One time train-test-split RMSE: 0.1309910178784277
R2score: 0.9921648674086633
LOOCV Mean: 0.24206091710968433
Standard deviation: 0.7444181240405969
Model to be observed: RandForestReg
One time train-test-split RMSE: 0.4138684511440412
R2score: 0.9296646783476964
LOOCV Mean: 0.2763118063179837
Standard deviation: 0.6038208015159212
Model to be observed: DecisionTreeReg
One time train-test-split RMSE: 0.10133503115117631
R2score: 0.97932598029861
LOOCV Mean: 0.17969090902445078
Standard deviation: 0.47558247881583926
After the log transformation (log base 0.01) was applied to 'Total_mass', 'RMS_E', 'RMS_ECF', 'PS_E', 'PS_CF' and 'Total_CF':
Skewness after the transformation:
Total_mass -0.226483
RMS_E -0.384771
RMS_ECF -0.367019
PS_E 0.200566
PS_CF 0.150737
Total_CF -0.201469
Model results:
Model to be observed: ExtraTreeReg
One time train-test-split RMSE: 0.10554400685877796
R2score: 0.9263762125663704
LOOCV Mean: 0.07271451298683669
Standard deviation: 0.09802719638616668
Model to be observed: RandForestReg
One time train-test-split RMSE: 0.16691774416704228
R2score: 0.8284156287973765
LOOCV Mean: 0.13346803792820663
Standard deviation: 0.11718363334681511
Model to be observed: DecisionTreeReg
One time train-test-split RMSE: 0.20610638889224905
R2score: 0.7327556879115033
LOOCV Mean: 0.12482704851937168
Standard deviation: 0.1374598872952687
Question 1: There seems to be a huge improvement in the predictions across these 3 models; however, I'm not sure whether applying a log transformation to the outputs is allowed and whether the results are valid.
Question 2: My model has multiple outputs. If I apply the log transformation to all of the outputs, how can I interpret the LOOCV mean (for example) and the SD values that I got?
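To make the question concrete, here is a sketch (assuming cross_val_predict and the base-0.01 transform above; the data is synthetic) of how the per-output error could be put back on the original scale. Whether this is the right way to interpret the numbers is exactly what I'm unsure about:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
# Synthetic stand-ins: raw inputs, raw multi-output targets, and their log(base 0.01) transform
X = rng.lognormal(size=(45, 10))
y_raw = rng.lognormal(size=(45, 5))
y_log = np.log(y_raw) / np.log(0.01)

# Out-of-sample prediction for every sample via LOOCV, on the log scale
model = RandomForestRegressor(random_state=42)
pred_log = cross_val_predict(model, X, y_log, cv=LeaveOneOut())

# Back-transform: the inverse of log_0.01(x) is 0.01 ** x
pred_raw = 0.01 ** pred_log

# Per-output RMSE on the original scale, instead of one pooled score on the log scale
rmse_per_output = np.sqrt(((pred_raw - y_raw) ** 2).mean(axis=0))
print(rmse_per_output)
```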
I would appreciate an answer to either question, thank you very much.
UPDATE:
I've just read this thread: In linear regression, when is it appropriate to use the log of an independent variable instead of the actual values?
It seems it is reasonable to apply a log transformation to inputs or outputs as long as the data is highly skewed. But another question comes to mind: do the models look like they are over-fitting?
If not, the next step should be to select 2-3 algorithms with good performance, then apply different ensemble methods to enhance the selected models, and finally tune the hyperparameters. Am I right?
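To make that plan concrete, here is a rough sketch of what I have in mind (synthetic single-output data; the estimators and parameter grid are only illustrative, not my actual configuration):

```python
import numpy as np
from sklearn.ensemble import VotingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV, LeaveOneOut

rng = np.random.default_rng(2)
# Synthetic stand-in, single output (e.g. 'Total_CF')
X = rng.lognormal(size=(45, 10))
y = rng.lognormal(size=45)

# Step 1: combine the better-performing models into a simple averaging ensemble
ensemble = VotingRegressor([
    ('et', ExtraTreesRegressor(random_state=42)),
    ('rf', RandomForestRegressor(random_state=42)),
    ('dt', DecisionTreeRegressor(random_state=42)),
])

# Step 2: tune hyperparameters with the same LOOCV / negative-MSE setup
param_grid = {
    'et__n_estimators': [50, 100],
    'rf__n_estimators': [50, 100],
    'dt__max_depth': [3, 5, None],
}
search = GridSearchCV(ensemble, param_grid, cv=LeaveOneOut(),
                      scoring='neg_mean_squared_error')
search.fit(X, y)
print(search.best_params_, search.best_score_)
```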
Results of the models with the untransformed variables and only one output feature, 'Total_CF':
Model to be observed: ExtraTree
One time train-test-split RMSE: 0.032427326748594255
R2score: 0.9868763161745056
LOOCV Mean: 0.06429231345199997
Standard deviation: 0.16596915107904692
Model to be observed: RandForest
One time train-test-split RMSE: 0.08333369356252708
R2score: 0.9133288166659663
LOOCV Mean: 0.07519662305255939
Standard deviation: 0.1517216575168036
Model to be observed: DecisionTree
One time train-test-split RMSE: 0.020437017516079452
R2score: 0.9947872307851834
LOOCV Mean: 0.06608551638333332
Standard deviation: 0.13634908323180886