I'm working on a regression problem in which I am trying to predict wine quality from the wine's other attributes. (The quality score is the median of the ratings given by 3 wine-tasting experts, each rating the wine out of 10.)
My problem: after fitting a linear regression with sklearn using the alcohol variable, my coefficient of determination was only 0.2. To improve this:
- I have tried multiple linear regression with several other variables (volatile acidity, density, etc.), but the correlation only gets as high as 0.27.
- I have tried standardising and removing outliers. The steps I took were: a) taking the natural log, b) computing the z-score, c) removing values outside 1.5*IQR (Tukey's fences).
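To make the steps above concrete, here is a minimal sketch of what such a pipeline might look like. It is illustrative only: the data is synthetic (not the real wine dataset), the variable names `alcohol` and `quality` are placeholders, and applying the Tukey fences to the z-scored log values is an assumption about the order of the steps.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the real wine data): a positive, slightly
# skewed predictor and a noisy, rounded quality score between 0 and 10.
rng = np.random.default_rng(0)
alcohol = rng.lognormal(mean=2.3, sigma=0.1, size=500)
quality = np.clip(np.round(0.5 * alcohol + rng.normal(0, 1.0, size=500)), 0, 10)

# a) natural log of the predictor
log_alcohol = np.log(alcohol)

# b) z-score: subtract the mean, divide by the standard deviation
z = (log_alcohol - log_alcohol.mean()) / log_alcohol.std()

# c) Tukey's fences: keep values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(z, [25, 75])
iqr = q3 - q1
keep = (z >= q1 - 1.5 * iqr) & (z <= q3 + 1.5 * iqr)

X = z[keep].reshape(-1, 1)  # sklearn expects a 2-D feature array
y = quality[keep]

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)             # coefficient of determination (R^2)
r = np.corrcoef(X[:, 0], y)[0, 1]  # Pearson correlation coefficient

print(f"R^2 = {r2:.3f}, r = {r:.3f}")
```

Note that `LinearRegression.score` returns R^2, not a correlation. With a single predictor and an intercept, the in-sample R^2 equals the square of the Pearson correlation, so the two figures should not be compared directly.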
I've attached representations of the data for context. My questions:
- Am I using an appropriate algorithm? (based on the images attached)
- If linear regression is the most appropriate algorithm, how can I improve the results?
This study is part of a challenge that specifically asks for wine quality to be predicted, so I believe a stronger correlation should be possible.
Above I have attached a correlation heat map and two scatter plots of quality against other variables.