How can I improve my sklearn linear regression?

Question

I'm working on a regression problem where I am trying to predict wine quality from the wine's other attributes. (The quality score is the median of the ratings given by 3 wine-tasting experts, each rating the wine out of 10.)

My problem: After fitting a linear regression with sklearn, the coefficient of determination (R²) with alcohol as the only predictor was only 0.2. To improve this:

I have tried multiple linear regression with several other variables (volatile acidity, density, etc.), but the best score I can get is 0.27.
I have tried standardising the data and removing outliers. The steps I took were: a) taking the natural log, b) computing the z-score, c) removing points more than 1.5*IQR beyond the quartiles (Tukey's fences).
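
For reference, the steps above can be sketched roughly as follows. This is a minimal sketch, not my exact code: the data is synthetic stand-in data (not the wine dataset), and the helper names `tukey_mask` and `preprocess` are hypothetical names made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def tukey_mask(x, k=1.5):
    """Return True where x lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x >= q1 - k * iqr) & (x <= q3 + k * iqr)


def preprocess(x):
    """Log-transform and z-score one positive-valued feature, and flag Tukey inliers."""
    logged = np.log(x)                            # a) natural log (requires x > 0)
    z = (logged - logged.mean()) / logged.std()   # b) z-score
    keep = tukey_mask(z)                          # c) Tukey's fences
    return z, keep


# Synthetic stand-in data (an assumption for illustration, NOT the wine dataset).
rng = np.random.default_rng(0)
alcohol = rng.lognormal(mean=2.3, sigma=0.1, size=500)
quality = np.clip(np.round(0.5 * alcohol + rng.normal(0, 1.0, size=500)), 0, 10)

z, keep = preprocess(alcohol)
X = z[keep].reshape(-1, 1)   # sklearn expects a 2-D feature matrix
y = quality[keep]

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)       # coefficient of determination (R²) on the training data
print(f"kept {keep.sum()} of {keep.size} rows, R^2 = {r2:.3f}")
```

Two things worth keeping in mind when reading the score: with a single predictor, R² equals the squared Pearson correlation, and z-scoring a feature is only a linear rescaling, so on its own it does not change the R² of an ordinary least squares fit with an intercept.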

I've attached representations of the data for context.

Based on the attached images, am I using an appropriate algorithm?
If linear regression is the most appropriate algorithm, how can I improve the results?

This study is part of a challenge where they have specifically asked to predict wine quality so I believe a stronger correlation should be possible.

The images attached above are a correlation heat map and two scatter plots of quality against other variables.

Why do you need/want higher correlation on the single features? If the goal is to predict the wine score, can't you use a linear predictive model on all or the best features? — Jon Nordby, Jan 01 '19 at 11:43
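
A minimal sketch of what the comment above suggests, comparing a linear model on all features with one on the "best" features, might look like this. It assumes synthetic stand-in data rather than the wine dataset, and the choices of `k=2`, `f_regression` as the selection score, and 5-fold cross-validation are arbitrary assumptions, not anything stated in the question.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the wine dataset): 6 features, only 2 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=400)

# Linear model on all features.
all_features = make_pipeline(StandardScaler(), LinearRegression())

# Linear model on the k features with the highest univariate F-score.
best_features = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=2),  # k=2 is an arbitrary choice
    LinearRegression(),
)

results = {}
for name, model in [("all features", all_features), ("best 2 features", best_features)]:
    # Cross-validated R², so the score reflects held-out data rather than the training fit.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    results[name] = scores
    print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Putting the scaler and the feature selection inside the pipeline means they are refit on each training fold, which avoids leaking information from the held-out fold into the score.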
"This study is part of a challenge where they have specifically asked to predict wine quality so I believe a stronger correlation should be possible." I don't quite follow your logic here. Often people will believe that predictive accuracy "should" be much higher than what the data allow. [How to know that your machine learning problem is hopeless?](https://stats.stackexchange.com/q/222179/1352) — Stephan Kolassa, Jan 01 '19 at 13:21

0 Answers