
I'm working on a regression problem in which I am trying to predict wine quality from the wine's other attributes. (The quality score is the median of the ratings, out of 10, given to each wine by 3 wine-tasting experts.)

My problem: after fitting a linear regression with sklearn, the coefficient of determination I got using the alcohol variable was only 0.2. To improve this:

  • I have tried multiple linear regression with several other variables (volatile acidity, density, etc.), but the highest correlation I can get is 0.27.
  • I have tried standardising and removing outliers. The steps I took were: a) taking the natural log, b) computing the z-score, c) removing values outside 1.5*IQR (Tukey's fences).
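For reference, here is a minimal sketch of the steps listed above, so it is clear what was computed. It is not my actual code or data: the data frame is synthetic, the column names and the `tukey_mask` helper are hypothetical, and the transformations are applied in the order given (log, z-score, Tukey fences). It also shows that for a single predictor the coefficient of determination is just the squared Pearson correlation, which matters when comparing the 0.2 and 0.27 figures.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): NOT the challenge dataset.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, n),
    "volatile_acidity": rng.lognormal(-1.0, 0.3, n),
    "density": rng.normal(0.996, 0.002, n),
})
noise = rng.normal(0.0, 0.7, n)
raw_quality = 5.5 + 0.4 * (df["alcohol"] - 10.5) - 1.5 * (df["volatile_acidity"] - 0.37) + noise
df["quality"] = np.clip(np.round(raw_quality), 0, 10)

features = ["alcohol", "volatile_acidity", "density"]


def tukey_mask(frame, k=1.5):
    """Return a boolean Series: True for rows inside Tukey's fences on every column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    inside = (frame >= q1 - k * iqr) & (frame <= q3 + k * iqr)
    return inside.all(axis=1)


# a) natural log (all synthetic features are strictly positive)
logged = np.log(df[features])
# b) z-score each column
zscores = (logged - logged.mean()) / logged.std()
# c) drop rows outside 1.5 * IQR on any feature
keep = tukey_mask(zscores)
X = zscores[keep]
y = df.loc[keep, "quality"]

# Simple regression on alcohol only: R^2 equals the squared correlation r^2.
single = LinearRegression().fit(X[["alcohol"]], y)
r2_single = single.score(X[["alcohol"]], y)
r = np.corrcoef(X["alcohol"], y)[0, 1]

# Multiple regression: in-sample R^2 and a 5-fold cross-validated R^2.
multi = LinearRegression().fit(X, y)
r2_multi = multi.score(X, y)
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()

print(f"rows kept: {int(keep.sum())} of {n}")
print(f"alcohol only: R^2 = {r2_single:.3f}, r = {r:.3f}")
print(f"all features: R^2 = {r2_multi:.3f}, cross-validated R^2 = {cv_r2:.3f}")
```

Note that a correlation of 0.27 would correspond to an R^2 of only about 0.07, so it is worth checking which of the two quantities each reported number actually is.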

I've attached representations of the data for context.

  1. Am I using an appropriate algorithm? (based on the images attached)
  2. If linear regression is the most appropriate algorithm, how can I improve the results?

This study is part of a challenge where they have specifically asked to predict wine quality so I believe a stronger correlation should be possible.

correlation heat map

variable correlations against quality part 1

variable correlations against quality part 2

Above I have attached three images: one is the correlation heat map and the other two are scatter plots of quality against the other variables.

Jonny
  • Why do you need/want higher correlation on the single features? If the goal is to predict the wine score, can't you use a linear predictive model on all or the best features? – Jon Nordby Jan 01 '19 at 11:43
  • "This study is part of a challenge where they have specifically asked to predict wine quality so I believe a stronger correlation should be possible." I don't quite follow your logic here. Often people will believe that predictive accuracy "should" be much higher than what the data allow. [How to know that your machine learning problem is hopeless?](https://stats.stackexchange.com/q/222179/1352) – Stephan Kolassa Jan 01 '19 at 13:21
