http://archive.ics.uci.edu/ml/datasets/Wine+Quality I'm using this data set for a regression analysis project. Right now, I'm using free sulfur dioxide as my response, but sulfur dioxide is often added by the winemaker during the wine-making process, so I'm not sure this is a good response variable for regression analysis, as it doesn't seem to be random (I think that's necessary for a response right?). I then looked at density as another possible response variable, but the density of every observation is somewhere between 0.99 and 1. If I'm not mistaken, that violates the unbounded requirement? So I'm not sure whether I should keep going with free sulfur dioxide as my response variable or find a new one. I'm simply trying to find relationships between the response and the predictors (the other variables in the data set), so as long as the response is valid, everything else is a go. If anyone could give me some tips, I would really appreciate it.

mistersunnyd
  • The link states very clearly that the response is wine quality (an ordinal variable with 10 levels). Any reasons why you don't want to just use that as a response? – DeltaIV Mar 02 '17 at 18:22
  • Wine quality is categorical (0 to 10 rating). I need an unbounded and continuous response variable. – mistersunnyd Mar 02 '17 at 18:37
  • So what? Are you implying some regression method? If so, you must write it in your question. Otherwise there's nothing barring you from performing regression on a discrete numerical variable. – Firebug Mar 02 '17 at 18:57
  • I am allowed to use categorical predictors, but the response must be continuous, naturally occurring, and preferably unbounded. – mistersunnyd Mar 02 '17 at 20:25

1 Answer

I'm using this data set for a regression analysis project.

Was this data set picked by you or by your professor? As noted in the comments, the quality column is not a good choice for regression, since it is not a continuous number. Can you choose another data set? For example, Boston Housing is widely used in regression demos. You can also pick a data set by task in the UCI repository; here are all regression tasks.


From the comments I see you are locked into this data set, so the question is really: how can I pick a column to run a regression on that makes sense from a "business perspective" and satisfies the regression assumptions?

This is less likely to happen in the real world, because there the problem comes first. For example, suppose you run a chemical plant and want to supply wineries with the sulfur dioxide they add. Then running a regression on it is perfectly reasonable, since your interest is not the wine quality but how much sulfur dioxide you need to produce, given the other conditions.

In addition, I think you have some confusion about the linear regression assumptions, where you mentioned

It doesn't seem to be random (I think that's necessary for a response right?)

What do you mean by saying the response variable needs to be random? Maybe you feel that "sulfur dioxide to be added" is determined by some other factors, and so is not random? Think about the Boston housing case: is the housing price determined by other factors? As long as it is not deterministic (e.g., every observation has the same value), it is a random variable and can be treated as a response variable.

We can define "sulfur dioxide to be added" as a random variable that depends on other random variables, the regressors, which is exactly the point of regression. In addition, the regression assumptions do not require the response variable $Y$ to follow any particular distribution; only the residuals are assumed to be normally distributed.

See this post for details: Why linear regression has assumption on residual but generalized linear model has assumptions on response?
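
The difference between an assumption on the response and an assumption on the residuals can be checked with a small simulation. This is only a sketch on made-up data, not the wine data: the predictor `x`, the coefficients 1 and 3, and the helper `skewness` are all assumptions chosen for illustration. The response comes out clearly skewed, while the residuals from the least-squares fit should look roughly symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5000
# Hypothetical, strongly right-skewed predictor (synthetic, not from the wine data).
x = rng.exponential(scale=2.0, size=n)
# Linear model with normal errors: y = 1 + 3*x + e, with e ~ N(0, 1).
y = 1.0 + 3.0 * x + rng.normal(0.0, 1.0, size=n)

# Ordinary least squares using a design matrix with an intercept column.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta


def skewness(v):
    """Sample skewness: mean of the cubed standardized values."""
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 3))


print("estimated coefficients:", beta)
print("skewness of y:", skewness(y))
print("skewness of residuals:", skewness(residuals))
```

Here the response inherits its skew from the predictor, yet the model is correctly specified, because the normality assumption concerns what is left over after the fit rather than the marginal distribution of $Y$.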

Haitao Du
  • This data set was picked by me, but I've already locked in this data set as my choice and have to use it. I'm certainly not using quality as a response since that is categorical, but do you think there's a good variable in the data set that I can use as a response? – mistersunnyd Mar 02 '17 at 20:26
  • @mistersunnyd Your question is much clearer now. I will edit my answer to address it. – Haitao Du Mar 02 '17 at 20:43
  • Thank you for your answer. By random, I mean something like naturally occurring. Since winemakers add sulfur dioxide to their wines (not sure whether it's a fixed amount or not), I felt that would make sulfur dioxide a poor choice as a response. After reading your answer, I decided to go with chlorides as my response, since chlorides in wine are caused by the salinity of the grapes and are not something the winemaker controls. – mistersunnyd Mar 02 '17 at 21:50