1

Suppose you are given a data set as a table. The rows are the observations (e.g. measurements of different people), the columns are the features (height, weight, ...), and one column is the one you want to predict using regression. From a statistical perspective, what are the random variables here? Is each column a random variable and each value in that column a realization? What needs to be i.i.d.?

Sycorax
user3680510
  • Why the downvote? This is a valid and important question. – Fabian Werner Aug 31 '18 at 10:26
  • @FabianWerner Can you answer the question? – user3680510 Sep 03 '18 at 08:11
  • Possible duplicate of [What is the difference between variable and random variable?](https://stats.stackexchange.com/questions/139989/what-is-the-difference-between-variable-and-random-variable) or https://stats.stackexchange.com/questions/246047/independent-variable-random-variable – Sycorax Sep 04 '18 at 04:53

2 Answers

2

Regression models condition on the values of your explanatory variables and model the response variable conditional on them. Hence, the only thing that is treated as a random variable in this analysis is the column containing the response variable (i.e., the variable you are trying to predict). In practice we usually implement this by writing each response variable as a function of the explanatory variables and a set of model parameters (linear in the parameters), plus an "error term" that represents the deviation of the response variable from its conditional expected value.

Writing matters in this standard way, the multiple linear regression model with $m$ explanatory variables has the following form:

$$Y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_m x_{i,m} + \varepsilon_i \quad \quad \quad \varepsilon_i | \mathbf{x} \sim \text{IID N}(0,\sigma^2).$$

Each error term is defined as $\varepsilon_i = Y_i - \mathbb{E}(Y_i|\mathbf{x})$ and the object $\mathbf{x}$ is the design matrix, containing all the $x$ values in your analysis. In this model the error terms $\varepsilon_i$ are IID variables conditional on the explanatory variables, which means that the deviation of the response variable from its conditional expected value is considered to be IID across the observations.
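To make the conditioning concrete, here is a minimal simulation sketch in Python with NumPy. The coefficient values, `sigma`, the sample sizes and the helper names `simulate_response` and `fit_ols` are all hypothetical choices for illustration, not anything taken from the question. The design matrix is generated once and then held fixed; only the response is redrawn in each replication.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 200, 2                       # observations, explanatory variables
beta = np.array([1.0, 2.0, -0.5])   # hypothetical beta_0, beta_1, beta_2
sigma = 0.3                         # hypothetical error standard deviation

# Design matrix: generated once, then treated as fixed (we condition on it).
x = rng.uniform(0.0, 10.0, size=(n, m))
X = np.column_stack([np.ones(n), x])   # prepend a column of ones for beta_0


def simulate_response(rng):
    """Draw one realization of Y given the fixed design matrix."""
    eps = rng.normal(0.0, sigma, size=n)   # IID N(0, sigma^2) error terms
    return X @ beta + eps


def fit_ols(y):
    """Ordinary least squares estimate of the coefficients."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Only Y changes between replications; the x values stay the same.
estimates = np.array([fit_ols(simulate_response(rng)) for _ in range(500)])
mean_estimate = estimates.mean(axis=0)
print(mean_estimate)   # should be close to beta
```

Across replications the estimates vary only because the error terms vary, which is the sense in which the response column is the random variable here.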

Ben
0

I am tempted to say yes, but unfortunately I only have my mobile phone at hand and it is a little clumsy to write answers. Short version: each row consists of a vector of data columns $x_i$ and a true answer $y_i$. We assume they are realizations $x_i = X_i(\omega)$, $y_i=Y_i(\omega)$ of random variables $X_i, Y_i$, and we want the pairs $(X_i, Y_i)$ to be iid. The $\omega$ referred to above is the ‘true’ one, and we try to learn about it empirically from the data. I imagine the probability space $\Omega$ to be the set of all possible states of the universe and $\omega$ the one in which we live. Then the $X_i$ are projections onto a certain part of the total state of the universe, and $Y_i$ is almost a deterministic function (if it knew the total state of the universe, it would be exactly a deterministic function). Because the $X_i$ hide some part of the state from us, we have to formulate it as $Y_i=f\circ X_i + \text{error}_i$, where the error compensates for the fact that we have a deterministic function from which part of the state is hidden. This last part is only there because it helps me remember the structure; there is no formal mathematical intention behind it. If it confuses you then just forget it ;-)
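A toy sketch of this picture in Python with NumPy, under loudly stated assumptions: the "state" is reduced to one observed and one hidden number per row, and the function `g` and the distributions are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Each row i is an independent draw of the "state":
# an observed part X_i and a part that stays hidden from us.
x_obs = rng.normal(size=n)
hidden = rng.normal(size=n)


def g(x, h):
    """Hypothetical deterministic function of the full state."""
    return 2.0 * x + h


def f(x):
    """Best guess from the observed part alone: E[Y | X = x] = 2x here,
    because the hidden part has mean zero and is independent of x."""
    return 2.0 * x


y = g(x_obs, hidden)

# Knowing the full state, Y is reproduced with no error at all.
full_state_error = y - g(x_obs, hidden)

# Knowing only X, what is left over is exactly the hidden part.
error = y - f(x_obs)
print(full_state_error.std(), error.std())
```

In this sketch the rows are iid pairs $(X_i, Y_i)$, and the error term is nothing more than the contribution of the hidden part of the state.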

Fabian Werner