predict y variable that is only available as a sum by group

Question

I would like to fit a model to predict a y variable; however, in the training data the y variable is only available for groups of items (the x variables are available for each individual item).

Here's an example of such a dataset created in R:

library(dplyr)
data("marketing", package = "datarium")
data_original <- tibble(marketing) %>% mutate(id = row_number()) %>% select(id, everything())
set.seed(10)
# Assign each row at random to one of 50 groups
data_original$group_id <- sample.int(50, size = nrow(data_original), replace = TRUE)

data_for_prediction <- data_original %>% 
  group_by(group_id) %>% 
  mutate(group_n = n(), 
         group_sales = sum(sales)) %>% ungroup() %>%
  select(-sales)
data_for_prediction

The data looks like this:

> data_for_prediction
# A tibble: 200 x 7
      id youtube facebook newspaper group_id group_n group_sales
   <int>   <dbl>    <dbl>     <dbl>    <int>   <int>       <dbl>
 1     1   276.     45.4       83.0       26       6       141. 
 2     2    53.4    47.2       54.1       16       2        25.4
 3     3    20.6    55.1       83.2       22       6        96.4
 4     4   182.     49.6       70.2       35       3        40.7
 5     5   217.     13.0       70.1        5       3        52.4
 6     6    10.4    58.7       90         12       7       122. 
 7     7    69      39.4       28.2       14       6       114. 
 8     8   144.     23.5       13.9       14       6       114. 
 9     9    10.3     2.52       1.2       31       4        70.3
10    10   240.      3.12      25.4       22       6        96.4
# … with 190 more rows

What is the proper way to estimate the sales for each individual row when only the sum of sales for each group is known?

Do you have some prior information on distribution of $Y$? Can values be negative? Some external sample of individual $Y$ observations? ... Maybe EM-algorithm? Group testing might be an application https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3841014/ — kjetil b halvorsen, Feb 19 '21 at 20:14

score 3 · Answer 1 · answered Feb 19 '21 at 20:50

It depends on how you conceive of the original variables and what your model is.

In the commonest case, the response value $Y_i$ for each item $i$ is considered a random variable whose mean depends on the explanatory variables $x_{1i},\ldots, x_{pi},$ where (a) the responses are independent and (b) have a common variance $\sigma^2.$ That is, there are regression parameters $\beta_0, \beta_1, \ldots, \beta_p$ for which

$$E[Y_i] = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}\tag{*}$$

and $\operatorname{Var}(Y_i) = \sigma^2$ (which may be unknown; the point is that it doesn't vary between items).

Let $g(i)$ be the group in a set $\mathfrak G$ to which item $i$ is assigned. (Limit $\mathfrak G$ to the existing groups so that every group has at least one item in it.)

Suppose, as in the question, the group assignment is independent of all the variables. (This is crucial: in many applications, the individual responses $Y_i$ may be correlated within each group, in which case you need to supply information about that correlation in order to make progress.) What is the "group response"? Often it is a total (as in the question) or average of the responses of all the items in the group. Both are analyzed similarly, so for each $\gamma \in \mathfrak G$ let us set

$$\bar Y_\gamma = \sum_{i\mid g(i) = \gamma} Y_i$$

to be the total response for the items in group $\gamma$ (having $|\gamma|$ items). The basic laws of expectation and variance along with the model $(*)$ imply

$$\begin{aligned} E[\bar Y_\gamma] &= \sum_{i\mid g(i) = \gamma} E[Y_i]\\ &= \sum_{i\mid g(i) = \gamma} \left(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}\right)\\ &= |\gamma| \beta_0 + \beta_1 \bar{x}_{\gamma 1} + \cdots + \beta_p \bar{x}_{\gamma p} \end{aligned}$$

and

$$\begin{aligned} \operatorname{Var}[\bar Y_\gamma] &= \sum_{i\mid g(i) = \gamma} \operatorname{Var}[Y_i]\\ &= \sum_{i\mid g(i) = \gamma} \sigma^2\\ &=\sigma^2|\gamma| \end{aligned}$$

where

$$\bar x_{\gamma j} = \sum_{i\mid g(i) = \gamma} x_{ji}$$

are the sums of the explanatory variable values in the group. Moreover, assuming there is no overlap among separate groups, the independence of the $Y_i$ implies the independence of the $\bar Y_\gamma.$
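The two moment formulas can be checked numerically. The following is only a sketch in base R: the single group of five items, the one explanatory variable, the parameter values, and the normal errors are all assumptions made up for illustration, not taken from the question's data.

```r
# Hypothetical check: one group of 5 items, one explanatory variable.
set.seed(1)
beta0 <- 3; beta1 <- 0.5; sigma <- 2   # assumed parameter values
x <- c(1, 4, 2, 8, 5)                  # assumed x values of the items in the group
n_items <- length(x)                   # this is |gamma|

# Simulate the group total many times under model (*), with normal errors.
n_sim <- 1e5
totals <- replicate(n_sim, sum(beta0 + beta1 * x + rnorm(n_items, sd = sigma)))

# Theoretical moments of the group total.
expected_mean <- n_items * beta0 + beta1 * sum(x)  # |gamma| * beta0 + beta1 * (sum of x)
expected_var  <- sigma^2 * n_items                 # sigma^2 * |gamma|

c(simulated_mean = mean(totals), expected_mean = expected_mean)
c(simulated_var  = var(totals),  expected_var  = expected_var)
```

The simulated mean and variance should land close to the theoretical values, up to Monte Carlo error.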

With the group size $|\gamma|$ now acting as an additional explanatory variable whose coefficient is $\beta_0$ (so there is no separate constant term), this model has the same form as $(*)$ except for one twist: the group responses $\bar Y_{\gamma}$ potentially have different variances, proportional to the sizes of their groups. This makes it a classic weighted linear regression model with weights proportional to $1/|\gamma|,$ the reciprocals of those variances. (When all the group sizes are the same, this again is an ordinary regression.)
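As a rough illustration, the group-level fit might look like this in base R. The data are simulated stand-ins (the column names `x1`, `x2`, the true coefficients, and the error standard deviation are assumptions, not the question's `marketing` data). Because the variance of a group total is $\sigma^2|\gamma|$, the sketch passes `weights = 1 / size` to `lm()`, and it enters the group size as a regressor with no separate intercept, so that its coefficient estimates $\beta_0.$

```r
set.seed(10)
n <- 200

# Simulated item-level data (hypothetical stand-in for the real items).
items <- data.frame(
  x1       = runif(n, 0, 300),
  x2       = runif(n, 0, 50),
  group_id = sample.int(50, size = n, replace = TRUE),
  size     = 1                      # summing this column gives the group size
)
beta  <- c(3.5, 0.045, 0.19)        # assumed true beta0, beta1, beta2
sigma <- 2                          # assumed true error standard deviation
items$y <- beta[1] + beta[2] * items$x1 + beta[3] * items$x2 +
  rnorm(n, sd = sigma)

# Only group totals are treated as observed: sum y, the x's, and the sizes.
groups <- aggregate(cbind(y, x1, x2, size) ~ group_id, data = items, FUN = sum)

# Weighted least squares on the totals:
#   E[total] = beta0 * size + beta1 * sum(x1) + beta2 * sum(x2),
#   Var[total] = sigma^2 * size, hence weights 1 / size and no intercept.
fit <- lm(y ~ 0 + size + x1 + x2, data = groups, weights = 1 / size)

coef(fit)            # coefficient on `size` estimates beta0
summary(fit)$sigma   # estimates the item-level sigma
```

With these weights, the residual standard error reported by `summary()` is on the scale of a single item, which is what later prediction intervals for individual items need.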

Using standard algorithms, you can fit (that is, estimate) the expected group responses $E[\bar Y_\gamma]$ as well as the $\beta_j$ and $\sigma^2$ itself, provided you retain information about the group sizes $|\gamma|.$ Then, armed with an estimate of $\sigma^2,$ you can predict (a) individual $Y_i$ for $i\in\gamma;$ (b) individual $Y,$ independent of the data; or (c) responses for other groups (not in the data), provided you have values for their explanatory variables. "Prediction" in this sense involves erecting prediction limits around the estimated values: see the thread "What is the difference between prediction and estimation?"
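A minimal sketch of case (b), predicting new individual items, could look as follows in base R. Everything here is assumed for illustration: the simulated data, the single explanatory variable `x1`, and the parameter values. An individual item is treated as a "group" of size 1, so its prediction variance weight is 1. The sketch does not condition on an observed group total, so it does not cover the refinement that case (a) would allow.

```r
set.seed(42)
n <- 300

# Simulated items (hypothetical); in practice the item-level y is not observed.
items <- data.frame(
  x1       = runif(n, 0, 300),
  group_id = sample.int(60, size = n, replace = TRUE),
  size     = 1
)
items$y <- 3.5 + 0.045 * items$x1 + rnorm(n, sd = 2)

# Group totals of y, x1, and size, then the weighted fit on the totals.
groups <- aggregate(cbind(y, x1, size) ~ group_id, data = items, FUN = sum)
fit <- lm(y ~ 0 + size + x1, data = groups, weights = 1 / size)

# New individual items: size = 1, so E[Y] = beta0 + beta1 * x1 and Var = sigma^2.
new_items <- data.frame(size = 1, x1 = c(50, 150, 250))
pred <- predict(fit, newdata = new_items, interval = "prediction",
                level = 0.95, weights = 1)
pred   # columns: fit (point estimate), lwr and upr (prediction limits)
```

Passing `weights = 1` explicitly tells `predict()` which variance to use for the new observations. For a new group of known size (case c), the same call with `size` set to that size, the summed explanatory values, and `weights = 1 / size` would follow the same logic.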

"in which case you need to supply information about that correlation in order to make progress" Could you clarify how this correlation should be considered / included in the model? Thanks, — jessexknight, Oct 29 '21 at 21:25
@jesse An example would be in time series analysis where, for instance, you might posit that the correlation between variables $Y_t$ and $Y_s$ is some given function of the *lag* $|t-s|,$ perhaps with some unknown parameters also to be estimated. A simple example of this (the AR-1 model) is that this correlation equals $\rho^{|s-t|}$ for some unknown $\rho \in [-1,1].$ I have just posted a picture of a general time series correlation structure at https://stats.stackexchange.com/a/550314/919. — whuber, Oct 29 '21 at 22:06
