I would like to fit a model to predict a y variable. In the training data, however, y is only available as a total for each group of items, whereas the x variables are available for each individual item.
Here's an example of such a dataset created in R:
library(dplyr)
data("marketing", package = "datarium")
# add an item id as the first column
data_original <- as_tibble(marketing) %>%
  mutate(id = row_number()) %>%
  select(id, everything())
set.seed(10)
# randomly assign each item to one of 50 groups
data_original$group_id <- sample.int(50, size = nrow(data_original), replace = TRUE)
# keep only the group-level total of sales and drop the item-level sales
data_for_prediction <- data_original %>%
  group_by(group_id) %>%
  mutate(group_n = n(),
         group_sales = sum(sales)) %>%
  ungroup() %>%
  select(-sales)
data_for_prediction
The data looks like this:
> data_for_prediction
# A tibble: 200 x 7
      id youtube facebook newspaper group_id group_n group_sales
   <int>   <dbl>    <dbl>     <dbl>    <int>   <int>       <dbl>
 1     1   276.     45.4       83.0       26       6       141.
 2     2    53.4    47.2       54.1       16       2        25.4
 3     3    20.6    55.1       83.2       22       6        96.4
 4     4   182.     49.6       70.2       35       3        40.7
 5     5   217.     13.0       70.1        5       3        52.4
 6     6    10.4    58.7       90         12       7       122.
 7     7    69      39.4       28.2       14       6       114.
 8     8   144.     23.5       13.9       14       6       114.
 9     9    10.3     2.52       1.2       31       4        70.3
10    10   240.      3.12      25.4       22       6        96.4
# … with 190 more rows
What is the proper way to estimate the sales for each individual row when only the sum of sales for each group is known?
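To make the question concrete, here is a minimal sketch of one naive approach, not a claim that it is the proper one. It assumes sales are linear in the three budgets with independent errors, so that summing the item-level equation over a group gives a regression of group_sales on the summed x variables, with group_n taking the place of the intercept. The data below is simulated with hypothetical coefficients (it is not the marketing data), and the names items, groups, fit and sales_hat are made up for this illustration.

```r
# Simulated stand-in with the same column names (hypothetical coefficients)
set.seed(10)
n <- 200
items <- data.frame(
  youtube   = runif(n, 0, 300),
  facebook  = runif(n, 0, 60),
  newspaper = runif(n, 0, 100),
  group_id  = sample.int(50, size = n, replace = TRUE)
)
# Item-level sales exist in the simulation but are treated as unobserved
true_sales <- 3 + 0.05 * items$youtube + 0.2 * items$facebook + rnorm(n, sd = 1)

# Summing sales_i = b0 + b1 * x_i + e_i over a group gives
# group_sales = b0 * group_n + b1 * sum(x_i) + sum(e_i)
groups <- aggregate(cbind(youtube, facebook, newspaper) ~ group_id,
                    data = items, FUN = sum)
key <- as.character(groups$group_id)
groups$group_n     <- as.vector(table(items$group_id)[key])
groups$group_sales <- as.vector(tapply(true_sales, items$group_id, sum)[key])

# No separate intercept: the coefficient on group_n estimates b0.
# The summed error has variance proportional to group_n, hence the weights.
fit <- lm(group_sales ~ 0 + group_n + youtube + facebook + newspaper,
          data = groups, weights = 1 / group_n)

# An individual item is treated as a group of size 1
items$sales_hat <- predict(fit, newdata = transform(items, group_n = 1))
```

This only works to the extent that the linearity assumption holds; a nonlinear model could not be aggregated in this way.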