Is there a way to build a regression model for continuous output using aggregate data instead of individual data points when all input variables are categorical?
I have a moderately large dataset (a few million rows). All my predictor variables are categorical or binary. I have two outcome variables: one binary and one continuous. For the binary outcome I am using logistic regression. My R code is as follows:
rawdata <- readRDS(file = "mvt_week2.rds")
system.time(m <- glm(y ~ F1 + F2 + F3, data = rawdata, family=binomial))
# 900 seconds
If I aggregate the data first and then build the model, it is almost instant.
library(dplyr)
sumdata <- rawdata %>% group_by(F1,F2,F3) %>% summarize(y1 = sum(y),Visits = n())
system.time(agg_m <- glm(y1/Visits ~ F1+ F2 + F3, data = sumdata, family=binomial, weights = Visits))
# 0.05 seconds
The coefficient estimates and standard errors are exactly the same, and the aggregated fit takes far less time, memory and computation power.
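To make the comparison concrete, here is a small self-contained check on simulated data. The data frame `sim`, its factor levels and the true effect sizes are all made up for illustration; only the column names F1, F2, F3, y, y1 and Visits follow the code above. It is a sketch in base R (so it does not need dplyr), not a benchmark.

```r
# Simulated stand-in for rawdata (hypothetical values, for illustration only)
set.seed(1)
N <- 5000
sim <- data.frame(
  F1 = factor(sample(c("a", "b", "c"), N, replace = TRUE)),
  F2 = factor(sample(c("x", "y"), N, replace = TRUE)),
  F3 = factor(sample(c("u", "v"), N, replace = TRUE))
)
sim$y <- rbinom(N, 1, plogis(-0.5 + 0.4 * (sim$F1 == "b") - 0.3 * (sim$F2 == "y")))

# Row-level logistic regression
m_raw <- glm(y ~ F1 + F2 + F3, data = sim, family = binomial)

# Aggregate to one row per cell: number of successes and number of rows
sim$Visits <- 1
agg <- aggregate(cbind(y, Visits) ~ F1 + F2 + F3, data = sim, FUN = sum)
names(agg)[names(agg) == "y"] <- "y1"

# Same model on cell proportions, weighted by cell size
m_agg <- glm(y1 / Visits ~ F1 + F2 + F3, data = agg,
             family = binomial, weights = Visits)

# Coefficients and standard errors should agree up to numerical tolerance
coef_check <- all.equal(coef(m_raw), coef(m_agg), tolerance = 1e-5)
se_check <- all.equal(sqrt(diag(vcov(m_raw))), sqrt(diag(vcov(m_agg))),
                      tolerance = 1e-5)
print(coef_check)
print(se_check)
```

The residual deviance is not expected to match between the two fits, because the saturated model is different for row-level and aggregated data.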
This works for a logistic model, but is there a way to make it work for a continuous output variable? If I calculate the mean and standard deviation of the outcome within each group, can I feed those into a model?
A good explanation is given in "Standard errors in weighted least squares on aggregated data" for the case where the output variable takes a finite set of values, but is there an approach that works for continuous values? Please note that I do have the raw data, so I can compute any statistic on it when aggregating.
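For what it is worth, one possible approach for an ordinary least squares model with constant error variance is sketched below. It assumes that the per-cell count, sum and sum of squares of the outcome are kept when aggregating. Weighted least squares on the cell means (weights = cell counts) reproduces the row-level coefficients. The standard errors then need the residual variance rebuilt from the between-cell and within-cell sums of squares. The outcome name `z`, the helper columns `cnt` and `z2`, and the simulated data are hypothetical and only serve the illustration.

```r
# Simulated stand-in for rawdata with a continuous outcome z (hypothetical)
set.seed(42)
N <- 20000
sim <- data.frame(
  F1 = factor(sample(c("a", "b", "c"), N, replace = TRUE)),
  F2 = factor(sample(c("x", "y"), N, replace = TRUE)),
  F3 = factor(sample(c("u", "v"), N, replace = TRUE))
)
sim$z <- 1 + 0.5 * (sim$F1 == "b") - 0.8 * (sim$F2 == "y") + rnorm(N)

# Row-level linear model, used here only as the reference
fit_raw <- lm(z ~ F1 + F2 + F3, data = sim)

# Per-cell sufficient statistics: count, sum of z, sum of z^2
sim$cnt <- 1
sim$z2 <- sim$z^2
agg <- aggregate(cbind(cnt, z, z2) ~ F1 + F2 + F3, data = sim, FUN = sum)
agg$zbar <- agg$z / agg$cnt

# Weighted least squares on the cell means
fit_agg <- lm(zbar ~ F1 + F2 + F3, data = agg, weights = cnt)

# Rebuild the row-level residual sum of squares:
#   between cells: cnt * (cell mean - fitted value)^2
#   within cells:  sum of squared deviations from the cell mean
rss_between <- sum(agg$cnt * residuals(fit_agg)^2)
rss_within <- sum(agg$z2 - agg$cnt * agg$zbar^2)
sigma2 <- (rss_between + rss_within) / (sum(agg$cnt) - fit_agg$rank)

# Standard errors: sigma^2 times the unscaled (X'WX)^{-1} from the weighted fit
se_agg <- sqrt(diag(summary(fit_agg)$cov.unscaled) * sigma2)
se_raw <- sqrt(diag(vcov(fit_raw)))

coef_check <- all.equal(coef(fit_raw), coef(fit_agg), tolerance = 1e-6)
se_check <- all.equal(se_raw, se_agg, tolerance = 1e-6)
print(coef_check)
print(se_check)
```

The standard errors printed by `summary(fit_agg)` itself would not match the row-level ones, since that fit only sees the variation between cell means. The correction above relies on the usual homoscedastic linear model assumptions. Computing the within-cell sum of squares as `z2 - cnt * zbar^2` is convenient for SQL-style aggregation but can lose precision when the outcome has a large mean relative to its spread.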