
Is there a way to build a regression model for continuous output using aggregate data instead of individual data points when all input variables are categorical?

I have a moderately large dataset (a few million rows). All my predictor variables are categorical or binary. I have two outcome variables: one binary and one continuous. For the binary outcome I am using logistic regression. My R code is as follows:

rawdata <- readRDS(file = "mvt_week2.rds")

system.time(m <- glm(y ~ F1 + F2 + F3, data = rawdata, family=binomial))
# 900 seconds

If I aggregate the data first and then build the model, it's almost instant.

library(dplyr)
sumdata <- rawdata %>% group_by(F1,F2,F3) %>% summarize(y1 = sum(y),Visits = n())
system.time(agg_m <- glm(y1/Visits ~ F1+ F2 + F3, data = sumdata, family=binomial, weights = Visits))
# 0.05 seconds

The model output is exactly the same and it saves a lot of time and requires much less memory and computation power.
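A minimal sketch of this comparison on simulated data (not the dataset above) is given here. The data frame `sim`, its factor levels, and the true coefficients are made up for illustration, and base R `aggregate` stands in for the dplyr pipeline so that the block has no package dependencies. Under these assumptions the coefficient estimates and their standard errors should agree up to the convergence tolerance of `glm`, while the residual degrees of freedom and the deviance differ, because the aggregated fit sees one row per cell rather than one row per visit.

```r
set.seed(42)
n <- 20000

# Simulated categorical predictors (hypothetical levels)
sim <- data.frame(
  F1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  F2 = factor(sample(c("p", "q"), n, replace = TRUE)),
  F3 = factor(sample(c("u", "v", "w"), n, replace = TRUE))
)

# Binary outcome generated from an additive logistic model (made-up effects)
eta <- -0.5 + 0.4 * (sim$F1 == "b") - 0.6 * (sim$F2 == "q") + 0.3 * (sim$F3 == "w")
sim$y <- rbinom(n, size = 1, prob = plogis(eta))

# Fit on the individual rows
raw_m <- glm(y ~ F1 + F2 + F3, data = sim, family = binomial)

# One row per cell: number of successes (y1) and number of rows (Visits)
cells <- aggregate(y ~ F1 + F2 + F3, data = sim, FUN = sum)
names(cells)[names(cells) == "y"] <- "y1"
cells$Visits <- aggregate(y ~ F1 + F2 + F3, data = sim, FUN = length)$y

# Fit on the aggregated rows: proportion as response, cell size as weight
agg_m <- glm(y1 / Visits ~ F1 + F2 + F3, data = cells,
             family = binomial, weights = Visits)

# Coefficients and standard errors agree up to numerical tolerance
print(all.equal(coef(raw_m), coef(agg_m), tolerance = 1e-6))
print(all.equal(sqrt(diag(vcov(raw_m))), sqrt(diag(vcov(agg_m))), tolerance = 1e-6))

# Residual degrees of freedom and deviance are not comparable across the two fits
print(c(raw = df.residual(raw_m), aggregated = df.residual(agg_m)))
print(c(raw = deviance(raw_m), aggregated = deviance(agg_m)))
```

The equivalent two-column response `cbind(y1, Visits - y1)` could be used in place of `y1 / Visits` with `weights = Visits`.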

This works for a logistic model, but is there a way to make it work for a continuous output variable? If I calculate the mean and standard deviation within each group, can I feed those into a model?

A good explanation is given in "Standard errors in weighted least squares on aggregated data" for the case where the output variable takes a finite set of values, but is there an approach that works for continuous values? Please note that I do have the raw data, so I can compute any statistics on it when aggregating.

Rohit Das
  • 1
    I am having difficulty believing you get "exactly the same" output. At a minimum, the output had better report that it is using fewer data points! (If not, then you're not aggregating--you're just changing the way in which the data are being input to `glm`.) As a result, *all* measures of spread and variation, as well as associated tests, ought to change too. The [Ecological Fallacy](https://en.wikipedia.org/wiki/Ecological_fallacy) is the mistake of supposing that regression based on aggregated data can be interpreted as a regression on the original data. – whuber Aug 21 '15 at 12:55
  • @whuber: I deleted my answer, because according to you one can show (analogous to ANOVA) that the quantities at the last line of my answer are equal. I kindly invite you to show me this, because I think that is impossible. –  Aug 21 '15 at 17:56
  • 2
    @fcoppens I did not claim that. But notice that if a set of $k$ observations $y_i$ is associated with a single $x$ and the fit at $x$ is $\hat y=x\hat\beta$, then $$\sum(y_i-\hat y)^2=\sum(y_i-\bar y)^2+k(\bar y-\hat y)^2$$ shows that minimizing the left hand side is equivalent to minimizing the second term on the right hand side, because the first term does not vary. Therefore the sum of squares of residuals is minimized for the same value of $\hat\beta$ when using the aggregated data $(x,\bar y)$ provided either all the $k$ are the same or you weight the aggregated data by the $k$. – whuber Aug 21 '15 at 18:41
  • 1
    @whuber: ok, I accept this as the proof that both quantities at the end of my deleted answer are equal and therefore I agree with you that my answer was wrong. (+1) –  Aug 21 '15 at 19:59
  • @whuber: I tried to understand the formula in your comment above. Probably I misunderstood, but 'in general' your $k$ will depend on $\hat{y}$ and thus on $\hat{\beta}$. So in general they will not be the same, and in that case your proposal is to use them as weights in a *weighted least squares* (WLS). So you try to estimate $\hat{\beta}$ with WLS and with weights that depend on $\hat{\beta}$. This implies that you will have an iterative procedure, hopefully one that converges, but (see the question) are you sure that it consumes less memory/CPU? –  Aug 22 '15 at 04:53
  • @RohitDas: frequency weighted least squares gives *identical results* when $Y$ achieves only one value per covariate level or if it is included as a factoring level. Precision weighted least squares will give *approximately similar results* using the inverse variance of $Y$ at each covariate level if $Y$ is summarized with the mean. Under some very rare conditions, precision weighted LS may give identical results, but it would be a mathematical anomaly. Is that the answer you were hoping for? – AdamO Dec 18 '17 at 22:55
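
The sum-of-squares identity in the comments can be checked numerically. The following is a minimal sketch on simulated data, not the asker's dataset: the data frame `raw`, its factor levels, and the true effects are made up for illustration, and base R `aggregate` is used to keep the block self-contained. Each cell is summarized by its mean, its count $k$, and its within-cell sum of squares $\sum(y_i-\bar y)^2$. A least squares fit on the cell means weighted by the counts should then reproduce the raw coefficients, and adding the within-cell sums of squares back to the weighted residual sum of squares should recover the raw residual variance and hence the raw standard errors. This assumes an ordinary linear model with constant error variance and more populated cells than coefficients.

```r
set.seed(1)
n <- 5000

# Simulated categorical predictors (hypothetical levels)
raw <- data.frame(
  F1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  F2 = factor(sample(c("p", "q", "r", "s"), n, replace = TRUE)),
  F3 = factor(sample(c("u", "v"), n, replace = TRUE))
)

# Continuous outcome from an additive model with made-up effects
raw$y <- 1 + 0.5 * (raw$F1 == "b") - 0.3 * (raw$F2 == "q") +
  0.8 * (raw$F3 == "v") + rnorm(n)

# Fit on the individual rows
raw_fit <- lm(y ~ F1 + F2 + F3, data = raw)

# Per-cell mean, count, and within-cell sum of squares
agg <- aggregate(
  y ~ F1 + F2 + F3, data = raw,
  FUN = function(v) c(mean = mean(v), n = length(v), ss = sum((v - mean(v))^2))
)
agg <- do.call(data.frame, agg)  # flattens to columns y.mean, y.n, y.ss

# Weighted least squares on the cell means, weighted by cell counts
agg_fit <- lm(y.mean ~ F1 + F2 + F3, data = agg, weights = y.n)

# Coefficients agree with the raw fit
print(all.equal(coef(raw_fit), coef(agg_fit)))

# Rebuild the raw residual sum of squares from the aggregated pieces:
# weighted between-cell residuals plus within-cell sums of squares
p <- length(coef(agg_fit))
rss_between <- sum(agg$y.n * residuals(agg_fit)^2)
rss_total <- rss_between + sum(agg$y.ss)
sigma_raw <- sqrt(rss_total / (sum(agg$y.n) - p))

# Rescale the aggregated standard errors to the raw residual scale
se_rescaled <- sqrt(diag(vcov(agg_fit))) * sigma_raw / summary(agg_fit)$sigma
print(all.equal(se_rescaled, sqrt(diag(vcov(raw_fit)))))
```

Without the rescaling step, the standard errors reported by the aggregated fit are based only on the scatter of the cell means around the fitted values, so they generally differ from those of the raw fit.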

0 Answers