
I am using the model-based recursive partitioning algorithm described in Zeileis, Hothorn and Hornik (2008), available here: https://www.zeileis.org/papers/Zeileis+Hothorn+Hornik-2008.pdf

I am using survey data, which requires post-stratified weights to account for the over- or under-representation of certain subgroups in the sample. The weights are the inverse of each observation's probability of inclusion in the sample, calibrated so that they sum to the population total; each observation's weight is therefore the number of individuals in the population that it represents.
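
In symbols (my notation, just to fix ideas): if $\pi_i$ is the inclusion probability of observation $i$, then $w_i = 1/\pi_i$, calibrated so that $\sum_i w_i = N$, the population size.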

I am using the `lmtree` function in the partykit package in R, which allows a vector of weights to be supplied via the `weights` argument. Is it valid to pass post-stratified sampling weights to this argument?

You can also specify whether the weights are to be treated as ‘case weights’ (via the `caseweights` argument, which `lmtree` passes on to `mob_control`). I am confused by this terminology (even after reading the discussion touching on the topic here: Definition of the terms "node weight" and "case weight"). According to the package documentation, if case weights are used the number of observations is sum(weights), which sounds like what I want. But as noted in the linked discussion, it doesn’t seem to make much difference whether the weights are treated as case weights or not.
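
For concreteness, this is the kind of call I mean (a minimal sketch with a hypothetical data frame `dat` containing an outcome `y`, a regressor `x`, partitioning variables `z1` and `z2`, and the post-stratified weights `wt`):

```r
library(partykit)

## weights treated as case weights (the default, caseweights = TRUE):
## each node's number of observations is reported as sum(wt)
tr_case <- lmtree(y ~ x | z1 + z2, data = dat, weights = wt)

## the same weights treated as proportionality (precision) weights instead;
## caseweights is passed through to mob_control()
tr_prec <- lmtree(y ~ x | z1 + z2, data = dat, weights = wt,
                  caseweights = FALSE)
```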

I also tried normalising the post-stratified weights (dividing each weight by the average weight). As I understand it, normalising the weights keeps the effective sample size correct, so that the standard errors are not artificially small because the weights make the sample appear much larger than it is; it cannot, however, account for other aspects of the survey design (stratification, cluster sampling, etc.). In practice this also didn’t make much difference, except to the number of observations in each node.
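
For reference, the normalisation I mean is just this (again using the hypothetical `dat$wt`):

```r
## rescale so the weights average 1: they then sum to the sample size
## rather than the population total
wt_norm <- dat$wt / mean(dat$wt)
all.equal(sum(wt_norm), nrow(dat))  # TRUE
```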

The algorithm runs and I am able to get a result, but I am not sure whether this is a good idea. I am OK with the standard errors of the model in each node being wrong, because I am using the algorithm only as an intermediate step to detect interactions, which I then test in linear regression models. But I am worried that the splits/partitions will also be spurious as a result. Would the weighting affect the structural change tests used to detect parameter instability (in this case the supLM statistic), and does the algorithm handle survey weights correctly?

Kate

1 Answer


The model will make splits 'in the right places', since it will choose splits to minimise the estimated population residual sum of squares. However, the criterion for stopping tree growth will be incorrect in general, because it is based on the weights being inversely proportional to the error variances for each observation, as in `lm`.

This is, fairly generally, the problem with using statistical learning techniques on survey data: the techniques used to avoid overfitting will not handle survey weights correctly, and as a result the bias:variance tradeoff in model fitting will not be correctly calibrated. The situation will be even worse if the data are a multistage sample rather than just individuals sampled with different probabilities.

If you have external validation data, you could use that to tune model complexity, rather than relying on internal criteria such as the significance testing in partykit or cross-validation.
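
Something along these lines, for example (a rough sketch with hypothetical `train`/`valid` data frames and variable names; `alpha` is the significance level for partykit's parameter instability tests, passed through to `mob_control`):

```r
library(partykit)

fit_and_score <- function(alpha) {
  tr <- lmtree(y ~ x | z1 + z2, data = train, weights = wt, alpha = alpha)
  pred <- predict(tr, newdata = valid, type = "response")
  ## weighted RMSE on the held-out data, using the validation weights
  sqrt(weighted.mean((valid$y - pred)^2, w = valid$wt))
}

alphas <- c(0.001, 0.01, 0.05, 0.1)
scores <- sapply(alphas, fit_and_score)
alphas[which.min(scores)]  # stopping level with the best holdout error
```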

Thomas Lumley
  • Excellent, @ThomasLumley, I was hoping that you would see this question. Would there be an easy-to-integrate fix to also support _survey weights_ in addition to _case weights_ (default, `caseweights = TRUE`) and _precision weights_ (`caseweights = FALSE`)? I just had another look at https://notstatschat.rbind.io/2020/08/04/weights-in-statistics/ but don't know enough about survey models to understand how I would correctly integrate these insights into the `mob()` function underlying `lmtree()` & co. – Achim Zeileis Jan 27 '22 at 07:57
  • I don't know what the fix is. This is related to the complicated question of weights in mixed models: when you are doing any sort of regularisation, it's no longer sufficient to estimate the population objective function and optimise it, because the bias:variance tradeoff would be different in the population and in the sample. I have a student who's about to start a project on this question. Using the Rao-Scott working LR test from `survey::anova.svyglm` would probably be a step in the right direction, though. – Thomas Lumley Jan 27 '22 at 21:40
  • OK, thanks, I see. Adapting the MOB algorithm underlying `lmtree()` and `mob()` might be easier because it does not do any regularized estimation _per se_. It just uses a kind of score test to decide whether it should keep on splitting the data and fitting models on subsets or not. But I'm sure the devil is in the detail. If your student or @Kate or someone else wants to look at this, I would be interested and willing to help. – Achim Zeileis Jan 28 '22 at 02:57
  • The next version of the survey package implements score tests for survey-weighted GLMs. – Thomas Lumley Jan 28 '22 at 03:12
  • Yes, thanks, I also read your blog post: https://notstatschat.rbind.io/2021/09/10/score-tests-surprisingly-annoying/ However, the test used in MOB is a bit non-standard because it is a maximally-selected single-shift test. So I'm not sure how easy it would be to build these on top of the score tests in survey. But it's certainly worth having a look. – Achim Zeileis Jan 28 '22 at 03:29
  • Yes, that's going to be hard. – Thomas Lumley Jan 30 '22 at 05:09