
I am interested in better understanding the M-fluctuation test of the MOB algorithm (Zeileis, Hothorn & Hornik, 2008). I have a question regarding the definition of the empirical fluctuation process, $W_{j}(t)$, and the supLM test statistic that is constructed from it.

Following the original notation from the authors:


To assess the parameter instability, a natural idea is to check whether the scores $\hat{\psi}$ fluctuate randomly around their mean 0 or exhibit systematic deviations from 0 over $Z_j$. These deviations can be captured by the empirical fluctuation process.

$$W_{j}(t) = \hat{J}^{-1/2}n^{-1/2}\sum_{i=1}^{ \lfloor {n\cdot t}\rfloor} \hat{\psi}_{\sigma(Z_{ij})} \;\;\; \;\;\;\left(0\leq t \leq 1\right) $$

where $\sigma(Z_{ij})$ is the ordering permutation which gives the antirank of the observation $Z_{ij}$ in the vector $Z_j = (Z_{1j}, \dots, Z_{nj})^{T}$. Thus, $W_{j}(t)$ is simply the partial sum process of the scores ordered by the variable $Z_j$, scaled by the number of observations $n$ and a suitable estimate $\hat{J}$ of the covariance matrix $\mathrm{COV}(\psi(Y,\hat{\theta}))$, e.g., $\hat{J} = n^{-1}\sum_{i=1}^{n}\psi(Y_i,\hat{\theta})\psi(Y_i,\hat{\theta})^{T}$.
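
To make this concrete, here is a minimal by-hand sketch in R. It is my own illustration rather than partykit code; `scores`, `zj` and the other names are made-up stand-ins for the estimated score contributions (e.g., sandwich::estfun(model)) and one partitioning variable:

## Minimal by-hand sketch of W_j(t); toy objects, not partykit internals
library(strucchange)                           # for root.matrix()
set.seed(1)
n <- 100; k <- 2
scores <- scale(matrix(rnorm(n * k), ncol = k), scale = FALSE)  # n x k scores, columns sum to 0
zj     <- rnorm(n)                             # partitioning variable Z_j

Jhat <- crossprod(scores) / n                  # OPG estimate of COV(psi)
J12  <- root.matrix(chol2inv(chol(Jhat)))      # symmetric J^{-1/2}
Wj   <- function(m) {                          # W_j(t) evaluated at t = m/n
  s <- colSums(scores[order(zj), , drop = FALSE][seq_len(m), , drop = FALSE])
  drop(J12 %*% s) / sqrt(n)                    # a k-vector
}
Wj(n / 2)                                      # W_j(0.5)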


Afterwards, they define the supLM statistic:

$$\lambda_{supLM}(W_{j}) = \max_{i = \underline{i},\dots,\overline{i}}\ \left(\dfrac{i}{n} \cdot\dfrac{n-i}{n}\right)^{-1} \left\| W_{j}\left( \dfrac{i}{n} \right) \right\|_{2}^{2} $$
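
Continuing the toy sketch above, the supLM statistic is then a literal maximisation over the trimmed grid of break indices $i$ (the 10% trimming here is an assumed choice, mirroring control$trim = 0.1 in the replication code below):

## supLM as a maximisation over the trimmed grid (continuing the sketch above)
i_grid <- ceiling(0.1 * n):floor(0.9 * n)
supLM  <- max(sapply(i_grid, function(i)
  sum(Wj(i)^2) / ((i / n) * ((n - i) / n))     # ||W_j(i/n)||_2^2 / ((i/n)(n-i)/n)
))
supLM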

As this definition shows, we evaluate the empirical fluctuation process $W_{j}(t)$ at each admissible $i$ from $\underline{i}$ to $\overline{i}$. Below is how the statistic is computed inside the mob() function from partykit; the full code, taken from the source, is presented at the bottom of the question.

# Empirical fluctuation process sorted by partitioning variable i (here Z1).
proci <- process[oi, , drop = FALSE]
proci
## Score functions sorted by Z1 (already scaled by J^{-1/2} n^{-1/2})
##            [,1]          [,2]
## 1   1.802848e-01  1.547066e-01
## 6   3.297723e-01  3.695268e-01
## 9  -2.760902e-01  2.523460e-01
## 5  -6.849006e-02  3.517448e-01
## 2   1.731908e-01  3.782705e-01
## 4   4.730456e-01  2.559946e-01
## 7   6.125751e-01 -6.261254e-01
## 8   6.992914e-01 -2.938750e-01
## 10  9.295788e-01 -2.423813e-01
## 3  -8.780647e-07  5.876067e-07

# Trimming parameter based on $t$ or minimum number of observations.
from_to <- tt0 >= from & tt0 <= to
from_to
 ## FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

# Computing the statistic for partition variable $i$.
stat[i] <- if(sum(from_to) > 0L) {
    xx <- rowSums(proci^2)   
    xx <- xx[from_to]
    xx
    ##         6         9         5         2         4         7         8 
    ## 0.2452998 0.1399043 0.1284153 0.1730836 0.2893054 0.7672812 0.5753710 

    tt <- tt0[from_to]/nobs
    tt
    ## [1] 0.2 0.3 0.4 0.5 0.6 0.7 0.8
    max(xx/(tt * (1 - tt)))  ## finally, just the maximum of the cumulated
                             ## squared process scaled by (tt * (1 - tt))^{-1}
}

The Question.

However, I am quite confused by the notation used to describe the empirical fluctuation process.

First, regarding the dimensions of $W_{j}(t)$: they are not stated in the original article, but in Merkle, Fan & Zeileis (2014) the authors state that $W_{j}(t)$ is an $N\times k$ matrix (number of individuals $\times$ number of model parameters). However, in the definition above we have $\hat{J}^{-1/2}$, which is $k\times k$, multiplying from the left $\sum_{i=1}^{ \lfloor {n\cdot t}\rfloor} \hat{\psi}_{\sigma(Z_{ij})}$, which is $k\times N$; hence the matrix $W_{j}(t)$ would be $k\times N$.

I can also see from the code that in process <- t(J12 %*% t(process)) there is a transpose (t()) applied to the object process, which contains the score functions ordered by the partitioning variable. So, following what is written in the code, to obtain a conformable matrix product we should write the empirical fluctuation process as follows:

$$W_{j}(t) = \left( \hat{J}^{-1/2}\cdot n^{-1/2} \cdot \left( \sum_{i=1}^{ \lfloor {n\cdot t}\rfloor} \hat{\psi}_{\sigma(Z_{ij})}\right)^{T}\right)^{T} \;\;\; \;\;\;\left(0\leq t \leq 1\right) $$

Additionally, an equivalent (but more parsimonious) expression would be:

$$W_{j}(t) = \left( \sum_{i=1}^{ \lfloor {n\cdot t}\rfloor} \hat{\psi}_{\sigma(Z_{ij})}\right) \cdot \hat{J}^{-1/2}\cdot n^{-1/2} \;\;\; \;\;\;\left(0\leq t \leq 1\right) $$

PS: Here I am assuming that the score functions are $1\times k$ row vectors, and that the stacked score functions form an $N \times k$ matrix.
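
For what it is worth, a quick numerical check with made-up matrices (my own sketch, not partykit code) confirms that the two ways of writing this coincide, because the square root returned by strucchange::root.matrix() is symmetric:

## Toy check: right-multiplying by a symmetric J^{-1/2} equals the
## transposed left-multiplication used in the code.
set.seed(2)
P   <- matrix(rnorm(10 * 2), ncol = 2)              # stacked 1 x k score rows (n x k)
A   <- crossprod(matrix(rnorm(4), 2, 2))            # a symmetric positive definite k x k matrix
J12 <- strucchange::root.matrix(chol2inv(chol(A)))  # symmetric "J^{-1/2}"
all.equal(t(J12 %*% t(P)), P %*% J12)               # TRUE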


Replication code

Here are some functions taken from the original source code of partykit. In particular, the most relevant is the function named mob_grow_fluctests(), which computes the stability tests.

#https://github.com/cran/partykit/blob/ec6f3b2c98ca126b6de86fe654b2cc349b551ea9/R/modelparty.R#L209
library(sandwich)
library(strucchange) ## for root.matrix()
library(partykit)    ## for mob_beta_suplm (in order to compute 
                     ## p-values)

mob_grow_fluctests <- function(estfun, z, weights, obj = NULL, 
                               cluster = NULL)
{  
  ## set up return values
  m <- NCOL(z)
  pval <- rep.int(NA_real_, m)
  stat <- rep.int(0, m)
  ifac <- rep.int(FALSE, m)
  ## variables to test
  mtest <- if(m <= control$mtry) 1L:m else sort(sample(1L:m, 
                  control$mtry))
  
  ## estimating functions (dropping zero weight observations)
  process <- as.matrix(estfun)
  ww0 <- (weights > 0)
  process <- process[ww0, , drop = FALSE]
  z <- z[ww0, , drop = FALSE]
  k <- NCOL(process)
  n <- NROW(process)
  nobs <- if(control$caseweights && any(weights != 1L)) 
                    sum(weights) else n
  
  ## scale process
  process <- process/sqrt(nobs) ### scaled process HERE (?) 
  vcov <- control$vcov
  if(is.null(obj)) vcov <- "opg"
  if(vcov != "opg") {
bread <- vcov(obj) * nobs
  }
  if(vcov != "info") {
## correct scaling of estfun for variance estimate:
## - caseweights=FALSE: weights are integral part of the estfun 
## -> squared in estimate
## - caseweights=TRUE: weights are just a factor in variance 
## estimate -> require division by sqrt(weights)
meat <- if(is.null(cluster)) {
  crossprod(if(control$caseweights) process/sqrt(weights) else 
        process)
} else {
  ## nclus <- length(unique(cluster)) ## nclus / (nclus - 1L) * 
  crossprod(as.matrix(apply(if(control$caseweights) 
       process/sqrt(weights) else process, 2L, tapply, 
        as.numeric(cluster), sum)))
    }
  }
  J12 <- root.matrix(switch(vcov,
                            "opg" = chol2inv(chol(meat)),
                            "info" = bread, 
                            "sandwich" = bread %*% meat %*% bread
  ))
  process <- t(J12 %*% t(process)) ## NOTE: loses column names
  
  ## select parameters to test
  if(!is.null(control$parm)) {
    if(is.character(control$parm)) colnames(process) <- colnames(estfun)
    process <- process[, control$parm, drop = FALSE]
  }
  k <- NCOL(process)
  
  ## get critical values for supLM statistic
  from <- if(control$trim > 1) control$trim else ceiling(nobs * 
     control$trim)
  from <- max(from, minsize)
  to <- nobs - from
  lambda <- ((nobs - from) * to)/(from * (nobs - to))
  
  beta <- partykit:::mob_beta_suplm
    
  logp.supLM <- function(x, k, lambda)
  {
    if(k > 40L) {
      ## use Estrella (2003) asymptotic approximation
      logp_estrella2003 <- function(x, k, lambda)
        -lgamma(k/2) + k/2 * log(x/2) - x/2 + log(abs(log(lambda) * 
              (1 - k/x) + 2/x))
      ## FIXME: Estrella only works well for large enough x
      ## hence require x > 1.5 * k for Estrella approximation and
      ## use an ad hoc interpolation for larger p-values
      p <- ifelse(x <= 1.5 * k, (x/(1.5 * k))^sqrt(k) * 
        logp_estrella2003(1.5 * k, k, lambda), logp_estrella2003(x, 
         k, lambda))
    } else {
      
      ## use Hansen (1997) approximation
      nb <- ncol(beta) - 1L
      tau <- if(lambda < 1) lambda else 1/(1 + sqrt(lambda))
      beta <- beta[(((k - 1) * 25 + 1):(k * 25)),]
      dummy <- beta[,(1L:nb)] %*% x^(0:(nb-1))
      dummy <- dummy * (dummy > 0)
      pp <- pchisq(dummy, beta[,(nb+1)], lower.tail = FALSE, 
             log.p = TRUE)
      if(tau == 0.5) {
        p <- pchisq(x, k, lower.tail = FALSE, log.p = TRUE)
      } else if(tau <= 0.01) {
        p <- pp[25L]
      } else if(tau >= 0.49) {
        p <- log((exp(log(0.5 - tau) + pp[1L]) + 
           exp(log(tau - 0.49) + pchisq(x, k, lower.tail = FALSE, 
                log.p = TRUE))) * 100)
        ## if p becomes so small that 'correct' weighted averaging 
        ## does not work, resort to 'naive' averaging
        if(!is.finite(p)) p <- mean(c(pp[1L], pchisq(x, k, 
            lower.tail = FALSE, log.p = TRUE)))
      } else {
        taua <- (0.51 - tau) * 50
        tau1 <- floor(taua)
        p <- log(exp(log(tau1 + 1 - taua) + pp[tau1]) + 
               exp(log(taua-tau1) + pp[tau1 + 1L]))
        ## if p becomes so small that 'correct' weighted averaging 
        ## does not work, resort to 'naive' averaging
        if(!is.finite(p)) p <- mean(pp[tau1 + 0L:1L])
      }
    }
    return(as.vector(p))
  }
  
  ## compute statistic and p-value for each ordering
  for(i in mtest) {
    zi <- z[,i]
    if(length(unique(zi)) < 2L) next
    if(is.factor(zi)) {
      oi <- order(zi)
      proci <- process[oi, , drop = FALSE]
      ifac[i] <- TRUE
      iord <- is.ordered(zi) & (control$ordinal != "chisq")
      
      ## order partitioning variable
      zi <- zi[oi]
      # re-apply factor() added to drop unused levels
      zi <- factor(zi, levels = unique(zi))
      # compute segment weights
      segweights <- if(control$caseweights) tapply(weights[oi], zi, 
          sum) else table(zi)
      segweights <- as.vector(segweights)/nobs
      
      # compute statistic only if at least two levels are left
      if(length(segweights) < 2L) {
        stat[i] <- 0
        pval[i] <- NA_real_
      } else if(iord) {
        proci <- apply(proci, 2L, cumsum)
        tt0 <- head(cumsum(table(zi)), -1L)
        tt <- head(cumsum(segweights), -1L)
        if(control$ordinal == "max") {  ### ordinal case
          stat[i] <- max(abs(proci[tt0, ] / sqrt(tt * (1 - tt))))
          pval[i] <- log(as.numeric(1 - mvtnorm::pmvnorm(
            lower = -stat[i], upper = stat[i],
            mean = rep(0, length(tt)),
            sigma = outer(tt, tt, function(x, y)
              sqrt(pmin(x, y) * (1 - pmax(x, y)) / (pmax(x, y) * (1 - pmin(x, y)))))
          )^k))
        } else {
          proci <- rowSums(proci^2)
          stat[i] <- max(proci[tt0] / (tt * (1 - tt)))
          pval[i] <- log(strucchange::ordL2BB(segweights, nproc = k,
            nrep = control$nrep)$computePval(stat[i], nproc = k))
        }
      } else {
        stat[i] <- sum(sapply(1L:k, function(j) (tapply(proci[, j], zi, sum)^2)/segweights))
        pval[i] <- pchisq(stat[i], k * (length(levels(zi)) - 1),
          log.p = TRUE, lower.tail = FALSE)
      }
    } else {
      oi <- if(control$breakties) {
        mm <- sort(unique(zi))
        mm <- ifelse(length(mm) > 1L, min(diff(mm))/10, 1)
        order(zi + runif(length(zi), min = -mm, max = +mm))
      } else {
        order(zi)
      }
      proci <- process[oi, , drop = FALSE]
      proci <- apply(proci, 2L, cumsum)
      tt0 <- if(control$caseweights && any(weights != 1L)) cumsum(weights[oi]) else 1:n
      
      from_to <- tt0 >= from & tt0 <= to
      stat[i] <- if(sum(from_to) > 0L) {
        
        xx <- rowSums(proci^2)
        xx <- xx[from_to]
        tt <- tt0[from_to]/nobs

        max(xx/(tt * (1 - tt))) 
        
      } else {
        0
      }
      pval[i] <- if(sum(from_to) > 0L) logp.supLM(stat[i], k, 
                    lambda) else NA
    }
  }
  
  ## select variable with minimal p-value
  best <- which.min(pval)
  if(length(best) < 1L) best <- NA
  rval <- list(pval = exp(pval), stat = stat, best = best)
  names(rval$pval) <- names(z)
  names(rval$stat) <- names(z)
  if(!all(is.na(rval$best)))
    names(rval$best) <- names(z)[rval$best]
  return(rval)
}

## DGP (data-generating process) for a logistic regression.
set.seed(77777L)
N      <- 10
b_low  <- 0.1
b_high <- 25
x1     <-  rnorm(n = N, mean =  1.5, sd = 4)
z1     <-  rnorm(n = N, mean =  0, sd = 1)
z2     <-  rnorm(n = N, mean =  1.5, sd = 4)
y_lin  <-  x1* ifelse(z1>0 , b_high, b_low)   
pr     <-  1/(1+exp(-y_lin)) 
y      <-  rbinom(N,1,pr)  
clust  <- sample.int(n = 100, size = N, replace = TRUE)
df <- data.frame(x1 = x1, z1 = z1, z2 = z2, y = y, clust = clust)
## Estimated model
m <- glm( y~x1, data=df, family="binomial")
## fluctuation test using sandwich matrix   
#       control$vcov <- "sandwich"

## Inputs for the m-fluctuation test.
estfun <- sandwich::estfun(m)
z <- data.frame(z1=z1, z2=z2)
obj <- m
weights <- rep(1,nrow(df))

## mob$control minimal required options 
control <- list(vcov =NULL, 
            caseweights = F, 
            mtry =  Inf, 
            trim =  0.1, 
            breakties = FALSE )
## minsize of leaf
minsize <-  2
## fluctuation test using outer product gradients 
control$vcov <- "opg"
mob_grow_fluctests(estfun= estfun, 
                   z = z, 
                   weights=weights,
                   obj = obj, 
                   cluster = clust)

# $pval
#        z1        z2 
# 0.7224722 0.6902683 
# 
# $stat
#       z1       z2 
# 3.653720 3.836886 
# 
# $best
# z2 
# 2 

1 Answer


$W_j(t)$ is the cumulative sum of the first proportion $t$ (with $0 \le t \le 1$) of the scores when ordered by splitting variable $j$. Thus, $W_j(t)$ is not a matrix but a $k \times 1$ vector, just like the score $\hat \psi_i$ of observation $i$. Hence, both are conformable for a matrix product with a $k \times k$ matrix.

The limiting process of this (under the null hypothesis of parameter stability) is a continuous $k$-dimensional Brownian bridge. But in a finite sample the empirical process can be written as an $n \times k$ matrix:

\begin{eqnarray*} \left( \begin{array}{l} W_j(0)^\top \\ W_j(1/n)^\top \\ W_j(2/n)^\top \\ \vdots \\ W_j((n-1)/n)^\top \\ W_j(1)^\top \end{array} \right) \end{eqnarray*}

In short: the equation in the paper gives one element (row) of the process; only the entire process can be seen as a matrix.

What the R code does is compute the entire process in one go (hence the transposing etc.), which is simply a more compact way of doing the same thing.
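
As a toy illustration (made-up objects, not partykit internals): stacking the rows $W_j(i/n)^\top$ reproduces in one go what evaluating the formula row by row would give.

## Toy sketch: the i-th row of the cumulated, decorrelated score matrix is W_j(i/n)^T
set.seed(3)
n <- 8; k <- 2
psi  <- scale(matrix(rnorm(n * k), ncol = k), scale = FALSE)  # toy scores, already ordered by Z_j
J12  <- strucchange::root.matrix(chol2inv(chol(crossprod(psi) / n)))
Wmat <- apply(psi %*% J12 / sqrt(n), 2L, cumsum)              # n x k process matrix
W4   <- drop(J12 %*% colSums(psi[1:4, ])) / sqrt(n)           # the k x 1 vector W_j(4/n)
all.equal(unname(Wmat[4, ]), unname(W4))                      # TRUE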

I also agree that the theory has not been written down in a lot of detail in the MOB paper because the corresponding inference framework was published in: Achim Zeileis, Kurt Hornik (2007). "Generalized M-Fluctuation Tests for Parameter Instability." Statistica Neerlandica, 61(4), 488-508. doi:10.1111/j.1467-9574.2007.00371.x

And the work with Ed Merkle uses a more standard notation focused on the maximum likelihood special case: Edgar C. Merkle, Achim Zeileis (2013). "Tests of Measurement Invariance without Subgroups: A Generalization of Classical Methods." Psychometrika, 78(1), 59-82. doi:10.1007/s11336-012-9302-4

Achim Zeileis