
I am calculating a logistic regression with the number of letters of a word given (0-5) in a task as the IV and whether the corresponding word was recalled in a later test (0/1) as the DV. All words in the dataset are five-letter words.

I want to check the assumption that my continuous predictor lettersProvided has a linear association with the logit of the outcome. This is usually investigated with the Box-Tidwell Test:

There are multiple ways to do this in R, here are two.

  1. Calculating a regression with an added interaction term of the IV and the ln of the IV:
lessR::Logit(correctAnswer ~ ltrsProvided + ltrsProvided:log(ltrsProvided), data= myData)

In this case, a significant interaction term (p < .05) indicates non-linearity.
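The same interaction-term check can be sketched in base R with `glm` instead of `lessR::Logit` (the data below is simulated for illustration; note it deliberately uses only positive predictor values, since `log` is only defined there):

```r
# Box-Tidwell-style check: add an x * log(x) interaction term to the model.
# Simulated illustration -- only works when the predictor is strictly positive.
set.seed(1)
n <- 200
ltrsProvided <- sample(1:5, n, replace = TRUE)     # positive values only
correctAnswer <- rbinom(n, 1, plogis(-1 + 0.5 * ltrsProvided))
myData <- data.frame(ltrsProvided, correctAnswer)

fit <- glm(correctAnswer ~ ltrsProvided + ltrsProvided:log(ltrsProvided),
           data = myData, family = binomial())
# p value of the interaction term; small values suggest non-linearity
summary(fit)$coefficients["ltrsProvided:log(ltrsProvided)", "Pr(>|z|)"]
```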

  2. Using a dedicated function from the car package:
car::boxTidwell(correctAnswer ~ ltrsProvided, data = myData)

However, neither method works with my data because it includes zeros, for which the ln is undefined.

There are some questions on this network about the Box-Tidwell test, but none of those with answers address my problem specifically. Some resources mention transforming the IV (only for the Box-Tidwell, as the regression works fine with 0s in the data!), but what kind of transformation is that? Adding 0.0001 to each level of ltrsProvided makes the ln computable but seems odd, since the variable cannot realistically take a value like 1.0001. Others suggest dropping the zeros, which is impossible for me because 0 is an important level of my IV.
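For concreteness, the workaround usually meant by "transforming the IV" is to shift the predictor by a constant inside the log term only, e.g. log(x + 1), so that x = 0 is defined; the main effect stays on the original scale. A minimal sketch on simulated data (the shift constant 1 is an arbitrary choice, which is exactly the concern raised above):

```r
# Hypothetical sketch: Box-Tidwell-style check with a shifted log term,
# log(x + 1), so that x = 0 is defined. The shift constant is arbitrary.
set.seed(2)
n <- 200
ltrsProvided <- sample(0:5, n, replace = TRUE)     # includes zero
correctAnswer <- rbinom(n, 1, plogis(-1 + 0.5 * ltrsProvided))
myData <- data.frame(ltrsProvided, correctAnswer)

fit_shift <- glm(correctAnswer ~ ltrsProvided + ltrsProvided:log(ltrsProvided + 1),
                 data = myData, family = binomial())
summary(fit_shift)$coefficients
```

Because the result can depend on the chosen shift, this is at best a rough diagnostic, not a definitive test.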

Splines frequently come up in answers to similar (but not identical) questions, but I don't think they are appropriate in my case (few levels), and I have no reason to suspect a non-linear relationship.

tl;dr: How can I temporarily transform my continuous IV in order to conduct a Box-Tidwell test?

Example of a reduced version of my data:

ltrsProvided   correctAnswer
           1             1
           2             1
           3             1
           4             0
           0             1
           1             1
           5             0
           2             1
           0             1
           4             0

Code and plot regarding @Demetri Pananos's post and comment.

The relationship does not look linear at all...

vis_grouped_d <- grouped_d %>% mutate(prop = y / n, logit_prop = qlogis(prop))
ggplot(vis_grouped_d) + geom_line(aes(ltrs_gvn, logit_prop))

[plot: logit of recall proportion by ltrs_gvn]

More information about the research design

It's basically a replication of McCurdy et al. (2021), but online. It's an experiment on the generation effect (content that you create yourself is remembered better than content you only read) and on generation constraint (the degree of self-generation may influence the memory benefit from generation). In the experiment, subjects had to create and write down target words with the help of a cue and 0-5 letters of the target given to them. In a later test, their target words and some distractors were presented to them, and they had to decide whether each word had been created earlier (recognition correct; 0/1). The ltrs_gvn levels 0-4 are the manipulation of the generation constraint, while ltrs_gvn = 5 is the read control condition.

They used mixed-effects logistic regression with random intercepts for subjects and words (code and data are available online). I want to use regular logistic regression because there are so few resources on mixed models.

1 Answer


I've never heard of a Box-Tidwell test. There are easier ways, in my opinion, to assess model fit. Here is one way...

One approach you could use is a deviance goodness-of-fit test. Roughly, this is an omnibus test for goodness of fit, so you actually want to fail to reject the null. Let me set up some fake data here:

library(tidyverse)

set.seed(0)
N = 120
x = sample(1:5, size = N, replace = T)
eta = x - 0.8
p = plogis(eta)
y = rbinom(N, 1, p)
d = tibble(letters_provided=x, recalled=y)

There is a strong relationship between the letters provided and the probability of recall in this example. Note that the true effect really is linear. To do the deviance goodness of fit test, first fit a logistic regression in which we group the data by the letters provided (very natural since you only provide a finite number of letters)

grouped_d = group_by(d, letters_provided) %>% summarise(n = n(), y = sum(recalled))

model = glm(cbind(y, n-y) ~ letters_provided, data = grouped_d, family = binomial())

and then analyze the residual deviance statistic

deviance_gof = function(model){
  # Deviance has an asymptotic chi-square distribution
  dev = model$deviance
  dof = model$df.residual
  # Manual calculation of the p value for the test
  p.value = pchisq(dev, dof, lower.tail = F)
  p.value
}

deviance_gof(model)
[1] 0.6443274

Failure to reject the test roughly means that the model is a good fit. So, if you fail to reject this test, you can conclude that the assumption of linearity is reasonable (or, more carefully, that you cannot conclude from the data that linearity is a bad assumption; there may be non-linear effects which require more data to estimate precisely).

If you crank up the sample size in this example, the deviance tends to the degrees of freedom, which is roughly the median of the chi-square distribution, hence the p value will hover around 0.5 if the model you've chosen is the right model. However, more data can come back to bite you, since even modest non-linearity can be estimated precisely and ruin the test. If you had lots of data and, on the logit scale, the data had functional form $\log(x/10) + 1.5$, then you would reject the null of the deviance goodness-of-fit test even though linearity is a fine assumption to make.
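A quick simulation of that scenario (the functional form and sample size here are illustrative assumptions; base R `aggregate` is used for grouping so the sketch is self-contained):

```r
# Sketch: with a large sample and a mildly non-linear truth on the logit
# scale, log(x/10) + 1.5, the deviance GOF test rejects the linear model.
set.seed(3)
N <- 100000
x <- sample(1:5, N, replace = TRUE)
y <- rbinom(N, 1, plogis(log(x / 10) + 1.5))

# Group by predictor level: successes y and trials n per level
grouped <- aggregate(cbind(y = y, n = 1), by = list(x = x), FUN = sum)

fit_lin <- glm(cbind(y, n - y) ~ x, data = grouped, family = binomial())
pval <- pchisq(fit_lin$deviance, fit_lin$df.residual, lower.tail = FALSE)
pval   # tiny: the test rejects, even though a linear fit is not far off
```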

It's funny you mention splines and their inappropriateness. If you have no reason to suspect a non-linear relationship, that means either the effect is 0 or it is linear. If you're willing to assume it is linear, why would you want to test for a linear effect? In my opinion, splines would be a good approach: in the worst-case scenario you spend a few degrees of freedom to estimate a linear effect. That isn't as big a deal as the alternative, in which the effect is truly non-linear and you introduce a ton of bias by not modelling it as non-linear. Additionally, you could perform a chunk test to determine whether the non-linear terms explain additional variation in the data. THAT would be a hell of a way to address the assumption of linearity. Here is how to do a chunk test in R with rms:

library(rms)
model2 = lrm(recalled ~ rcs(letters_provided, 3), data = d)
anova(model2)
               Wald Statistics          Response: recalled 

 Factor           Chi-Square d.f. P     
 letters_provided 7.82       2    0.0201
  Nonlinear       4.13       1    0.0422
 TOTAL            7.82       2    0.0201

You can see that the nonlinear test has a significant p value (though not overwhelmingly so; the result is suspect IMO and likely due to the small sample size of 120), meaning the non-linearities seem to result in a better fit.

TL;DR Do a deviance goodness of fit test or do splines and do a chunk test for the non-linear terms.

Demetri Pananos
  • The first method returns a p indicating high significance (1.608934e-17). That might be due to my N>3000, right? Do you know any citable resources? You wrote: "If you have no reason to suspect a non-linear relationship, that means either the effect is 0 or it is linear." and " If you're willing to assume it is linear, **why would you want to test for a linear effect?**". My course resources simply say this assumption has to be checked. I'm only struggling because my IV includes a zero. From my understanding splines etc should be introduced if/once the assumption is violated... – polarlicht Mar 16 '21 at 10:48
  • Also, is the assumption of the linearity to the logit really a model fit indicator? – polarlicht Mar 16 '21 at 13:54
  • @polarlicht Correct, your large sample size is likely the culprit of the rejection. Again, that isn't necessarily a bad thing. Because you have such a large sample, group the observations by the number of letters held out. Compute the proportion of correct recalls and plot the logit of the proportion. Does the line look straight? – Demetri Pananos Mar 16 '21 at 14:01
  • I added the code and the corresponding plot to my original post. It does not look linear at all... (And not even fully monotone, but that is probably negligible?) – polarlicht Mar 16 '21 at 16:22
  • @polarlicht How many in that last group – Demetri Pananos Mar 16 '21 at 16:24
  • The n for all 6 groups ranges from 519 to 535. ltrs_gvn = 5 has 527. – polarlicht Mar 16 '21 at 16:28
  • Hmm, there seems to be a strong effect of leaving 5 letters out. Is this not all the letters? – Demetri Pananos Mar 16 '21 at 16:29
  • It's the other way round: all words used are five-letter words, therefore 5 means *no* letters were left out. – polarlicht Mar 16 '21 at 16:36
  • @polarlicht So if no letters are left out, shouldn't everyone recall the word? – Demetri Pananos Mar 16 '21 at 17:49
  • I added some information about the research design to the post. Maybe that will clear things up! – polarlicht Mar 16 '21 at 20:29
  • @polarlicht email me. I think this conversation needs to be taken offline so that I can properly grasp what is going on. You can find contact info in my profile. – Demetri Pananos Mar 16 '21 at 20:36