
I'm trying to evaluate whether a measure is improving with time.

This broadly reflects my real life data...

df <- data.frame(ch = rep(1:10, each = 12), # care home id
                 month_id = rep(1:12), # month using the system over the course of a year (1 = first month, 2 = second month, etc.)
                 totaladministrations = rbinom(n = 120, size = 1000, prob = 0.6), # administrations that were scheduled to be given in the month
                 missed = rbinom(n = 120, size = 20, prob = 0.8), # administrations that weren't given in the month (these are bad!)
                 beds = rep(rbinom(n = 10, size = 60, prob = 0.6), each = 12), # number of beds in the care home
                 rating = rep(rbinom(n = 10, size = 3, prob = 0.5) + 1, each = 12)) # latest inspection rating (1 = Inadequate, 2 = Requires Improvement, 3 = Good, 4 = Outstanding; size = 3 plus 1 keeps the simulated values in 1-4)


# Summary measures
df$missed_pct <- df$missed / df$totaladministrations * 100 # missed meds as a percentage of all scheduled administrations
df$missed_dm <- df$missed / 30.416 # missed meds daily mean for the month (30.416 ≈ 365/12 days per average month)

# classifications
df$ch_size <- car::recode(df$beds, "lo:29 = 1; 30:36 = 2; 37:hi = 3", as.factor = TRUE)
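If you'd rather avoid the car dependency, the same grouping can be sketched in base R with cut() (the breakpoints mirror the recode() call above; note that cut() keeps unused factor levels):

df$ch_size <- cut(df$beds,
                  breaks = c(-Inf, 29, 36, Inf), # <=29, 30-36, >=37 for whole-number bed counts
                  labels = c("1", "2", "3"))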


> str(df)
'data.frame':   120 obs. of  9 variables:
 $ ch                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ month_id            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ totaladministrations: int  555 586 607 604 598 597 588 585 573 570 ...
 $ missed              : int  14 16 18 15 16 15 15 19 17 14 ...
 $ beds                : int  36 36 36 36 36 36 36 36 36 36 ...
 $ rating              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ missed_pct          : num  2.52 2.73 2.97 2.48 2.68 ...
 $ missed_dm           : num  0.46 0.526 0.592 0.493 0.526 ...
 $ ch_size             : Factor w/ 2 levels "2","3": 1 1 1 1 1 1 1 1 1 1 ...


> head(df)
  ch month_id totaladministrations missed beds rating missed_pct missed_dm ch_size
1  1        1                  555     14   36      1   2.522523 0.4602841       2
2  1        2                  586     16   36      1   2.730375 0.5260389       2
3  1        3                  607     18   36      1   2.965404 0.5917938       2
4  1        4                  604     15   36      1   2.483444 0.4931615       2
5  1        5                  598     16   36      1   2.675585 0.5260389       2
6  1        6                  597     15   36      1   2.512563 0.4931615       2


> tail(df)
    ch month_id totaladministrations missed beds rating missed_pct missed_dm ch_size
115 10        7                  575     15   31      0   2.608696 0.4931615       2
116 10        8                  590     18   31      0   3.050847 0.5917938       2
117 10        9                  590     18   31      0   3.050847 0.5917938       2
118 10       10                  578     14   31      0   2.422145 0.4602841       2
119 10       11                  590     19   31      0   3.220339 0.6246712       2
120 10       12                  567     18   31      0   3.174603 0.5917938       2

Here, care home 1 was scheduled to give 555 drug administrations in month 1 and missed 14 of them (i.e. 14 failures and 541 successes). It has 36 beds, meaning its size class is 2, and it was rated 1 ('Inadequate') at the last inspection.

$beds is mostly constant. It is the number of beds a care home has registered with the authorities to provide care for; it does not reflect daily occupancy (there may be fewer residents in the care home on a given day). It can change, but rarely does, so for the purpose of this question I'm happy to treat it as a fixed effect. $beds determines $ch_size.

$rating is given by the authority after an inspection and is technically the 'most recent rating', where 1 = 'Inadequate', 2 = 'Requires Improvement', 3 = 'Good' and 4 = 'Outstanding'. To me this is an ordinal categorical variable and can be treated as such in a model. Inspections are carried out at intervals ranging from every few months to every two years (frequency is determined by the last rating: lower-rated care homes are inspected more frequently). I'm happy to treat this as a fixed effect for now to keep it simple.
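One way (a sketch, using the labels above) to make R treat $rating as ordered-categorical rather than numeric is:

df$rating <- factor(df$rating, levels = 1:4, ordered = TRUE,
                    labels = c("Inadequate", "Requires Improvement",
                               "Good", "Outstanding"))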

My real-life data has about 700 care homes. The variance in the sample data posted here probably doesn't reflect real-life variance (I'm new-ish to R and don't know how to simulate that), but hopefully it doesn't need to for this question, as I'm interested in model construction rather than interpretation.

I want to see whether missed medications improve (i.e. reduce) over time and whether this is affected by a care home's size and/or rating.

I need help with:

  1. Which is the better response measure to use: missed (Number of Missed Medications), missed_pct (ratio missed/total, expressed as a percentage), or missed_dm (daily mean)?

Bonus points for suggesting which models to subsequently use.

Note: I have excluded individual resident data for this project because of resource constraints and I wanted to keep it simple and do what I can with the summary count data. There may well be underlying factors also affecting missed medications, such as the ratio between number of care givers and number of care receivers on a given shift ... but that's a question for another day / project.

B_Real
  • As for (1), whether a Poisson model is appropriate of course depends on what you'll choose as a response variable. Since you mention you have both `missed` and `total`, I think the best way is *not* to turn this into a percentage, but to model these as is with a binomial GLM. If the question is reopened I will elaborate on this in an answer. – Frans Rodenburg Jun 03 '19 at 00:17
  • @FransRodenburg thanks, have edited accordingly. Hopefully the remaining question encourages explanation on what subsequent choice of model to use, or should I be more explicit? – B_Real Jun 03 '19 at 09:45

1 Answer


Research Question

Quoting your question:

I want to see if missed medications improve (i.e. reduce) over time and whether this is affected by its size and/or rating.

This can be incorporated into a regression model, using missed medications out of total as a response variable, and time, size and rating as explanatory variables. The coefficient of time can then tell you whether there are improvements (if there are, it will be negative).

Choosing an Error Distribution

Which is the better response measure to use

Your outcome variable is a ratio: $$(\text{missed medications}) : (\text{total medications})$$ These can be modeled using a binomial GLM. After all, the binomial distribution describes the number of successes out of $n$ trials, so just replace the words successes and trials with missed and total, respectively.

So why not any of the three response variables you suggested?

  • missed ignores the scheduled total medications;
  • missed_pct ignores the magnitude: 1 out of 2 becomes the same as 1000 out of 2000, while the latter should obviously carry more weight;
  • missed_dm is essentially the same as missed, but it also turns your counts into non-integer values. This is wasteful: the (ratio of) integer counts can be argued to follow a known distribution, but these pseudo-counts cannot so easily. A short sketch of the two-column binomial response follows below.
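To make that concrete, a minimal fixed-effects-only sketch with the simulated data above (successes = missed, failures = administrations given as scheduled):

# two-column (successes, failures) response for a binomial GLM
glm_fit <- glm(cbind(missed, totaladministrations - missed) ~ month_id,
               family = binomial, data = df)
summary(glm_fit)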

Modeling Repeated Measures over Time

Improvement over time means that you have repeated measures of the same care homes. These are not independent: Some care homes may take greater care to avoid missing medications than others. Perhaps more importantly, the number of beds surely depends on the care home as well, so you should account for this. This can be done using a random effect for $\text{care home}$ in the context of a mixed model.

The same argument could be made for the residents in a care home: the number of missed medications depends on the resident. Residents are nested within the care home they live in, so you could include $\text{resident ID}$ as a random effect nested in $\text{care home}$, as sketched below.
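In lme4 syntax that nesting would look something like the following sketch (resident_id and df_residents are hypothetical; neither appears in the posted data):

library(lme4)
# hypothetical: requires resident-level rows, which the posted data do not have
glmer(cbind(missed, totaladministrations - missed) ~ month_id +
        (1 | ch / resident_id), # random effect for resident nested in care home
      family = binomial, data = df_residents)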

(You didn't explicitly mention measurements of residents in your question, but as your data size is $12\times$ larger than the recorded variables suggest it should be, I assumed this was the case. Please edit your question to clarify this.)

As I understand from your updated question, this is not the case, so you only need a random effect for $\text{care home}$.

Constructing the Model

A model with both fixed and random effects is a mixed model, and a linear model with a non-normal error distribution from the exponential family (such as the binomial distribution) is called a generalized linear model (GLM). A regression model that has both of these properties is called a generalized linear mixed model (GLMM). There is a great R package for fitting GLMMs called lme4:

require(lme4) # install it first if you don't have it yet
GLMM <- glmer(cbind(missed, totaladministrations - missed) ~ beds + rating + month_id + (1 | ch),
              family = binomial, data = df) # response is cbind(events, non-events)

You can then inspect the summary, perform diagnostics and bootstrap confidence intervals for your estimates.
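For instance (a sketch; the parametric bootstrap refits the model nsim times, so it can be slow):

summary(GLMM)                               # fixed effects on the log-odds scale
confint(GLMM, method = "boot", nsim = 500)  # parametric bootstrap confidence intervals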

Unfortunately I cannot help you with the interpretation of the results: the number of beds in your example data is invariant within care homes, and none of the variables are simulated to correlate with the outcome. However, this should give you a fair idea of how you can model these data to answer your research question.

Choosing a Model

If you have a large number of variables you intend to include/exclude based on some criteria, I recommend you have a look here first to avoid inflated false positive rates and spurious findings.

That said, there are several options for model selection. You can compare a limited set of models based on AIC/BIC and the like, as sketched below. You can also leave out a random subset of observations to compare the predictive performance of the different models. Have a look at questions here on cross-validation and model selection.
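For example, a likelihood-ratio/AIC comparison of nested candidate models might look like this sketch:

GLMM0 <- update(GLMM, . ~ . - month_id) # candidate model without the time trend
anova(GLMM0, GLMM)                      # likelihood-ratio test; also reports AIC and BIC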


The below part is no longer relevant after clarification in the comments.

$\dagger$: I ignored the rating variable, because you should generally refrain from discretizing continuous variables. If the inspection rating is all you have in your actual data, then you may have to resort to that. However, if the inspection simply counts the number of beds, ask for this number instead. Frank Harrell has a nice post about why you shouldn't discretize continuous variables.

Frans Rodenburg
  • thanks so much for the response. I have updated the question to address some points. †: rating is a categorical ordinal variable, not continuous. Sample is now 120 rows not 1440 (my bad, apologies). No individual resident data is included here. – B_Real Jun 04 '19 at 11:22
  • The answer suggests using the ratio `missed` : `totaladministrations`, is this not effectively the same as using `missed_pct` (i.e. the ratio expressed as a percentage)? – B_Real Jun 04 '19 at 11:35
  • @B_Real Thank you for clarifying. However, can you let me know what the rating given by inspection is based upon? If it is based solely on the number of beds, then I think you're better off using `beds` as a variable in your model for the reason given in my answer. – Frans Rodenburg Jun 06 '19 at 00:55
  • the rating is based upon a physical inspection by the regulators. The regulators send a team of inspectors to the care home to assess various factors, such as how safe the residents are, how effective the care being given is, how caring the staff are, how responsive the care home is to the residents' needs, and how well led the care organisation is. It is not based on the number of beds, but rather the quality of care being given. In theory, care homes with higher ratings should see improvements in measures, but this may not always be the case in reality – B_Real Jun 06 '19 at 09:00
  • Just to add to the ratings: see here: https://www.cqc.org.uk/what-we-do/how-we-do-our-job/ratings. To save a click - the ratings go from Inadequate to Requires Improvement to Good to Outstanding (in order from worst to best). This is an 'overall' rating which is comprised of 5 component sub ratings (safe, effective, caring, responsive, well-led (as described above)). Future work would be to include the sub ratings in my model, but i'm only concerned with the 'overall' rating for now to keep it simple. – B_Real Jun 06 '19 at 09:28
  • Just for clarity, `beds` is not related to `ratings` in any way. `beds` is merely a reflection of how large or small the care home is. – B_Real Jun 06 '19 at 10:08
  • Hmm, if I run the equivalent of the suggested answer on my real data I get a strange warning... `> glm_mm_n` – B_Real Jun 06 '19 at 15:15
  • @B_Real, I updated my answer to include the rating variable based on your comments. However, as I mentioned in my answer, I cannot help you debug or interpret the results, because you don't have a MWE (that produces the same warning) or the actual data (so I can reproduce the warning). However, there are several questions on CV about lme4 convergence messages, and also please have a look here: http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#convergence-warnings – Frans Rodenburg Jun 06 '19 at 22:59
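Without a reproducible example the warning can't be diagnosed here, but a common first step suggested by the glmmFAQ linked above is to refit with a different optimizer, e.g. (a sketch):

GLMM2 <- update(GLMM,
                control = glmerControl(optimizer = "bobyqa", # alternative optimizer
                                       optCtrl = list(maxfun = 2e5))) # allow more function evaluations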