I'm working on a study that involves measuring the breathing effort of individuals over short periods of time (specifically breath distention). The dependent variable is compared between a new therapeutic group and a control group. There are equal numbers of individuals in each group (which is actually a small miracle of coincidence, considering recruitment goals are often not met).
However, the length of the reads (and therefore the number of records per read) varies across the devices used to measure breath distention. The probes all sample at 70 Hz, and the results from each reading, paired with actual video observation, give a composite value for each observation (again at 70 Hz).
We eliminate "bad" reads (where the body was moving, the probes were out of sync, etc.), and have good empirical methods for doing so. This normally means we end up with between 1 and 2.5 min of "good" uninterrupted read time for each individual.
If you are sampling at 70 Hz, a difference of 90 s leads to a substantial difference in the number of records for any one individual. I would consider this difference in read time to be a separate variable, most likely unrelated to the interventions performed (though I can't strictly rule that out).
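To put numbers on that, here is the raw arithmetic (durations taken from the 1 to 2.5 min range above):

```python
# Back-of-the-envelope: records captured per individual at 70 Hz
# for the shortest and longest "good" read durations.
sample_rate_hz = 70

for duration_s in (60, 150):  # 1 min and 2.5 min of good read time
    print(f"{duration_s:>3} s -> {sample_rate_hz * duration_s} records")

# Output:
#  60 s -> 4200 records
# 150 s -> 10500 records   (a spread of 6300 records between individuals)
```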
The data for each individual seem normally distributed, and the groups (treatment and control) also seem normally distributed. Therefore, I'm inclined to ignore the differences in record counts per individual and continue with an ANOVA on the single time points.
Would it be more appropriate to trim or transform the data so that the sampling is weighted or normalized per individual?
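For concreteness, the kind of per-individual normalization I'm picturing is roughly the following (a rough sketch only; the file and column names are hypothetical, and the per-subject mean is just one possible summary statistic):

```python
import pandas as pd
from scipy import stats

# One row per 70 Hz record; each subject contributes however many records
# survived the "bad read" filtering (anywhere from ~4200 to ~10500).
df = pd.read_csv("distention_records.csv")  # hypothetical file and columns

# Collapse each subject's trace to a single value, so the differing record
# counts no longer enter the analysis directly.
per_subject = (
    df.groupby(["subject", "group"], as_index=False)["distention"].mean()
)

# One-way ANOVA between the two groups on the per-subject summaries
# (with only two groups this is equivalent to a t-test).
treatment = per_subject.loc[per_subject["group"] == "treatment", "distention"]
control = per_subject.loc[per_subject["group"] == "control", "distention"]
f_stat, p_value = stats.f_oneway(treatment, control)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```

That way each individual contributes exactly one observation, regardless of how long their usable read was.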
I had a hard time finding anything addressing different numbers of records per subject with identical numbers of subjects per group, so forgive me if I missed something obvious.