When I look up what exactly the offset is doing, I only see references to Poisson or negative binomial models. Neither of those analyses works for me because I do not want to predict counts. I want to predict proportions.

My questions come down to: Can you use an offset in binomial logistic regression? If so, what is it doing mathematically? And is the variable I want to use for an offset appropriate for that purpose?

Here's an explanation in pretend terms:

A partner and I went to a bunch of sites and divided each of them into two zones: A and B. Within each site, I sampled every day to see if I could detect some species in zone A, while my partner sampled for the same species every day in zone B right next to me. My binomial response is the number of detections in A vs the number of detections in B, like comparing the number of heads and tails if you flip a coin 100 times. It's binomial because there are no other options. At a given site, a detection was either in A or it was in B.

The tricky part is that A and B sometimes weren't sampled equally. Let's say that we planned to sample at one of the sites for 20 days, but my partner, who was watching B, fell ill and had to leave after 10 days, leaving me to just watch A for the remainder of the time. So, at this site, A was sampled 20 times while B was sampled 10.

In the full period, I detected the species 8 times in A in 20 days, and my partner detected it 4 times in B before they had to leave after 10 days.

Now, using just the counts in A and B, it looks like we got twice as many captures in A as we did in B, and we did as far as raw numbers go (8 vs 4). But that doesn't account for that reduced sampling effort in B.

So, to account for that, I would like to use the proportion of sampling days spent in A as an offset. In this example, I would divide 20 by (20+10) to get an offset value of 0.67. It would tell the model that while there were twice as many detections in A as there were in B at this site, 67% of samples at the site were taken in A.

The reason I can't use raw sampling numbers is that the unevenly sampled zone varies from site to site. At some sites, zone B was sampled more, and at other sites, zone A was sampled more. My sampling unit is the site, not the zone.

The underlying problem here is that I don't want to predict the number of observations per day; I want to predict a proportion of total observations. I have my own data to calculate these proportions, but I also have a long term dataset of observations in B, but none in A. So for every one observation in B, how many were in A? That's the question I'm ultimately trying to answer with the data I collected myself.

Steve
  • Your statement "At a given site, a detection was either in A or it was in B" is hard to understand from a biological perspective; perhaps there's some aspect of your actual situation that isn't captured by this example. Is it possible that there could be _no_ detection in _either_ of zone A or zone B on any given day? Is the pairing within days important? – EdM Jan 21 '22 at 18:26
  • @EdM Yes, it is possible to get no detections. Just to provide a bit more detail, we're looking at species activity around hiking trails. Our two zones are close to trail and far from trail. We have a bunch of data of observations of the species on trails, but we want to know how many might have been missed because the species was too far off trail to be seen. That's where the data I collected on both zones comes in. Pairing within days is not important. It could just as easily be within hours or weeks. The key bit is that sampling effort wasn't balanced. – Steve Jan 21 '22 at 18:32
  • So it was possible _in principle_ to observe a species in both of zone A or zone B on any given day, that just didn't happen to occur in your data? – EdM Jan 21 '22 at 20:50
  • @EdM It was possible and it did happen occasionally. – Steve Jan 21 '22 at 21:31
  • I think this might be a duplicate of this https://stats.stackexchange.com/questions/148699/modelling-a-binary-outcome-when-census-interval-varies/148728#148728, i.e. "use a binomial model with a cloglog link" ... ?? – Ben Bolker Jan 22 '22 at 00:14
  • @BenBolker the only difference I can see is that in that example, exposure is a direct measurement of time. For me, the equivalent would be the proportion time dedicated to sampling zone A. E.g., at site 1, I found 30% of observations in Zone A (response) and I spent 40% of my time sampling there (offset). Taking log(%time) would give all negative values. I don't know what any of that means regarding the validity or success of the approach, though. However, I think you have answered my question about whether using an offset on a binomial regression is possible, which it seems like it is. – Steve Jan 22 '22 at 02:16
  • @BenBolker I've got a few follow-up questions. First, why cloglog instead of a logistic link function? What would happen if I constructed a model with your proposed method, except I used a logistic link instead of a cloglog link? Second, do you see any issues with using a proportion (Time in A / Total Sampling Time) or a ratio (Time in A / Time in B) as the offset instead of a raw exposure value? Third, do you have any recs of wildlife/ecology papers that use this method? I'm finding examples outside of the field (epidemiology, risk analysis), but I'd like to have one within my field. – Steve Jan 26 '22 at 11:41

1 Answer

There is an "offset" argument for a call to glm(), but in a binomial model it is a fixed term added to the linear predictor on the log-odds scale; the number of total trials is specified separately (through the weights argument or a two-column response). It's not clear that a log-odds offset would work well with your differential observation durations. As explained in this answer, it's hard to use a regression offset term to accomplish what you wish with a logistic regression model.

This is best analyzed as a count problem. I'd start with a Poisson model (log link) that includes the log of the observation duration as a regression offset term (coefficient fixed at 1) for each observation period in each zone.
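
As a sketch of what that model looks like written out (the symbols here are my own notation, not anything from your data): let $Y_i$ be the number of detections in observation period $i$, $t_i$ the duration of that period, and $x_i$ an indicator that equals 1 for zone A and 0 for zone B.

```latex
\begin{aligned}
Y_i &\sim \operatorname{Poisson}(\mu_i) \\
\log \mu_i &= \log t_i + \beta_0 + \beta_1 x_i \\
\frac{\mu_i}{t_i} &= \exp(\beta_0 + \beta_1 x_i)
\end{aligned}
```

Because the coefficient on $\log t_i$ is fixed at 1, the model describes detections per unit of observation time, and $\exp(\beta_1)$ is the ratio of the detection rate in zone A to that in zone B at equal effort.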

It would be ideal if you have the actual number of observations for each day and zone. You then would model the number of observations per unit observation duration for each of zones A and B, and could use that to estimate what seems to be your main interest, the ratio of observations between the 2 zones given the same observation duration for both zones. It would be less satisfying if all you have is "detected" (count of 1) or "not detected" (count of 0) as your data, but that wouldn't be far off from the ideal if the probability of more than 1 detection during any one observation period was small.
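
To make that concrete with the single site from your example (8 detections in 20 days in A, 4 detections in 10 days in B), here is a minimal sketch in Python. It assumes a Poisson model with only a zone effect and a log-duration offset, in which case the fitted rates are just count divided by duration. The function name and the Wald-type interval are my own illustrative choices, not output from any particular package, and a real analysis with many sites would fit the model with GLM software instead.

```python
import math


def poisson_rate_ratio(count_a, days_a, count_b, days_b, z=1.96):
    """Rate ratio (A vs B) for a two-group Poisson model with a log(days) offset.

    With only a zone indicator in the model, the maximum-likelihood rates
    are count / days for each zone. The interval is an approximate Wald
    interval on the log scale and requires both counts to be positive.
    """
    rate_a = count_a / days_a  # detections per day in zone A
    rate_b = count_b / days_b  # detections per day in zone B
    ratio = rate_a / rate_b
    # Approximate standard error of log(ratio) for two independent Poisson counts.
    se_log_ratio = math.sqrt(1 / count_a + 1 / count_b)
    lower = math.exp(math.log(ratio) - z * se_log_ratio)
    upper = math.exp(math.log(ratio) + z * se_log_ratio)
    return rate_a, rate_b, ratio, (lower, upper)


# Example site from the question: A sampled 20 days, B sampled 10 days.
rate_a, rate_b, ratio, ci = poisson_rate_ratio(8, 20, 4, 10)
print(f"rate in A: {rate_a:.2f} per day, rate in B: {rate_b:.2f} per day")
print(f"rate ratio A/B: {ratio:.2f}, approx. 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

At this site the two zones have the same detection rate per day once effort is accounted for, even though the raw counts differ by a factor of 2. The interval is wide because it is based on only 12 detections.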

EdM