I have to be somewhat vague about the data for confidentiality reasons (I'm not allowed to share the whole dataset). I have a dataset (X) of line transects of animals in space:

  Birds        area      len       transect
  1            0.239310  0.478621  1
  2            0.238463  0.476927  1
  1            0.244382  0.488765  2
  4            0.236501  0.473002  2 
  0            0.245832  0.491665  3
  1            0.241026  0.482052  4 

When I calculate the mean density of the birds, e.g.:

meandensity <- sum(X$Birds)/sum(X$area)

in the dataset, I get a mean value of ~3.63.

The problem is that I've been given a MATLAB module written by someone else (I can't read MATLAB code), who says it uses a block bootstrap method to estimate the mean and confidence intervals. When I run that module, the mean density is estimated at ~4.6.

Now, I spoke to this person and have tried to replicate his bootstrap method and am getting a value of ~3.5.

My question: Can the bootstrapped estimate of the mean really differ this much from the mean of the whole dataset? My understanding was that the bootstrap estimate shouldn't be far from the mean, and that differences between bootstrap methods would affect the CIs more than the estimated mean.
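For readers who want something concrete to compare against, here is a minimal sketch of a transect-level bootstrap of the ratio estimator, assuming "block" means resampling whole transects with replacement (as described in the comments below). It uses only the six example rows from the table, so it will not reproduce the ~3.63 figure. The function names `density_ratio` and `boot_by_transect` and the choice of 2000 replicates are illustrative assumptions, not the MATLAB module's method.

```r
# Toy data copied from the table above; the full dataset is not available.
X <- data.frame(
  Birds    = c(1, 2, 1, 4, 0, 1),
  area     = c(0.239310, 0.238463, 0.244382, 0.236501, 0.245832, 0.241026),
  transect = c(1, 1, 2, 2, 3, 4)
)

# Ratio estimator of density: total birds over total area surveyed.
density_ratio <- function(d) sum(d$Birds) / sum(d$area)

# Resample whole transects with replacement, keeping every row of each
# drawn transect, and recompute the ratio estimator on each resample.
boot_by_transect <- function(d, n_boot = 2000) {
  rows_by_transect <- split(seq_len(nrow(d)), d$transect)
  n_transects <- length(rows_by_transect)
  replicate(n_boot, {
    drawn <- sample(n_transects, n_transects, replace = TRUE)
    density_ratio(d[unlist(rows_by_transect[drawn]), ])
  })
}

set.seed(1)  # arbitrary seed, for reproducibility only
boot <- boot_by_transect(X)

point_estimate <- density_ratio(X)
boot_mean      <- mean(boot)                       # bootstrap mean
boot_ci        <- quantile(boot, c(0.025, 0.975))  # percentile interval
```

Comparing `boot_mean` with `point_estimate` shows how far resampling alone moves the estimate. A much larger gap in another implementation would suggest that it computes a different statistic, for example a mean of per-row or per-transect densities rather than total birds over total area.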

  • Please include what "block bootstrap" is according to you and in this specific example. – Jim May 08 '18 at 16:33
  • There is a block bootstrap that is used in time series analysis, but what kind of block bootstrap is being used in this situation? How does it differ from the ordinary bootstrap? – Michael R. Chernick May 08 '18 at 16:55
  • Well, you seem to have spatial dependence in your data and, if the data have been collected over time as well as over space, you'll have temporal dependence. So whatever block bootstrap method you will use will have to reflect the spatial and/or temporal dependence aspects of the data. – Isabella Ghement May 08 '18 at 18:11
  • Thanks for the questions. Block-bootstrapping in this case is spatial. So, the bootstrap unit is Transect ID, instead of each individual data point. – GrantRWHumphries May 08 '18 at 21:10
  • @isabellaGhement there is a temporal aspect, but it's very short. These are aerial surveys, and so it's a matter of hours for an entire survey. We treat them here as "snapshots" – GrantRWHumphries May 08 '18 at 21:12
  • Yes it can. Actually, the difference may be seen as a first approximation of the bias. That said, a block bootstrap in space is clearly trickier than its temporal counterpart, since it requires respecting the spatial structure without breaking it, in contrast to the time series case, which allows one to consider small chunks of adjacent temporal observations. Which block-bootstrap method are you using? – keepAlive May 08 '18 at 21:46
  • @Kanak - we are block-bootstrapping by transectID - where we sample the transects, rather than individual points (data rows) – GrantRWHumphries May 08 '18 at 23:33
  • Are your transects spatially independent? – keepAlive May 09 '18 at 08:14
  • @Kanak Yes they are - each transect is approx 5 km from the last. Spaced evenly across the study area – GrantRWHumphries May 09 '18 at 08:59
  • So what is the justification behind the use of the block- method? I do not see any. Do you see my point? – keepAlive May 09 '18 at 09:09
  • @Kanak Aye - I don't disagree with you at all - a standard bootstrap gives a value of ~3.5 (which is what I started with), which closely matches the estimate I get using a distance analysis. This "black box" (to me) method that's been provided to me has been explained as a block bootstrap (and is buried in a GUI so I cannot trace the code - not that I can read MATLAB anyway). The GUI, using the exact same data, gives an estimated mean of ~4.5. What I can't figure out is how two bootstraps on spatially independent data can give such vastly differing results. – GrantRWHumphries May 09 '18 at 10:06

1 Answer

Can you find a way to assess the amount of spatial dependence in your data? If you find evidence for weak dependence, that should confirm that your understanding is correct. For that assessment, I would imagine you need lat and lon coordinates for the midpoint of each area or something to that effect. If you don't have access to these coordinates, you would have to apply block bootstrapping blindly, which is not ideal, as you may end up either under-estimating or over-estimating the block length.
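One common way to run such an assessment is Moran's I with a permutation test. The sketch below writes it out in base R with inverse-distance weights. The coordinates and densities are simulated stand-ins, since the real midpoints are not available, and the function name `morans_i`, the weighting scheme and the 999 permutations are all assumptions made for illustration. It also assumes that no two midpoints coincide.

```r
# Moran's I with inverse-distance weights, written out in base R.
# Assumes all coordinates are distinct (a zero distance gives an infinite weight).
morans_i <- function(z, coords) {
  w <- 1 / as.matrix(dist(coords))  # inverse-distance weights
  diag(w) <- 0                      # no self-weight (replaces the Inf on the diagonal)
  dev <- z - mean(z)
  (length(z) / sum(w)) * sum(w * outer(dev, dev)) / sum(dev^2)
}

set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical midpoints and densities; replace with the real values.
coords <- cbind(x = runif(30), y = runif(30))
dens   <- rpois(30, lambda = 2) / 0.24

observed <- morans_i(dens, coords)

# Permutation test: shuffle densities over locations to get the null distribution.
permuted <- replicate(999, morans_i(sample(dens), coords))
p_value  <- (1 + sum(permuted >= observed)) / (1 + length(permuted))
```

A small one-sided `p_value` would point to positive spatial autocorrelation at the chosen distance weighting. Running the same check on per-transect densities rather than per-row densities would speak to whether transects can be treated as independent resampling units.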

By the way, for your computation of density, shouldn't you divide the sum of the bird counts by the sum of the areas? It seems like your formula is set backwards (unless I am missing something).

I guess another way to estimate the expected density (birds/km^2) is to use a Poisson regression where the bird count in each area is the outcome variable and the log of the area size is treated as an offset. The model would include an intercept as well as a possibly nonlinear function of lat and a possibly nonlinear function of lon, with the latter two aiming to capture spatial dependence among bird counts from areas that are close to each other. (See "Can I use glm with Poisson family if counts data are treated as density?")
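As a minimal sketch of the offset idea, the intercept-only version of that model can be fitted to the six example rows from the question. Coordinates are not available, so the spatial terms are left out here. With a log link and `log(area)` as the offset, the exponentiated intercept equals total birds over total area, which makes it a useful sanity check against the ratio formula.

```r
# Example rows from the question; the full dataset is not available.
X <- data.frame(
  Birds = c(1, 2, 1, 4, 0, 1),
  area  = c(0.239310, 0.238463, 0.244382, 0.236501, 0.245832, 0.241026)
)

# Intercept-only Poisson regression with log(area) as an offset,
# so the model is for birds per unit area rather than birds per row.
fit <- glm(Birds ~ 1, family = poisson, offset = log(area), data = X)

glm_density   <- unname(exp(coef(fit)))       # fitted density (birds per unit area)
ratio_density <- sum(X$Birds) / sum(X$area)   # ratio estimator from the question
```

Here `glm_density` and `ratio_density` agree up to the fitting tolerance. A spatial version would add terms in the coordinates (for example spline terms in lat and lon) to the right-hand side of the formula.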

Isabella Ghement
  • Thanks - there is some level of spatial autocorrelation between points, but not between transects as it turns out, transects being the unit that I am doing the bootstrap on. Sorry for the mix-up with the formula - I made the mistake in this post, not in the code. – GrantRWHumphries May 10 '18 at 12:05
  • If you have multiple observations from the same transect, I would imagine those are correlated. – Isabella Ghement May 10 '18 at 14:43
  • Aye - they are - within transects - due to bird flocking behaviour. – GrantRWHumphries May 11 '18 at 08:48