
I have a database filled with different bird species that were seen on different dates (10 years of records).

Each row in the table contains:

Date, Time, Bird Species, Spot where it was seen

So it looks like:

2008-04-07, 14:22:48, Himalayan Snowcock, Spot-4
2008-04-19, 11:44:01, Ring-necked Pheasant, Spot-12
...
2019-05-20, 08:51:14, American Kestrel, Spot-8

It contains thousands of records like this.

Now I need to create a Birds × Months calendar in which each cell contains the probability of seeing that bird in that month.

How can I compute the probability based on the records I have?

whuber

2 Answers

  1. A simple approach would be to create a contingency table (Bird × Month, with one row per species and one column per calendar month) and then, for each combination of bird and month, divide the number of years in which that species was seen in that month by 10 (the number of years of records), which gives you a proportion. However, this approach throws away some information.

  2. If you wanted to incorporate the fact that some birds are much more common than others and that months differ in length, you could divide the number of sightings by the number of days in that month (accounting for leap years). This would capture the rarity of birds and make use of the fact that your data has a finer-than-monthly resolution. If data wasn't collected on a daily basis, you could instead divide by the number of days on which there was an observer.

  3. You could go much more complex and employ something like a time-series model to account for changes in detectability through time. That is probably a question in itself and would be much more challenging to carry out, at least for me.
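Options 1 and 2 can be sketched in a few lines of Python. This is only a minimal illustration, assuming the records are available as `(date, species)` tuples; the function names `proportion_of_years` and `sightings_per_day`, their parameters, and the sample rows are hypothetical and not part of the question.

```python
import calendar
from collections import defaultdict
from datetime import date


def proportion_of_years(records, n_years):
    """Option 1: share of years in which each species was seen in each month."""
    years_seen = defaultdict(set)  # (species, month) -> years with a sighting
    for day, species in records:
        years_seen[(species, day.month)].add(day.year)
    return {key: len(years) / n_years for key, years in years_seen.items()}


def sightings_per_day(records, first_year, last_year):
    """Option 2: sightings divided by the calendar days of that month,
    summed over all years covered (leap years are handled by calendar)."""
    counts = defaultdict(int)
    for day, species in records:
        counts[(species, day.month)] += 1
    days_in_month = {
        m: sum(calendar.monthrange(y, m)[1]
               for y in range(first_year, last_year + 1))
        for m in range(1, 13)
    }
    return {(sp, m): n / days_in_month[m] for (sp, m), n in counts.items()}


# Hypothetical sample rows, for illustration only.
records = [
    (date(2008, 4, 7), "Himalayan Snowcock"),
    (date(2008, 4, 19), "Himalayan Snowcock"),
    (date(2009, 4, 2), "Himalayan Snowcock"),
    (date(2009, 5, 20), "American Kestrel"),
]
prop = proportion_of_years(records, n_years=10)
rate = sightings_per_day(records, first_year=2008, last_year=2009)
```

Note that option 2 yields a rate (sightings per day) rather than a probability, so it can exceed 1 for common species. If the days with an observer present are known, that count could replace `days_in_month`.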

André.B
  • Hi Andre, thanks for your answer. I'm a bit confused about option 2, which I think may work best. The only "problem" I can think of, and I'm not sure how to handle it, is that in a certain month, let's say April 2018, birdwatchers may have visited the same spot on many days. A bird that lives there will then be seen on all of those dates. That would increase the number of times the bird was seen, but it would not mean that the bird is easier to see than others. – Stephen H. Anderson May 21 '19 at 11:37
  • It would just mean that the spot was visited more times. Is there a way to take that into consideration? I didn't mention that for each row of data I also have the ID (a number) identifying the exact spot where the bird was seen. Something like: 2019-05-20, 08:51:14, American Kestrel, 82 (82 being the spot) – Stephen H. Anderson May 21 '19 at 11:37
  • That does complicate the problem, but I am not sure there is a good way to tie that in, as there is no way to determine whether it was the same bird. If you wanted to take a conservative approach, you could remove records where the same bird was seen at the same spot multiple times in a given month. This would probably push the probabilities below what you could actually expect, though. – André.B May 21 '19 at 21:15
  • Alternatively, and this would be a fair bit of work, you could look up information about the behavioural ecology of each bird species with duplicate records at a given site/month and then decide whether or not to remove duplicates based on known information about the bird's home range and nesting behaviour. For instance, if the American Kestrel was seen repeatedly in prime nesting habitat, then perhaps remove the duplicates. Conversely, if it was seen during what would normally be migratory conditions, or in a habitat that it doesn't frequent, you could keep them. – André.B May 21 '19 at 21:22
  • Hi Andre, I was thinking of something similar to your first option. Something like this: the first time a bird is seen at a spot it counts as 1, but after that, every extra time the bird is seen at the same spot in the same month it counts as 0.25 (or some other value?) instead of 1. Do you think this is an improvement and would give better results? The problem with your last suggestion is that the number of bird species is over 1000 (different countries) and we don't even know the biology. – Stephen H. Anderson May 22 '19 at 11:51
  • A problem I have is that, for example, some birds are always in the water, like flamingos. If the observers went inland on 10 days and to the lake on only one, they saw a flamingo only once in 11 days, but that doesn't mean flamingos are rare; the problem is that the observers spent more days at inland locations. I'm really lost about how to work with that. – Stephen H. Anderson May 22 '19 at 12:22
  • Those are good points... You certainly could modify the probability for seeing the same bird multiple times in a month at the same spot, but the catch would be that the choice of secondary probability would be somewhat subjective (i.e. why not 0.3? or 0.5?), although that is not to say that it is wrong. It comes back down to what the goal of the exercise is. From what you have said, I am assuming that you want to build a calendar that gives you a rough probability of seeing a range of bird species at a given spot in a given month, right? Or is there a greater spatial constraint? – André.B May 22 '19 at 20:47
  • Hi, we want to make a dynamic calendar that displays RARE, HARD, MEDIUM, EASY, VERY EASY, depending on how difficult it is to see a certain species in a certain month (not place). We have places grouped by city. So there will be a calendar for each city (each city may have hundreds of places). – Stephen H. Anderson May 22 '19 at 20:49
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/93983/discussion-between-andre-b-and-stephen-h-anderson). – André.B May 22 '19 at 20:53
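The down-weighting idea from the comments (first sighting at a spot in a month counts as 1, every repeat counts less) could be sketched as follows. This is only an illustration, assuming records are `(date, species, spot)` tuples; the function name `weighted_counts` and the parameter `repeat_weight` are hypothetical, and the value 0.25 is the arbitrary choice mentioned in the comments, not a recommended setting.

```python
from collections import defaultdict
from datetime import date


def weighted_counts(records, repeat_weight=0.25):
    """Sum sightings per (species, calendar month), counting the first sighting
    of a species at a spot within a given year and month as 1 and every
    further sighting there as repeat_weight."""
    seen = set()  # (species, spot, year, month) combinations already counted
    totals = defaultdict(float)
    for day, species, spot in records:
        key = (species, spot, day.year, day.month)
        totals[(species, day.month)] += repeat_weight if key in seen else 1.0
        seen.add(key)
    return dict(totals)


# Hypothetical sample rows, for illustration only.
records = [
    (date(2018, 4, 3), "American Kestrel", 82),
    (date(2018, 4, 9), "American Kestrel", 82),
    (date(2018, 4, 21), "American Kestrel", 82),
    (date(2018, 4, 21), "American Kestrel", 15),
]
weighted = weighted_counts(records)
deduplicated = weighted_counts(records, repeat_weight=0.0)
```

Setting `repeat_weight=0.0` reproduces the conservative approach of dropping repeat records at the same spot in the same month, and `repeat_weight=1.0` gives the raw counts.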

You could potentially use a Poisson regression to model monthly counts of birds of a particular species as a function of explanatory/independent variables that you think affect the monthly rate of observations, such as linear time trends/calendar month/season/year (potentially supplementing this data with external data sources on things like weather).

After estimating the regression, you could then compute probabilities. As Poisson is a discrete probability distribution, you can calculate the probability of observing exactly X events (where X is the count you are interested in) for specified values of the predictors you used in the regression. If you wanted to know the probability of seeing at least one bird, you can compute it as 1 minus the probability of 0 birds seen. Here is another CV Q&A on predicting probabilities with Poisson regression: Poisson Regression : expectation vs probability for each outcome
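As a small illustration of that last step, the probabilities follow directly from the predicted mean count. The sketch below assumes the fitted regression has already produced a predicted monthly rate for some species and month; the function names and the value of `predicted_rate` are hypothetical.

```python
import math


def poisson_pmf(k, rate):
    """P(X = k) for a Poisson distribution with mean `rate`."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def prob_at_least_one(rate):
    """P(X >= 1) = 1 - P(X = 0) = 1 - exp(-rate)."""
    return -math.expm1(-rate)  # numerically stable form of 1 - exp(-rate)


# Hypothetical predicted mean number of sightings in one month.
predicted_rate = 0.7
p_zero = poisson_pmf(0, predicted_rate)
p_seen = prob_at_least_one(predicted_rate)
```

The "at least one" probability could then be binned into whatever categories the calendar needs.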

For this type of regression, you would normally be assuming that counts in consecutive months are not correlated, meaning that you do not have autocorrelation. If that assumption is not valid, which it probably is not, given that you have time series data, you can examine this Q&A for potential approaches: Poisson regression with (auto-correlated) time series
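One rough way to check that assumption is to look at the lag-1 autocorrelation of the model residuals (observed minus fitted monthly counts, in chronological order). The sketch below assumes such a residual series is already available; `lag_autocorrelation` and the sample values are hypothetical, and this is a quick diagnostic rather than a formal test.

```python
def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    denominator = sum(d * d for d in deviations)
    if denominator == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    numerator = sum(deviations[t] * deviations[t - lag] for t in range(lag, n))
    return numerator / denominator


# Hypothetical residuals for consecutive months.
residuals = [1.0, 2.0, 3.0, 4.0, 5.0]
r1 = lag_autocorrelation(residuals)
```

Values of `r1` far from zero suggest that consecutive months are correlated and that one of the approaches for autocorrelated counts is worth considering.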

Poisson regression has a couple of well-known relatives: Negative Binomial and Zero-Inflated Poisson (also zero-truncated Poisson, but that one should not apply here, as you should have a non-zero probability of 0 birds observed). Negative Binomial is useful when the counts are overdispersed, i.e. their variance exceeds their mean, whereas a Poisson model assumes the two are equal. Zero-Inflated Poisson is used when there are "excess" zeros in the data generated by a separate process.
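A quick, informal way to see which relative might be needed is to compare the variance of the counts with their mean, and the observed share of zeros with the share a Poisson distribution with that mean would imply. The sketch below assumes a list of counts for one species in one calendar month across years; `dispersion_summary` and the sample counts are hypothetical, and a proper check would look at the fitted model rather than raw counts.

```python
import math
import statistics


def dispersion_summary(counts):
    """Compare raw counts with what a Poisson distribution of equal mean implies.
    Needs at least two counts (sample variance) and a non-zero mean."""
    mean = statistics.mean(counts)
    if mean == 0:
        raise ValueError("all counts are zero; dispersion ratio is undefined")
    variance = statistics.variance(counts)  # sample variance
    return {
        "mean": mean,
        "variance": variance,
        "dispersion": variance / mean,  # about 1 under a Poisson model
        "zero_share": sum(1 for c in counts if c == 0) / len(counts),
        "poisson_zero_share": math.exp(-mean),  # P(X = 0) under Poisson
    }


# Hypothetical counts for one species in one calendar month over five years.
summary = dispersion_summary([0, 0, 0, 0, 10])
```

A `dispersion` well above 1 points towards Negative Binomial, and a `zero_share` well above `poisson_zero_share` points towards a zero-inflated model.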

AlexK