6

Here's a series of data I'm observing:

1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1

How do I use math to predict whether the next number in the series will be a 1 or a 0?

Jim
  • Computational mechanics is concerned with precisely this kind of question: Crutchfield, J. P. and Feldman, D. P. (2003). Regularities unseen, randomness observed: Levels of entropy convergence. _Chaos_, 13(1):25–54. – Alexis Apr 25 '14 at 19:47
  • 2
    Are all the observations independent? – Patrick Coulombe Apr 25 '14 at 19:53
  • Yes, observations are independent. – Jim Apr 25 '14 at 19:56
  • 4
    If observations are independent, you can't do better than predicting next number to be the number that happened most often. – Akavall Apr 25 '14 at 20:27
  • And if you do not know whether the observations are independent, computational mechanics gives strategies for parsing uncertainty into deterministic and stochastic processes, and for putting bounds on uncertainty for which the process cannot be determined (e.g. entropy). – Alexis Apr 25 '14 at 20:55
  • 1
    What do you know about the data generating process? – David LeBauer Apr 25 '14 at 21:57
  • I'm not quite sure @David, but I think it's something along the lines of this: every morning a person passes by a neighborhood sidewalk newspaper stand. Some days the person buys a newspaper, some days not. For the days she does buy a paper, a 1 is recorded. For the days she doesn't, a 0 is recorded. How can I predict whether that person will buy a paper tomorrow morning? – Jim Apr 25 '14 at 22:03
  • Thanks, that is very helpful. Where are you getting the data? Is it a five day week or a seven day week? If it is a five day week, it seems she hasn't missed a Tuesday, Wednesday, or Thursday, and has missed 2/4 Fridays and 1/5 Mondays. Since the next day would be Wednesday, this information is useful. – David LeBauer Apr 25 '14 at 22:41
  • @David. It's a seven day week, but she doesn't pass by every day. We only count the days in which she does pass by. – Jim Apr 25 '14 at 23:06
  • What is the basis on which to assert those paper-buying events will be independent? – Glen_b Apr 26 '14 at 00:37
  • I think your best bet would be to just ask! Otherwise, if observations are independent, isn't there a 19/22 chance the next number is 1? In that case, I'd go with 1, as pointed out by @PatrickCoulombe – David LeBauer Apr 26 '14 at 02:16
  • @David For the record, Akavall is the one who suggested predicting 1 (since it's the number that occurs most often) – Patrick Coulombe Apr 26 '14 at 02:25
  • It becomes a completely different question when you know the observations are tied to day of the week. (And actually three-state: no-show, no-buy, buy). (And also no longer independent events - e.g. maybe she can only afford to buy 4 newspapers each week) – Darren Cook May 02 '14 at 00:05

1 Answer

6

If observations are independent, and if values must either be 1 or 0, with no additional prior information, you may simply assume that the probability that the next value is 1 is equal to the proportion of 1s in the observations.
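As a minimal sketch of that estimate (plain Python; the variable names are illustrative and not part of the original answer), the proportion of 1s in the question's series can be computed directly, and the more frequent value used as the prediction:

```python
# The series from the question.
series = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

n = len(series)      # number of observations
ones = sum(series)   # number of 1s
p_hat = ones / n     # estimated probability that the next value is 1

# Assuming independent, identically distributed observations, predict the
# value that has occurred most often.
prediction = 1 if p_hat >= 0.5 else 0

print(f"{ones}/{n} = {p_hat:.3f}, predict {prediction}")
```

This only restates the independence assumption in code; it makes no use of the order of the observations.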

If you wish to calculate a confidence interval around this estimate, the observations could reasonably be modeled as Bernoulli trials with estimated probability $\hat{p}=19/22\simeq0.86$ and a 95% confidence interval of $[65\%,97\%]$ (calculated as the Clopper-Pearson interval).
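One way to reproduce an interval of this kind is sketched below. It uses only the standard library and finds the Clopper-Pearson bounds by bisection on the binomial tail probabilities; the function names are hypothetical, and this is not necessarily how the interval quoted above was computed.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_root(f, lo=0.0, hi=1.0, iters=200):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else bisect_root(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else bisect_root(
        lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper


lower, upper = clopper_pearson(19, 22)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

With 19 ones in 22 observations, the bounds should round to roughly 65% and 97%.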

This model is analogous to expecting heads from a coin that has landed on heads in 19 of 22 flips, or drawing a white pebble from a bag where the previous 22 draws gave 19 white + 3 nonwhite pebbles (if the pebbles are put back each time, or if there are infinitely many well-mixed pebbles).

See also https://stats.stackexchange.com/a/6184/1381 for information and alternative methods for computing confidence intervals for Bernoulli trials.

Given the number of up votes on the OP, perhaps there is a less trivial solution, but I suspect that it just looks like it would be interesting if the observations were related, and order mattered, rather than being independent.

David LeBauer
  • That seems to depend entirely on your definition of chance! – jsk Apr 28 '14 at 05:20
  • @jsk I am not sure how else to interpret chance in this context, though I did edit for clarity. – David LeBauer Apr 28 '14 at 05:43
  • 1
    Just because your best guess for p based on the assumptions of independence and your model is 19/22 does not mean that the chance that the next observation is 1 is 19/22. – jsk Apr 28 '14 at 06:34
  • @jsk what is it then? Let's use the word "probability" instead of "chance", because "chance" may have its own ambiguously embedded assumptions, while the probability of an observation given the model can be clearly defined. Certainly with additional information this probability could change, but given available information, I am not sure how else to compute P(observation|model). – David LeBauer Apr 28 '14 at 16:26
  • 1
    The point was not word choice of chance versus probability. The point is that saying your best guess, under your model, for the probability that the next event is a 1 is 19/22 is a lot different from claiming the ACTUAL probability of the next event being a 1 is 19/22. – jsk Apr 28 '14 at 18:33
  • @jsk ok. I updated the answer to say "Given these assumptions ..." – David LeBauer Apr 28 '14 at 19:00
  • 1
    I think what @jsk might be trying to get at is that the explanation in this answer conflates a data-based statistic with a model. This has the potential to confuse careful readers who have learned the importance of distinguishing between statistics, estimators, and model parameters. Even when the observations are independent, more can be said in this situation, because a good prediction consists not only of an estimate like "86% chance" but also provides a range of uncertainty around that estimate. – whuber Apr 28 '14 at 19:57
  • @whuber thanks for clarifying and keeping me on my toes. I have tried to straighten out these concepts. Any further improvements you could suggest? – David LeBauer Apr 29 '14 at 02:31
  • The confidence interval is around the estimate of the mean. This is not the same as confidence of our expectation of the next observation - which will either be 1 (p=0.86) or 0 (p=0.14). It can't be 0.65, or 0.86 or 0.97. There is no range of values that the next observation can take, other than 1 or 0. – david25272 Apr 29 '14 at 05:01
  • @david25272 does my answer imply otherwise? – David LeBauer Apr 29 '14 at 05:08
  • I'm not sure! You're stating a confidence interval around an estimate of the mean, which is quite correct, but I'm not sure of its relevance for an estimate of the next observation. – david25272 Apr 29 '14 at 05:11
  • @david25272 the mean is the probability that the next answer is 1 – David LeBauer Apr 29 '14 at 05:12
  • Yes, the expectation of the next observation is 1*0.86 + 0*0.14. If we predict a 1 (86% of the time) there is a 14% chance we are wrong. If we predict a 0 (14% of the time) there is an 86% chance we are wrong. In both cases the odds of being wrong (without knowing the prediction or the outcome in advance) is 0.86*0.14. In other words the mean is 0.86, the variance is 0.86*0.14. But the confidence interval you stated relates to the odds of the true mean lying outside the range stated for the estimated mean (assuming the mean is in fact an estimate). – david25272 Apr 29 '14 at 05:29
  • In other words, the confidence interval around an estimate of the mean is not the same as the interval around a point estimate. – david25272 Apr 29 '14 at 05:34
  • Thanks, David: +1. It might be worth pointing out that your approach is not a "simple assumption": your estimator of $19/22$ is one of many valid (admissible) estimators. For instance, the estimate $(19+1)/(22+2)$ is admissible, too (and has some nice properties). Thus, under the assumptions you make, your solution is not unique--and this fact may be worth emphasizing. Another thing: with a longer series of data it would be wise to test for independence, as a reality check on that assumption. – whuber Apr 29 '14 at 15:43
  • 1
    @whuber Indeed. Thanks for clarifying the point I was trying to make. In regard to the assumptions of the model, the assumption that someone's daily decisions can be modeled as independent trials with the same probability on each trial troubles me. – jsk Apr 29 '14 at 17:37
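
To make the comparison of estimators raised in the comments above concrete, here is a small sketch (plain Python; the variable names are illustrative) that evaluates both $19/22$ and the alternative $(19+1)/(22+2)$ for the observed counts. Both are estimates under the same independence assumption, and neither is the "actual" probability.

```python
from fractions import Fraction

ones, n = 19, 22  # counts taken from the series in the question

# Sample proportion, as used in the answer.
proportion = Fraction(ones, n)

# The alternative estimate mentioned in the comments: (19 + 1) / (22 + 2).
add_one = Fraction(ones + 1, n + 2)

print(f"sample proportion: {proportion} = {float(proportion):.3f}")
print(f"add-one estimate:  {add_one} = {float(add_one):.3f}")
```

The two estimates differ only slightly here, and both lead to predicting a 1.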