
The data below show the times, in seconds, between successive white cars in free-flowing traffic on an open road. Can these be modeled by an exponential distribution?

Time (s)     0-20   20-40   40-60   60-90   90-120   120-180
Frequency      41      19      16      13        9         2

My question: when I calculate the expected frequencies, the total should add up to 100, but it does not. Is that just a rounding error? I get a total of 98.89.
Should I add an extra category myself, from 180 to infinity, and take its expected frequency as 100 - 98.89 = 1.11? Is this category necessary, given that an exponential distribution extends to infinity?
Of course, since the expected frequency of this new category is less than 5, I would then have to pool it with the previous category.
But if I do not include the new category, the degrees of freedom change. So is it necessary to add the 180-to-infinity category?
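For reference, here is a minimal Python sketch of the calculation (assuming, as in the usual textbook approach, that lambda is estimated as 1 over the mean computed from the bin midpoints):

```python
import math

# Bin edges and observed counts from the table above
edges = [0, 20, 40, 60, 90, 120, 180]
freq = [41, 19, 16, 13, 9, 2]
n = sum(freq)  # 100 cars in total

# Estimate the mean from bin midpoints (method of moments on binned data)
mids = [(a + b) / 2 for a, b in zip(edges, edges[1:])]
mean = sum(f * m for f, m in zip(freq, mids)) / n  # 40.0
lam = 1 / mean                                     # 0.025

# Exponential CDF: F(t) = 1 - exp(-lam * t)
F = lambda t: 1 - math.exp(-lam * t)
probs = [F(b) - F(a) for a, b in zip(edges, edges[1:])]
expected = [n * p for p in probs]

print(round(sum(expected), 2))     # 98.89 -- mass beyond 180 s is missing
print(round(n * (1 - F(180)), 2))  # 1.11  -- expected count in (180, infinity)
```

The missing 1.11 is exactly the tail mass beyond 180 seconds, which is what the question about an extra category is getting at.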

Scortchi - Reinstate Monica
clarkson
  • Let's back up: do you have the raw data? Because if so, we wouldn't start from here. – Nick Cox Dec 06 '13 at 14:52
  • Raw data? The data for the problem is as given above. – clarkson Dec 06 '13 at 14:56
  • The raw data would be the individual times. This is a coarsely binned distribution, which is less satisfactory. These "Frequencies" look like percents, not counts, as they add to 100. You need the total number of measurements to apply a goodness-of-fit test. Where does 180 come from? You need at least one bin from 120 up. Also, we can't comment on your calculations if you don't show us what they are. – Nick Cox Dec 06 '13 at 15:02
  • Unless you are using a broken abacus, the sum of expected frequencies should be indistinguishable from $1$. But for us to help you, you will need to describe how you are computing the expected frequencies. As far as the degrees of freedom go, the correct value depends on how the bin endpoints ($0,20,40,60,80,90,120$) were chosen: if they were fixed before observing the data, you're fine; but if they were determined based on the data (as it would seem from their irregular spacing), [the situation is complicated](http://stats.stackexchange.com/a/17148/919). And please heed @Nick Cox's caution. – whuber Dec 06 '13 at 15:05
  • Sorry, I had made a mistake in writing the data values; it is now corrected. I am computing the expected frequencies by calculating the probability of each interval. Under the null hypothesis that the data follow an exponential distribution, for the first category I integrated from 0 to 20. The expected frequency for that category is then probability × total frequency = 0.3935 × 100. The lambda value for the exponential distribution is found by equating 1/lambda to the sample mean. My H0 is that the data follow an exponential distribution. – clarkson Dec 06 '13 at 15:12
  • Thanks for more details, but I don't see where the sample mean comes from or even what it is. To repeat: if the raw data are accessible, there are much better ways to do this. Binning is not only not needed, but also wasteful of the information in the data. You haven't clarified the ambiguity about whether the total of 100 means that you are working in percents, or it was the exact number of cars measured. – Nick Cox Dec 06 '13 at 15:23
  • 100 is the exact number of cars measured. Sample mean = (Σ f·x)/(Σ f). If the data follow an exponential distribution with parameter lambda, then since the expected value = 1/lambda, we set 1/lambda = (Σ f·x)/(Σ f). – clarkson Dec 06 '13 at 15:31
  • Is this a textbook-style question for some subject? Is it for your own study? – Glen_b Dec 06 '13 at 15:47
  • This looks like a homework problem, which would explain why raw data aren't available and why the problem seems narrowly framed. If this were a real data analysis, I would (1) probably include the additional ">180" empty bin; (2) figure that it was unlikely to make much difference because it's a small fraction of the data; (3) worry about computing the mean of the exponential from the midpoints of the bins. If it were a homework problem, I would ask the instructor for clarification. – Ben Bolker Dec 06 '13 at 15:48
  • Picky of me, but for "lamda" read "lambda" throughout. – Nick Cox Dec 06 '13 at 15:51
  • This is not real data analysis. Just a homework problem to understand how the goodness-of-fit test is done. – clarkson Dec 06 '13 at 16:00
  • The final question about whether to add the $(180,\infty)$ bin is interesting. It turns out not to matter much, because the $\chi^2$ statistic barely changes. It creates problems, though, because the expected count in that bin is only $0.96$: too low for the $\chi^2$ distribution to be a good approximation to the statistic. Thus, a permutation test would need to be applied to compute the p-value. You're better off just considering the last bin to extend from $120$ to $\infty$. – whuber Jun 27 '14 at 17:01

1 Answer


First, rearrange the table so it makes better sense, and calculate the mean. To do this, take the midpoint of each interval and weight it by the corresponding frequency; you should get a mean of 40. To get lambda, use lambda = 1/mean, which gives lambda = 0.025. Use this with the exponential CDF, F(t) = 1 - e^(-lambda·t), to calculate the probability of each interval, and multiply each probability by the total frequency (100) to get the expected frequencies. Because we are dealing with a continuous distribution, the first and last intervals should be treated as "less than 20" and "more than 120". So the first probability is F(20) = 1 - e^(-0.025×20) = 0.3935 (keep at least 4 d.p.), the second is F(40) - F(20) = (1 - e^(-0.025×40)) - (1 - e^(-0.025×20)) = 0.2387, and so on until all intervals have been done. Multiply each probability by 100, and keep the expected frequencies to 2 d.p.

Time            0-20     20-40    40-60    60-90   90-120     120+
Frequency         41        19       16       13        9        2
p(interval)   0.3935    0.2387   0.1447   0.1177   0.0556   0.0497
Expected       39.35     23.87    14.47    11.77     5.56     4.97

The expected frequency for the 120+ category is below 5, so pool the last two categories: observed 9 + 2 = 11, expected 5.56 + 4.97 = 10.53.

Time            0-20     20-40    40-60    60-90      90+
Frequency         41        19       16       13       11
Expected       39.35     23.87    14.47    11.77    10.53
(O-E)^2/E    0.06919   0.99359  0.16178  0.12854  0.02098

Chi-squared statistic = sum of (O-E)^2/E = 1.37 (3 s.f.)
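As a check, the pooled chi-squared statistic can be reproduced in a few lines of Python (a sketch, assuming lambda = 0.025 from the binned mean and treating the pooled last category as 90 s and beyond):

```python
import math

lam = 0.025                            # 1 / mean, with mean = 40 s from bin midpoints
F = lambda t: 1 - math.exp(-lam * t)   # exponential CDF

# Five categories after pooling 90-120 with 120+ (its expected count was below 5)
edges = [0, 20, 40, 60, 90]
observed = [41, 19, 16, 13, 9 + 2]

# Interval probabilities; the last category runs from 90 to infinity
probs = [F(b) - F(a) for a, b in zip(edges, edges[1:])] + [1 - F(90)]
expected = [100 * p for p in probs]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))  # 1.37, well below the 5% critical value of 7.815
```

Note that the probabilities now sum to exactly 1, so the expected frequencies sum to exactly 100.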

H0: the data follow an exponential distribution
H1: the data do not follow an exponential distribution
Use a 5% significance level.

DF = 5 - 1 - 1 = 3 (subtract 1 because the expected frequencies are constrained to match the total, and 1 for estimating lambda from the data)
Chi-squared critical value at 5% with 3 d.f. = 7.815
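If SciPy is handy, the critical value can be verified directly (a sketch; assumes scipy is installed):

```python
from scipy.stats import chi2

# Upper 5% point of the chi-squared distribution with 3 degrees of freedom
cv = chi2.ppf(0.95, df=3)
print(round(cv, 3))  # 7.815
```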

Since 1.37 < 7.815, we fail to reject H0. Conclude that the times in seconds between successive white cars in free-flowing traffic on an open road can be modeled by an exponential distribution.

I did A2 Statistics, a reliable source in itself.

gung - Reinstate Monica
Shubz
  • (+1) This is basically a good summary--although it would be nice to see some abbreviations explained--but there is one small technical fault: for the chi-squared distribution to be valid, the exponential parameter has to be fit using maximum likelihood rather than this approximate method of moments. It turns out also that your result is fairly sensitive to the *post hoc* decision to combine the last two categories. These increase the $\chi^2$ statistic to $5.402$ (based on an estimated exponential rate of $\hat\lambda=0.02556$), corresponding to $p=0.37$--still not significant by any standard. – whuber Jun 27 '14 at 16:57
  • (+1) How exactly are you ending up with a mean of 40? I don't fully understand your arguments for that. – k.dkhk Jun 11 '17 at 17:11