5

I am a PhD student.

I have a data set (waiting time in minutes between tweets) which looks almost symmetrically to the naked eye.

I've tried a couple of distribution fits to this data and the closest I can get is a cauchy distribution with a p value = 0.02. I'd like to see if the data is a reasonable fit to a Laplace / Double Exponential Distribution.

mydata = c(251, 178, 342, 252, 253, 213, 335, 273, 250, 325, 253, 252, 254, 252, 240, 248, 
       253, 250, 250, 247, 257, 259, 250, 254, 251, 250, 251, 254, 248, 265, 239, 260, 
       253, 311, 252, 311, 250, 249, 251, 212, 289, 243, 253, 252, 254, 249, 250,
       259, 268, 346, 312, 263, 287, 281, 334, 239, 218, 280,   5, 255, 251, 255, 
       266, 325, 248, 249, 250, 251, 171, 326, 195, 198, 281, 271, 265, 267, 250, 251, 
       278, 264, 252, 265, 250, 243, 267, 250, 252, 253, 244, 252, 259, 132, 275, 182,
       336, 250, 251, 253, 358, 252, 276, 281, 255, 252, 191, 277, 283, 193, 213, 
       268, 277, 250, 236, 241, 296, 242, 249, 251, 250, 262, 250, 219, 263, 267, 245, 
       254, 251, 251, 234, 259, 264, 261, 246, 254, 264, 276, 236, 245, 253, 222, 240,
       250, 250, 252, 239, 254, 250, 263, 267, 251, 255, 256, 252, 243, 257,
       251, 252, 252, 242, 229, 250, 265, 252, 237, 270, 212, 268, 290, 256, 239, 239, 
       263, 251, 248, 252, 249, 241, 268, 261, 254, 256, 258, 250, 251, 250, 259, 257, 197,
       282, 461, 257, 250, 250, 250, 251, 253, 253, 251, 250, 263, 247, 254, 
       251, 256, 250, 250, 177, 305, 275, 203, 260, 250, 251, 252, 239, 274, 167, 262, 
       251, 272, 251, 264, 250, 256, 226, 257, 270, 240, 239, 255)

Which gives an interesting histogram

hist(mydata, breaks = 25, freq =F)

I looked at a couple of prior posts on this subject and they appear to use the nls function Double exponential fit in R however the data that is used appears to have two variable to model.

I then looked at converting my data to a frequency table and trying the above solution:

nonlin <- function(t, a, b, c) { a * (exp(-(abs(t-c) / b))) }
nlsfit <- nls(Freq ~ nonlin(times, a, b, c), data=mydataframe, start=list(a=2, b=2, c=2.5))

However I get an error (singular gradient matrix at initial parameter estimates) as the starting parameters appear bad. Then i found another post Why is nls() giving me "singular gradient matrix at initial parameter estimates" errors? that uses some code to estimate the starting parameters.

c.0 <- min(mydataframe$times) * 0.5
model.0 <- lm(log(times - c.0) ~ Freq, data=mydataframe)
start <- list(a=exp(coef(model.0)[1]), b=coef(model.0)[2], c=c.0)
model <- nls(times ~ a * exp(b * Freq) + c, data = mydataframe, start = start)

I get another error running the last line

Error in nls(times ~ a * exp(b * Freq) + c, data = mydataframe, start =     start) : 
  step factor 0.000488281 reduced below 'minFactor' of 0.000976562

Any help folks can provide is much appreciated. Jonathan

Jonathan Dunne
  • 452
  • 4
  • 15
  • 1
    The maximum likelihood estimator of the location parameter for the double exponential is the median ... – kjetil b halvorsen May 25 '17 at 16:02
  • 1
    What are you going to do with the fitted distribution? *How* are you testing goodness of fit? (e.g. the p-value for the Cauchy) Why are you performing goodness of fit tests on your fitted distributions? No simple distributional form will actually be "correct" so any simple distribution you choose *should* be rejected if your sample is large enough. – Glen_b May 26 '17 at 03:21
  • With a fitted distribution i can use the known properties of the distribution to make inferrences around expected duration between tweets. To test GoF formally i plan to use Anderson-Darling. I want to check there my hypothesis: Is a Laplace distribution a reasonable fit to model the waiting times between tweets. Yes I am aware the fit will not be perfect (i'm not expecting it to be so) I am also aware that the shape of the data may change with a larger sample – Jonathan Dunne May 26 '17 at 09:42
  • An Anderson-Darling -- or any other test -- will tell you if you have a good fit. In large samples like you have it will reject perfectly reasonable approximations (while missing others that may matter more to whatever you're doing). I wouldn't say you have a good fit, but it may be be perfectly adequate for some purposes; Anderson-Darling will reject it easily though. – Glen_b May 26 '17 at 10:18
  • I used a plaplace function to generate an AD GoF and the hypothesis was rejected as follows: AD = 7.6687, p-value = 0.0001638. Interestingly if i chop the data in half at it's median one side fits an exponential very well (p-value = 0.18) but that is another story entirely – Jonathan Dunne May 29 '17 at 14:32

1 Answers1

4

You're trying to fit a density function. The question you link to is trying to fit a curve to y vs x data. They're different exercises. Nonlinear least squares may be suitable for that other question but it's not suitable for your problem.

Let's take the double exponential as a given, with density:

$$f(x;\mu,\tau)=\frac{1}{2\,\tau} \exp \left(-\frac{|x-\mu|}\tau \right) \, ,\: \small{{-\infty<x<\infty,\, -\infty<\mu<\infty, \,\tau>0}}$$

Then a good estimator of the location parameter $\mu$ is the sample median, and a good estimator of the scale $\tau$ is the mean deviation from the median (specifically, these are maximum likelihood estimators).

This would give estimates on your sample of $\hat \mu=252$ and $\hat\tau\approx 17.8$:

m = median(tweettimes)
data.frame(m = m, t = mean(abs(tweettimes-m)) )
    m        t
1 252 17.79565

It seems to fit at least moderately well --

Histogram of data with fitted density and double exponential Q-Q plot

but we can see that the tails of the data are heavier than that of the model.

Glen_b
  • 257,508
  • 32
  • 553
  • 939
  • Thanks Glen, what line of code did you use to overlay the Laplace curve to the histogram did you create a density function for laplace in the form: `hist(tweettimes, breaks = 25, freq = F) curve(dlaplace(x, a = 252, b = 17.79565), col = "red", add = TRUE)` – Jonathan Dunne May 26 '17 at 09:46
  • Something quite close to that, yes. I wrote a function for dlaplace (... `function(x,m=0,t=1) exp(-abs(x-m)/t )/2/t` ...) -- I have a couple already in several packages but by the time I found one and loaded the package I could have written it twice over so I just wrote it. I also wrote a little qlaplace function for the quantile function to use in the second plot; the red line is the fitted Laplace. The x-axis is the theoretical quantiles (unfortunately I didn't use a good label on that axis) – Glen_b May 26 '17 at 09:59