
I have 6 sets of Volume (v) and Duration (d) data. I have fitted quite a few distributions to the data, such as Weibull, Gamma, Log-Normal, Exponential, GEV, Pareto, Log-Logistic, Poisson, and GP (generalized Pareto). This is one of the data sets:

                    d            v
  [1,]              4         48.0
  [2,]             16         73.6
  [3,]              4         52.4
  [4,]              2         62.0
  [5,]             10         48.5
  [6,]             28         99.3
  [7,]              6         49.5
  [8,]             15         61.0
  [9,]              8         56.5
 [10,]             11         52.5
 [11,]             11         55.5
 [12,]              8         89.4
 [13,]             18         54.5
 [14,]              5         56.5
 [15,]              3         67.6
 [16,]              6         51.1
 [17,]              5        112.0
 [18,]             10         51.0
 [19,]             10         50.6
 [20,]             10         52.0
 [21,]              2         77.5
 [22,]              2         53.0
 [23,]              3         56.0
 [24,]              9         51.6
 [25,]              2         50.0
 [26,]              7        103.9
 [27,]              4         50.1
 [28,]              4         51.5
 [29,]              5         55.1
 [30,]             17         64.4
 [31,]             11         54.9
 [32,]              7         89.5
 [33,]              9         50.0
 [34,]             10         50.9
 [35,]              3         56.5
 [36,]              6         54.0
 [37,]              5         49.0
 [38,]              8         50.0
 [39,]              2         51.0
 [40,]              9         66.0
 [41,]              5         57.9
 [42,]              9         57.5
 [43,]             15         48.0
 [44,]              8         64.0
 [45,]              4         52.0
 [46,]              4         54.5
 [47,]              4         70.5
 [48,]              4         51.4
 [49,]              4         86.0
 [50,]              5         70.5
 [51,]              2         61.5
 [52,]             11         76.9
 [53,]             12         69.6
 [54,]              6         47.9
 [55,]              4         64.5
 [56,]              4         62.5
 [57,]              8         72.9
 [58,]              4         53.5
 [59,]              9         81.4
 [60,]             23         53.5
 [61,]              8         77.0
 [62,]              8         71.5
 [63,]              5         87.5
 [64,]             13         67.5
 [65,]              9         66.0
 [66,]              8        139.0
 [67,]              5         54.0
 [68,]             15         61.5
 [69,]              9         59.5
 [70,]              7         77.7
 [71,]             13         50.5
 [72,]             22         48.4
 [73,]              6         68.9
 [74,]              4         53.5
 [75,]              2         49.5
 [76,]              5         49.6
 [77,]              6         51.1
 [78,]             15         67.0
 [79,]              6         58.0
 [80,]              7         51.0
 [81,]             10         64.0
 [82,]              8         58.8
 [83,]             16        102.9
 [84,]              3         61.0
 [85,]             35         54.6
 [86,]             39        107.1
 [87,]              3         49.0
 [88,]              8         53.0
 [89,]             20         52.1
 [90,]             22         65.5
 [91,]             18         50.9
 [92,]             13         51.7
 [93,]             17         77.4
 [94,]             11         75.9
 [95,]              3         63.5
 [96,]             38        120.3
 [97,]              4         69.0
 [98,]              3         68.5
 [99,]             47         63.8
[100,]             72         91.2
[101,]             72         84.0
[102,]              9         57.5
[103,]              5         68.5
[104,]             48         88.8
[105,]              8         54.5
[106,]              3         74.5
[107,]             11         62.2
[108,]              3         65.5
[109,]             55         50.8
[110,]             48         96.0
[111,]             96         62.4
[112,]             54        111.4
[113,]             18         52.0
[114,]             48         79.2
[115,]             48         79.2
[116,]             72        144.0
[117,]              6         54.0
[118,]              5         78.0
[119,]              5         77.0
[120,]             16         51.3
[121,]              3         65.0
[122,]              8         64.5
[123,]              7         79.6
[124,]              4         48.9
[125,]              8         76.6
[126,]              6         50.5
[127,]              4         52.6
[128,]              3         81.1
[129,]              6         65.5
[130,]              7         61.0
[131,]              6         54.9
[132,]              2         57.5
[133,]              9         60.0
[134,]             10         54.0
[135,]              2         50.0
[136,]              5         57.5
[137,]              9         65.0
[138,]             10         50.6
[139,]              5         63.5
[140,]              7         62.6
[141,]              5        100.0
[142,]              2         49.5
[143,]              6         72.0
[144,]              5         81.5
[145,]              6         48.3
[146,]              4         49.0
[147,]             11         69.0
[148,]              7         49.0
[149,]             19         49.1
[150,]             11         75.5
[151,]              2         63.0
[152,]              5         74.5
[153,]              3         58.6
[154,]              5         49.4
[155,]             11         52.0
[156,]              2         50.0
[157,]              3        101.0
[158,]              8         72.5
[159,]              7         48.1
[160,]              2         51.0
[161,]             11         60.5
[162,]             11         50.1
[163,]              2         62.0
[164,]             10         51.6
[165,]              9         49.6
[166,]              3         56.1
[167,]             16         80.1
[168,]              6         81.4
[169,]              2         48.0
[170,]              4         52.5
[171,]              4         49.9
[172,]             19         63.1
[173,]             40         81.9
[174,]             12        105.5
[175,]              5         85.0
[176,]              6         56.4
[177,]              6         49.6
[178,]              5         64.1
[179,]             13         48.6
[180,]              8         54.5
[181,]              7         75.0
[182,]              7         64.5
[183,]              3         64.9
[184,]              3         54.6
[185,]              5         86.5
[186,]              2         51.0
[187,]              5         52.4
[188,]              3         55.0
[189,]              9         50.5
[190,]              9         96.0
[191,]              7         50.5
[192,]              2         49.5
[193,]              3         55.9
[194,]             13         65.0
[195,]              5         60.9
[196,]              6         49.0
[197,]             10         49.6
[198,]              2         60.5
[199,]              8         55.4
[200,]              4        107.5
[201,]              3         60.1
[202,]              8         64.5
[203,]              5         51.6
[204,]              3         54.0
[205,]              6         76.0
[206,]              3         64.5
[207,]              3         63.0
[208,]              6         73.0
[209,]             12         90.0
[210,]              5         62.0
[211,]              3         70.5
[212,]              3         95.0
[213,]             11         77.5
[214,]              5         61.1
[215,]              2         60.0
[216,]              2         48.0
[217,]              7         94.5
[218,]              7         68.0
[219,]              8         79.5
[220,]              4         60.4
[221,]              8         75.0
[222,]              5         55.0
[223,]             18         55.0
[224,]              2         67.0
[225,]              8        158.0
[226,]              7         91.5
[227,]              9         61.5
[228,]              4         73.0
[229,]              7         79.0
[230,]              2         67.5
[231,]              3         58.0
[232,]              6        102.5
[233,]              8         87.0
[234,]              8         74.5
[235,]              4         55.5
[236,]             18        112.5
[237,]             12         75.5
[238,]              3         57.5
[239,]              4         48.5
[240,]              5         55.0
[241,]             14         61.0
[242,]              8         85.4
[243,]              7         79.5
[244,]              5         59.5
[245,]              4         48.0
[246,]              3         72.0
[247,]              7         61.0
[248,]             13         50.0
[249,]              4         55.5
[250,]              2         48.0
[251,]              3         88.0
[252,]              9         55.5
[253,]              4        108.0
[254,]              7         52.6
[255,]              1         99.5
[256,]              2         60.0
[257,]             10        100.0
[258,]              2         53.5
[259,]              4         83.5
[260,]             12         83.0
[261,]              9         56.8
[262,]             15         68.1
[263,]              7        126.6
[264,]              6         54.5
[265,]              7         59.4
[266,]              9         59.1
[267,]              6         50.0
[268,]              6         52.5
[269,]              7         67.0
[270,]              4        129.0
[271,]             20         81.5
[272,]             19         57.5
[273,]              9         54.5
[274,]              6         55.5
[275,]              5         65.0
[276,]              4         53.0
[277,]              9         77.1
[278,]              7         81.5
[279,]              6         72.6
[280,]              6         61.4
[281,]              3         58.0
[282,]              3         59.5
[283,]              4         56.5
[284,]              4        126.1
[285,]              3         77.5
[286,]              3         84.5
[287,]             11         56.0
[288,]              2         62.0
[289,]              3         74.5
[290,]              5         82.0
[291,]              5         52.5
[292,]              8         52.5
[293,]             11         78.0
[294,]              2         57.5
[295,]             14         55.0
[296,]             14         59.5
[297,]              3         51.0
[298,]              2         52.5
[299,]              6         60.0
[300,]              6         88.5
[301,]              4         52.0
[302,]              3         56.0
[303,]              4         59.0
[304,]              3         87.0
[305,]              3         65.5
[306,]              6        108.5
[307,]              6         57.0
[308,]             17         52.0
[309,]              9         62.0
[310,]              7         56.0
[311,]             12         64.0
[312,]              7         54.0
[313,]             31         92.5
[314,]              8         73.0
[315,]              7         55.0
[316,]             26         73.5
[317,]             63         76.5
[318,]            315        117.5
[319,]             12         73.5
[320,]              5         54.0
[321,]              2         58.5
[322,]              7         83.0
[323,]              3         53.0
[324,]              3         48.0
[325,]             10         78.5
[326,]              3         72.5
[327,]              2         52.0
[328,]              4         57.0
[329,]              6         55.5
[330,]              7         57.0
[331,]              6         53.0
[332,]             13         52.5
[333,]              9         59.5
[334,]              8         79.0
[335,]              4         67.0
[336,]              8         73.0
[337,]              7         62.5
[338,]              4         80.5
[339,]              3         54.0
[340,]              6         58.0
[341,]              6         98.0
[342,]              2         49.0
[343,]              4         52.5
[344,]              2         55.0
[345,]             17         58.0
[346,]             13         80.0
[347,]             11         60.0
[348,]              3         83.5
[349,]              8         75.5
[350,]              4         67.0

I'm using the fevd function in the extRemes package to fit GEV and GP, and
the fitdist function in the fitdistrplus package to fit the other distributions. The code is basically like this:

library(fitdistrplus)  # provides fitdist()
library(extRemes)      # provides fevd()

fw1 <- fitdist(d, "weibull")
fw2 <- fitdist(v, "weibull")

fit1 <- fevd(d, type="GEV")
fit5 <- fevd(v, type="GEV")

But none of the distributions fits my data. Can anyone help me with the R code? Which distributions are suitable for my data? What other distributions can I try? I also tried this code. This is the first time I've done this and I'm not familiar with the distributions. Thank you for your help!

EDIT:

  • Why do you need to fit a distribution? What are you doing with it? – Glen_b Jul 01 '20 at 03:20
  • Because later I want to estimate a copula under a parametric assumption for my study. I really need help @Glen_b – Mia Jul 01 '20 at 06:14
  • Can you provide histograms of the data? Somebody might be able to suggest a suitable family with help from a histogram. – jcken Jul 01 '20 at 07:30
  • I just realized that I gave different data. I've edited it and added histograms @jcken – Mia Jul 01 '20 at 09:14
  • From the histograms, I would check log-normal and Gamma for "volume" and negative binomial for "duration". –  Jul 01 '20 at 10:07
  • I wouldn't. Neither lognormal nor gamma will work for volume when the minimum is so high compared to the median (you could also see it by taking logs and noticing that it will still be right skew). A count distribution for a duration doesn't really make a lot of sense to me – Glen_b Jul 06 '20 at 06:24
  • Hi, these are the extreme values at the 95th-percentile threshold of _volume_. The _Pareto_ fits the _volume_ just fine. The _lognormal_ and _GEV_ seem to fit the _duration_, but with low _p-values_ of _0.008_ and _0.02_, and I don't know if that's a good enough fit for me to use the distributions @Glen_b – Mia Jul 08 '20 at 01:54
  • The shifted (three parameter) lognormal or gamma would probably fit considerably better – Glen_b Jul 09 '20 at 01:44
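
As a rough illustration of the shifted (three-parameter) lognormal suggested in the last comment, here is a minimal base-R sketch that fits it by maximum likelihood with `optim`. The data are synthetic stand-ins (not the data in the question), and the function name `nll_shifted_lnorm` and the starting values are assumptions made for this example. Note that the likelihood of this model can behave badly when the shift gets very close to the sample minimum, so the result should be treated as a local optimum and checked against a plot.

```r
# Synthetic stand-in for the volume data (NOT the data from the question):
# a log-normal shifted so that its support starts at 47.
set.seed(1)
v <- 47 + rlnorm(350, meanlog = 2.5, sdlog = 1)

# Negative log-likelihood of a shifted log-normal.
# par = (shift, meanlog, log(sdlog)); the shift must stay below min(x).
nll_shifted_lnorm <- function(par, x) {
  shift <- par[1]
  if (shift >= min(x)) return(1e10)  # large finite penalty outside the support
  -sum(dlnorm(x - shift, meanlog = par[2], sdlog = exp(par[3]), log = TRUE))
}

# Start with the shift a little below the sample minimum
start <- c(min(v) - 1, mean(log(v - min(v) + 1)), 0)
fit <- optim(start, nll_shifted_lnorm, x = v)

est <- c(shift = fit$par[1], meanlog = fit$par[2], sdlog = exp(fit$par[3]))
print(round(est, 3))
```

A shifted gamma could be handled the same way by replacing `dlnorm` with `dgamma` and adjusting the parameters.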

1 Answer


EDIT: an important warning about choosing the likelihood from the density plot of the data: there is a risk of overfitting. Here is a good answer about how to reduce this risk: https://stats.stackexchange.com/a/20738/289381 (Gelman's advice is to use leave-one-out cross-validation).

From the density plot of the "volume" variable, its distribution looks like a mixture of two log-normal distributions (notice the "shoulder" peak on the right):

[Figure: density plot of volume]

You can find some details about a mixture of two Gaussians here https://stats.stackexchange.com/a/474775/289381

This is how I fitted its parameters with a Bayesian approach, using the R package brms:

library(brms)

dat <- read.csv("data.csv")
colnames(dat) <- c("d", "v")

mix <- mixture("lognormal", "lognormal")
mdl_1 <- brm(v ~ 1, data=dat, family=mix)  # Using the default priors

This is the plot of the posteriors (the names of the parameters are quite clear; $\theta$ represents the probability of belonging to the first or the second component of the mixture):

plot(mdl_1, N=6)

[Figure: posterior plots of the mixture parameters]
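
To make the role of $\theta$ concrete, here is a minimal base-R sketch of a two-component log-normal mixture density and of drawing values from it. The parameter values (`theta1`, `mu`, `sigma`) and the helper names `dmix` and `rmix` are made up for illustration; they are not the fitted posterior estimates.

```r
# Hypothetical parameter values, for illustration only
theta1 <- 0.45          # mixing weight of component 1 (component 2 gets 1 - theta1)
mu     <- c(4.0, 4.4)   # means on the log scale
sigma  <- c(0.15, 0.30) # standard deviations on the log scale

# Density of the mixture: a weighted sum of the two log-normal densities
dmix <- function(y) {
  theta1 * dlnorm(y, mu[1], sigma[1]) +
    (1 - theta1) * dlnorm(y, mu[2], sigma[2])
}

# Draw n values: pick a component for each draw, then sample from that component
rmix <- function(n) {
  k <- ifelse(runif(n) < theta1, 1, 2)
  rlnorm(n, meanlog = mu[k], sdlog = sigma[k])
}

set.seed(42)
y_sim <- rmix(1000)
summary(y_sim)
```

For the fitted brms model itself, `posterior_predict(mdl_1)` should return draws from the posterior predictive distribution, which is the usual way to generate data from the model.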

This is the posterior predictive check, which shows that the fitted distribution captures the data reasonably well:

pp_check(mdl_1, nsamples = 50)

[Figure: posterior predictive check for volume]

duration

The data looks highly skewed to the right. I am not sure if the outliers are expected or are the result of something that went wrong.

The log-normal distribution seems to fit the data well, as you can see from the posterior predictive distribution:

[Figure: posterior predictive check for duration]

These are the posteriors for the mean and standard deviation of the log-normal distribution (both on the log scale):

[Figure: posteriors of the log-normal parameters for duration]

This is the code (using brms):

mdl_ln <- brm(d ~ 1, data=dat, family="lognormal")
plot(mdl_ln)
pp_check(mdl_ln, nsamples = 50)
  • Hi, thank you for your answer! I've tried to fit a negative binomial using the `fitdist` function but I get a very low _p-value_ for the chi-square test. I've never tried the Bayesian approach or mixture models. How do I generate data from the model? – Mia Jul 01 '20 at 16:23
  • And how do I read the plot? Is theta1 for the first peak? Is the value between 0.3 and 0.6? – Mia Jul 01 '20 at 16:37
  • You have a couple of strong outliers in "d". Depending on what you are trying to achieve with the data, there are multiple approaches. It depends on the information you have about the process that has generated the data. Are these extreme values expected? About $\theta$, the distribution for "volume" is a mixture of two log-normal distributions with mean $\mu_1$, $\mu_2$, and variance $\sigma_1^2$, $\sigma_2^2$. $\theta$ is the mixing weight, such that $f(y) = \theta_1 * f_1(y|\mu_1, \sigma_1^2) + \theta_2 * f_2(y|\mu_2, \sigma_2^2)$. –  Jul 01 '20 at 17:11
  • Although I am not a fan of data transformations, you can try to fit $\log(d)$. A normal distribution would work, even though you still have another peak to the right (check with `plot(density(log(dat$d)))`). Another option is fitting a log-normal distribution (without transforming the data). This seems to work fine. Remember that log-normal and normal of log are different things. –  Jul 01 '20 at 19:23
  • Thanks for the explanation! Yes, actually these are extreme values at the 95th-percentile threshold of _volume_. I was thinking about data transformation but wasn't sure about that. How do I know if it is okay to transform data? I've tried to fit a log-normal but it doesn't fit, and a Pareto fits just fine. I'll check again. – Mia Jul 02 '20 at 00:31
  • Have a look at my updated answer above. It’s always ok to transform the data: if it makes sense for your data, you keep it in mind when you interpret the distribution parameters. –  Jul 02 '20 at 08:46
  • Hi! Thanks for your answer. The Pareto fits _volume_ just fine, not _duration_; I said it wrong. Are there any distributions that I can try other than _lognormal_ and _GEV_ for _duration_? I've tried _GEV_, and it seems to fit as well as _lognormal_ – Mia Jul 08 '20 at 01:35
  • Choosing the likelihood for your data should be based on general principles underlying the data generation process. I would avoid basing it on how nicely it fits the data, because there's a risk of overfitting. You may want to think about how the data has been generated and which distribution captures that process. This will help you to understand if the extreme values are outliers or not. –  Jul 08 '20 at 08:42
  • One way to check which distribution works "better" is by leave-one-out (LOO) cross-validation. You can fit two `brms` models using different distributions and then check which one fits the data better https://paul-buerkner.github.io/brms/reference/loo.brmsfit.html –  Jul 08 '20 at 09:31
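
The leave-one-out idea from the last comment can be sketched in base R without any Bayesian machinery. This is a plug-in maximum-likelihood analogue, not the PSIS-LOO that brms computes, and it uses synthetic data rather than the data in the question. The helper names `fit_lnorm`, `fit_gamma` and `loo_lpd` are made up for this example.

```r
set.seed(123)
# Synthetic stand-in for the duration data (NOT the data from the question)
d <- rlnorm(150, meanlog = 1.9, sdlog = 0.8)

# Each fitter returns a function giving the fitted log-density at new points
fit_lnorm <- function(x) {
  m <- mean(log(x))
  s <- sqrt(mean((log(x) - m)^2))  # closed-form ML estimates
  function(y) dlnorm(y, m, s, log = TRUE)
}
fit_gamma <- function(x) {
  nll <- function(p) {
    -sum(dgamma(x, shape = exp(p[1]), rate = exp(p[2]), log = TRUE))
  }
  # Method-of-moments starting values, optimised on the log scale
  p0 <- log(c(mean(x)^2 / var(x), mean(x) / var(x)))
  p <- optim(p0, nll)$par
  function(y) dgamma(y, shape = exp(p[1]), rate = exp(p[2]), log = TRUE)
}

# Leave-one-out log predictive density: refit without point i, then score point i
loo_lpd <- function(x, fit) {
  sum(vapply(seq_along(x), function(i) fit(x[-i])(x[i]), numeric(1)))
}

lpd <- c(lognormal = loo_lpd(d, fit_lnorm), gamma = loo_lpd(d, fit_gamma))
print(lpd)  # the larger (less negative) value suggests better out-of-sample fit
```

With brms, the corresponding step would be to call `loo()` on each fitted model and pass the results to `loo_compare()`.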