14

Load the package needed.

library(ggplot2)
library(MASS)

Generate 10,000 numbers fitted to gamma distribution.

x <- round(rgamma(100000,shape = 2,rate = 0.2),1)
x <- x[which(x>0)]

Draw the probability density function, supposed we don't know which distribution x fitted to.

t1 <- as.data.frame(table(x))
names(t1) <- c("x","y")
t1 <- transform(t1,x=as.numeric(as.character(x)))
t1$y <- t1$y/sum(t1[,2])
ggplot() + 
  geom_point(data = t1,aes(x = x,y = y)) + 
  theme_classic()

pdf

From the graph, we can learn that the distribution of x is quite like gamma distribution, so we use fitdistr() in package MASS to get the parameters of shape and rate of gamma distribution.

fitdistr(x,"gamma") 
##       output 
##       shape           rate    
##   2.0108224880   0.2011198260 
##  (0.0083543575) (0.0009483429)

Draw the actual point(black dot) and fitted graph(red line) in the same plot, and here is the question, please look the plot first.

ggplot() + 
  geom_point(data = t1,aes(x = x,y = y)) +     
  geom_line(aes(x=t1[,1],y=dgamma(t1[,1],2,0.2)),color="red") + 
  theme_classic()

fitted graph

I have two questions:

  1. The real parameters are shape=2, rate=0.2, and the parameters I use the function fitdistr() to get are shape=2.01, rate=0.20. These two are nearly the same, but why the fitted graph don't fit the actual point well, there must be something wrong in the fitted graph, or the way I draw the fitted graph and actual points is totally wrong, what should I do?

  2. After I get the parameter of the model I establish, in which way I evaluate the model, something like RSS(residual square sum) for linear model, or the p-value of shapiro.test() , ks.test() and other test?

I am poor in statistical knowledge, could you kindly help me out?

ps: I have search in the Google, stackoverflow and CV many times, but found nothing related to this problem

Ling Zhang
  • 143
  • 1
  • 1
  • 6
  • 1
    I first asked this question in stackoverflow, but it seemed to be that this question belongs to CV, the friend said I misunderstood the probability mass function and probability density function, I could not grasp it completely, so forgive me for answering this question again in CV – Ling Zhang Jan 13 '16 at 07:39
  • 1
    Your calculation of densities is incorrect. A simple way to calculate is `h –  Jan 13 '16 at 08:31
  • @Pascal you are right, I have solved Q1, thank you! – Ling Zhang Jan 13 '16 at 08:50
  • See answer below, `density` function is a useful one. –  Jan 13 '16 at 08:59
  • I get it, thank you again for editing and solving my question – Ling Zhang Jan 13 '16 at 10:26

2 Answers2

16

Question 1

The way you calculate the density by hand seems wrong. There's no need for rounding the random numbers from the gamma distribution. As @Pascal noted, you can use a histogram to plot the density of the points. In the example below, I use the function density to estimate the density and plot it as points. I present the fit both with the points and with the histogram:

library(ggplot2)
library(MASS)

# Generate gamma rvs

x <- rgamma(100000, shape = 2, rate = 0.2)

den <- density(x)

dat <- data.frame(x = den$x, y = den$y)

# Plot density as points

ggplot(data = dat, aes(x = x, y = y)) + 
  geom_point(size = 3) +
  theme_classic()

Gamma density

# Fit parameters (to avoid errors, set lower bounds to zero)

fit.params <- fitdistr(x, "gamma", lower = c(0, 0))

# Plot using density points

ggplot(data = dat, aes(x = x,y = y)) + 
  geom_point(size = 3) +     
  geom_line(aes(x=dat$x, y=dgamma(dat$x,fit.params$estimate["shape"], fit.params$estimate["rate"])), color="red", size = 1) + 
  theme_classic()

Gamma density fit

# Plot using histograms

ggplot(data = dat) +
  geom_histogram(data = as.data.frame(x), aes(x=x, y=..density..)) +
  geom_line(aes(x=dat$x, y=dgamma(dat$x,fit.params$estimate["shape"], fit.params$estimate["rate"])), color="red", size = 1) + 
  theme_classic()

Histogramm with fit

Here is the solution that @Pascal provided:

h <- hist(x, 1000, plot = FALSE)
t1 <- data.frame(x = h$mids, y = h$density)

ggplot(data = t1, aes(x = x, y = y)) + 
  geom_point(size = 3) +     
  geom_line(aes(x=t1$x, y=dgamma(t1$x,fit.params$estimate["shape"], fit.params$estimate["rate"])), color="red", size = 1) + 
  theme_classic()

Histogram density points

Question 2

To assess the goodness of fit I recommend the package fitdistrplus. Here is how it can be used to fit two distributions and compare their fits graphically and numerically. The command gofstat prints out several measures, such as the AIC, BIC and some gof-statistics such as the KS-Test etc. These are mainly used to compare fits of different distributions (in this case gamma versus Weibull). More information can be found in my answer here:

library(fitdistrplus)

x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
       38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
       42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
       49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
       45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
       36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
       38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

fit.weibull <- fitdist(x, "weibull")
fit.gamma <- fitdist(x, "gamma", lower = c(0, 0))

# Compare fits 

graphically

par(mfrow = c(2, 2))
plot.legend <- c("Weibull", "Gamma")
denscomp(list(fit.weibull, fit.gamma), fitcol = c("red", "blue"), legendtext = plot.legend)
qqcomp(list(fit.weibull, fit.gamma), fitcol = c("red", "blue"), legendtext = plot.legend)
cdfcomp(list(fit.weibull, fit.gamma), fitcol = c("red", "blue"), legendtext = plot.legend)
ppcomp(list(fit.weibull, fit.gamma), fitcol = c("red", "blue"), legendtext = plot.legend)

@NickCox rightfully advises that the QQ-Plot (upper right panel) is the best single graph for judging and comparing fits. Fitted densities are hard to compare. I include the other graphics as well for the sake of completeness.

Compare fits

# Compare goodness of fit

gofstat(list(fit.weibull, fit.gamma))

Goodness-of-fit statistics
                             1-mle-weibull 2-mle-gamma
Kolmogorov-Smirnov statistic    0.06863193   0.1204876
Cramer-von Mises statistic      0.05673634   0.2060789
Anderson-Darling statistic      0.38619340   1.2031051

Goodness-of-fit criteria
                               1-mle-weibull 2-mle-gamma
Aikake's Information Criterion      519.8537    531.5180
Bayesian Information Criterion      524.5151    536.1795
COOLSerdash
  • 25,317
  • 8
  • 73
  • 123
  • cool, I get it, and I get into trouble with the statistical knowledge, thank you for helping me out!Further more, to the Q2, if I want to test the model, what should I do? Do the ks.test(), shapiro.test() to see the p-value, or something else, like AIC()? – Ling Zhang Jan 13 '16 at 08:49
  • 1
    I cannot revise, but you have a problem with the backtick for `fitdistrplus` and `gofstat` in your ansewer –  Jan 13 '16 at 09:31
  • 3
    One-line recommendation: the quantile-quantile plot is the best single graph for this purpose. Comparing observed and fitted densities is hard to do well. For example, it is hard to spot systematic deviations at high values that scientifically and practically are often very important. – Nick Cox Jan 13 '16 at 10:35
  • @NickCox I agree completely. In practice, I would only look at the QQ-Plot to compare fits. I'll add a statement to my answer shortly. – COOLSerdash Jan 13 '16 at 10:46
  • 2
    Glad we agree. The OP starts with 10,000 points. Many problems start with far fewer and then getting a good idea of the density can be problematic. – Nick Cox Jan 13 '16 at 10:50
  • @COOLSerdash thank you, I learn much from your answer. And I know these three tests, they will compare with 0.05, what about AIC, BIC? The bigger, the better? From the graph and data like AIC, BIC, the gamma distribution fits well? what's more, what a great example, weibull and gamma distribution are quite the same – Ling Zhang Jan 13 '16 at 11:30
  • @NickCox these 10,000 points are just the example, due to the privacy, I could not show the raw data directly, which is 100 times bigger than these points. Thank you for solving my question, how kind you are – Ling Zhang Jan 13 '16 at 11:33
  • @COOLSerdash one more question, from the numerical the aspect, I could see which one is better to fit the point. But from the graphical aspect, how could I tell which one is better? – Ling Zhang Jan 13 '16 at 11:37
  • @LingZhang Sure, but the point of a forum is that threads may interest others, and many (most) of those others won't have a million data points. – Nick Cox Jan 13 '16 at 11:38
  • @NickCox I can't agree more, thank you again! One more question, you said the quantile-quantile plot is the best single graph for this purpose. And the weibull and gamma distribution are quite the same, from the example qq-plot, how can I tell which one is better – Ling Zhang Jan 13 '16 at 11:44
  • 1
    @LingZhang To compare fits, you could look at the value of the AIC. The fit with the lowest AIC is preferred. Also, I disagree that the Weibull and Gamma distribution are quite the same in the QQ-Plot. The points of the Weibull fit are closer to the line compared with the Gamma fit, especially at the tails. Correspondingly, the AIC for the Weibull fit is smaller compared to the Gamma fit. – COOLSerdash Jan 13 '16 at 11:47
  • 1
    Straighter is better. Also, see http://stats.stackexchange.com/questions/111010/interpreting-qqplot-is-there-any-rule-of-thumb-to-decide-for-non-normality The principles are the same. Systematic deviation from linearity is a problem. – Nick Cox Jan 13 '16 at 11:48
  • @COOLSerdash thank you, I agree with you completely. Generally, gamma and weibull distribution have similar shape of graph and form of probability density function, so the example you mentioned to compare is quite excellent – Ling Zhang Jan 13 '16 at 12:04
  • @NickCox I am reading the page you mentioned, thank you, both of you. You solve the question bothered me for several days, thank you again. Now, it's the time to learn more basic statistical knowledge – Ling Zhang Jan 13 '16 at 12:08
  • @COOLSerdash using your method to solve my data, now I have a question, `gofstat()` produces the result of Goodness-of-fit statistics, there are three statistic, what can I learn from these three statistics – Ling Zhang Jan 15 '16 at 07:31
  • @LingZhang Please consult the [manual](https://cran.r-project.org/web/packages/fitdistrplus/fitdistrplus.pdf) of the `fitdistrplus' package for more information on `gofstat' (page 33). You can also google these tests or search them on this site. – COOLSerdash Jan 15 '16 at 08:39
  • @COOLSerdash the AIC and BIC values `gofstat` generated are quite big and similar, how can I calculate the AIC weigths which the range is between 0 and 1. e.g, there are three distribution: gamma,weibull,exponential, if gamma distribution fits well, the AIC weights equal to 1, expoential does not fit well, the AIC weights equal to 0, and weibull might be some values between 0 and 1? – Ling Zhang Jan 16 '16 at 03:05
  • @NickCox what about your opinions about the AIC weights I mentioned above? – Ling Zhang Jan 16 '16 at 03:08
  • 1
    If graphical evidence was equivocal, I would not use AIC to decide between apparently equally good (bad) fits. If graphical evidence was clear-cut, I would not look at AIC at all. Thinking about graphs does require some judgment but does allow experience to be used. That doesn't help the first time you do it, but everybody wobbles when learning how to ride a bicycle. These are my opinions. – Nick Cox Jan 16 '16 at 10:11
  • @NickCox Sorry to bother you again, I have some questions about these three statistics. Are the values the distance to the real data, so it means that the fewer, the better? Or should I compare the values with p-value 0.05? – Ling Zhang May 04 '16 at 14:46
  • @COOLSerdash Sorry to bother you again, I have some questions about these three statistics. Are the values the distance to the real data, so it means that the fewer, the better? Or should I compare the values with p-value 0.05? – Ling Zhang May 04 '16 at 14:49
  • What are "the values"? – Nick Cox May 04 '16 at 15:15
  • @LingZhang Do you mean the values resulting from `gofstat`? The upper three rows display the different statistics of the corresponding tests. The lower two rows display the AICs and BICs. For the AIC and BIC, lower is better. For the interpretation of the upper values, please consult this site, the documentation of the package "fitdistrplus" or a book. – COOLSerdash May 04 '16 at 16:39
  • @COOLSerdash year, I mean the values resulting from `gofstat`, especially these three values generated by the three different statistics. Furthermore, I email to the package author, it's the distance to real data, so lower is better. – Ling Zhang May 05 '16 at 00:04
  • @NickCox Sorry to make my question clear, I mean the values resulting from `gofstat`, especially these three values generated by the three different statistics. Furthermore, I email to the package author, it's the distance to real data, so lower is better. – Ling Zhang May 05 '16 at 00:06
  • @COOLSerdash sir, I have another questions now. In addition to `gamma` and `weibull`, I want to add `power law` and `power law with exponential cutoff` distributions to compare, Due to these two distributions are not the base probability function, I cannot use `fitdist` directly, so I should write the d,p,q function of these two distributions based on the example 4 of `?fitdist`, you could see more details about the question on [question](http://stats.stackexchange.com/questions/211966/how-to-get-parameters-of-power-law-and-power-law-with-exponential-cutoff-distrib) – Ling Zhang May 11 '16 at 14:39
  • @COOLSerdash I try to write the d,p,q function, but it dose not work, thank you for your help – Ling Zhang May 11 '16 at 14:43
  • @NickCox sir, I have another questions now. In addition to `gamma` and `weibull`, I want to add `power law` and `power law with exponential cutoff` distributions to compare, Due to these two distributions are not the base probability function, I cannot use `fitdist` directly, so I should write the d,p,q function of these two distributions based on the example 4 of `?fitdist`, you could see more details about the question on [question](http://stats.stackexchange.com/questions/211966/how-to-get-parameters-of-power-law-and-power-law-with-exponential-cutoff-distrib) – Ling Zhang May 11 '16 at 14:44
  • 1
    Ling Zhang: Please don't hassle people about threads you've started. That one looks off-topic in any case. (copy @COOLSerdash). – Nick Cox May 11 '16 at 16:28
1

Addition to the answer provided above, but using density and adding some theme bling:

library(fitdistrplus)
library(ggplot2)
library(ggthemes)

fit.gamma <- fitdist(df$y, "gamma", lower=c(0,0), start=list(scale=1,shape=1))
gammas<-round(rgamma(nrow(df), shape = fit.gamma$estimate["shape"], scale = fit.gamma$estimate["scale"]), 1)

gammas %>%
  as_tibble() %>%
  ggplot(aes(value)) +
  geom_histogram(stat = "density") +
  stat_function(fun = function(x) {dgamma(x, fit.gamma$estimate["shape"],scale=fit.gamma$estimate["scale"])}, color = "red") +
  lims(x = c(0, 15)) +
  theme_economist() +
  scale_color_economist()  

gamma_density