
I have been trying to find a method to analyse variance on Weibull and/or Gamma distributions but a Google search for

anovar  Weibull "gamma distribution"

yields nothing helpful. The data I have cannot be fitted to a normal distribution but fits a Weibull or Gamma distribution quite well.

OtagoHarbour
  • Better to search for "ANOVA" than "anovar". Does this help? http://www.iaeng.org/publication/IMECS2010/IMECS2010_pp2051-2056.pdf – Stephan Kolassa Mar 28 '14 at 21:46

1 Answer


Besides the issue of the misspelling of 'anova' that Stephan mentioned, I have a few points.

1) I'd also suggest inserting "|" (for "or") in your search, since otherwise the terms will be treated more like "and".

2) Further, the word 'distribution' goes as much with Weibull as it does with gamma, so I suggest a search like so: anova weibull|gamma distribution.

3) Further still, in the case of the gamma, an ANOVA-type model would normally be fitted using GLMs, so you may prefer to search on gamma GLM.

4) Parametric Weibull models are often available under options relating to survival analysis in many statistics packages; while survival models often have censored data, they don't have to, so Weibull models with grouping factors as IVs ("ANOVA-like" models) can often be fitted that way.


The usual way to compare gamma means would be via a GLM.

This has the underlying assumption of equal shape parameters (much as ordinary ANOVA carries the assumption of equal variances).

This assumption can be assessed, for example, either visually (by looking at whether they seem to have similar shapes), or by finding MLEs of the shape parameters of the groups being compared. If the values are not too dissimilar then this approach should work fine.
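As a rough sketch of the second check (MLEs of the shape per group), something like the following could be used. It assumes the built-in chickwts data and uses gamma.shape() from the MASS package on an intercept-only gamma GLM for each group; the name shape_by_group is just for illustration.

```r
library(MASS)  # for gamma.shape()

# Fit an intercept-only gamma GLM within each feed group and
# extract the ML estimate of the shape parameter for that group.
shape_by_group <- sapply(split(chickwts, chickwts$feed), function(d) {
  fit <- glm(weight ~ 1, family = Gamma, data = d)
  gamma.shape(fit)$alpha
})

# One shape estimate per feed group; compare their rough magnitudes.
print(round(shape_by_group, 1))
```

With groups this small the shape estimates are noisy, so only gross differences should be read into them.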

[Similarly, a comparison of Weibull means might be achieved by treating the values as (uncensored) survival times in a survival model.]

On the other hand if it's not expected that the gammas are at least reasonably similar in shape, it's more complicated; one might try to form a confidence interval for the difference in means in any of several ways. In large samples, one might try bootstrapping, for example, or the distribution of a ratio of ML estimates of the mean (difference in logs) might be approximated or even assessed via simulation.
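For illustration only, a simple percentile bootstrap interval for a difference in two group means might look like the sketch below. It assumes two feed groups from chickwts as stand-in data (these groups are small, so this shows the mechanics rather than a recommended analysis); the seed and the number of resamples B are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

x <- chickwts$weight[chickwts$feed == "casein"]
y <- chickwts$weight[chickwts$feed == "horsebean"]

B <- 2000  # number of bootstrap resamples (arbitrary)

# Resample each group independently, with replacement,
# and record the difference in resampled means.
boot_diff <- replicate(B, mean(sample(x, replace = TRUE)) -
                          mean(sample(y, replace = TRUE)))

# Percentile interval for the difference in means.
ci <- quantile(boot_diff, c(0.025, 0.975))
print(ci)
```

Resampling on the log scale instead (difference in log means) would give an interval for the ratio of means.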


For comparison purposes, the following can be done in R (the data set is built in):

summary(lm(weight~feed,chickwts)) # ordinary linear model (one-way layout)

summary(glm(weight~feed,family=Gamma(link="identity"),chickwts)) # gamma equivalent

summary(glm(weight~feed,family=Gamma(link="log"),chickwts)) # gamma with log-link

library(survival) # provides survreg() and Surv()
summary(survreg(Surv(weight)~feed,data=chickwts))  # Weibull model (survreg's default distribution)

anova(...) can be used in place of summary(...) in those calls to obtain other information.

The first model is an ordinary one-way anova type model.

The second is the equivalent using a gamma model (in this case with essentially identical parameter estimates to the first model but different standard errors).

The third model is also a gamma anova-type model but where the parameters describe effects on the log scale (a test of the model is still a test for differences in means on the original scale, though).

The fourth model is a Weibull model, which has log-scale parameter estimates (which can be compared with the third model).
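To see those comparisons side by side, the four fits can be stored and their coefficients tabulated. This is only a sketch on the same chickwts data; the object names (fit_lm and so on) are made up for the example.

```r
library(survival)  # for survreg() and Surv()

fit_lm   <- lm(weight ~ feed, data = chickwts)
fit_gid  <- glm(weight ~ feed, family = Gamma(link = "identity"), data = chickwts)
fit_glog <- glm(weight ~ feed, family = Gamma(link = "log"), data = chickwts)
fit_wb   <- survreg(Surv(weight) ~ feed, data = chickwts)

# Models 1 and 2: same coefficient estimates (standard errors differ).
print(round(cbind(lm = coef(fit_lm), gamma_identity = coef(fit_gid)), 3))

# Models 3 and 4: both are parameterised on the log scale.
print(round(cbind(gamma_log = coef(fit_glog), weibull = coef(fit_wb)), 3))

# Overall test of the feed effect in the gamma log-link model.
print(anova(fit_glog, test = "F"))
```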

Glen_b
  • I understand this a bit better now. My problem is that the sample sizes of the distributions are quite different. One has 34 observations while the other has 2465 observations. Thanks, – OtagoHarbour Apr 04 '14 at 21:55
  • It's not clear to me what you're trying to achieve. Can you clarify what your variables and problems are? – Glen_b Apr 05 '14 at 00:30
  • Sorry about my slow reply. I have two gamma distributions with different sample size. I have managed to determine their shape and scale and from that their mean and SD. What I would like to do is to determine the p value reflecting the significance of the difference in the means. Thanks, – OtagoHarbour Apr 06 '14 at 12:28
  • This doesn't clarify in the sense I asked for, but simply seems to repeat the same information as before. If you know their shape and scale (as suggested by 'determine' rather than 'estimate') then there's no need to test at all. If you have the population values, you just compare them; if they differ, they differ. Why perform inference relating to population values if you have them already? – Glen_b Apr 06 '14 at 17:20
  • If the means are slightly different but their SDs are huge would that not mean the means are different but not significantly so? – OtagoHarbour Apr 06 '14 at 20:37
  • If you're talking about *population* values rather than estimates, what does 'significant' mean? It can't be in the statistical sense because parameters aren't random variables. On the other hand, if you do mean you have estimates, and not parameters ... then why are you estimating gamma parameters the way you are (from moments) rather than fitting a glm, say? – Glen_b Apr 06 '14 at 22:20
  • This is the puzzle that leads me to again repeat the request for the kind of clarification I asked about before (what are your variables, and what underlying problems are you trying to solve -- i.e. what is your data and what are you trying to ask of it) – Glen_b Apr 06 '14 at 22:26
  • My data is two different sized vectors of real numbers. (For priority reasons, I cannot give their source.) I fit gamma distributions in R using gr=fitdistr(data+0.00001,"gamma") and then get the mean with gr[1]$estimate['shape']/gr[1]$estimate['rate'] and the SD with sqrt(gr[1]$estimate['shape']/(gr[1]$estimate['rate']^2)). So I should have said they were estimates. I want to see if they are significantly different and their p-values. Thanks, – OtagoHarbour Apr 07 '14 at 12:08
  • That helps *a lot*. The 0.00001 is because you have exact 0's? Is there a spike of values at 0 or is this just rounding? Note that `fitdistr` is already giving you parameter estimates, that's where you'll want to start. A test based on the likelihood ratio may require also combining the two data sets and fitting that. Alternatively, with large samples you might base a test off treating the estimates as asymptotically bivariate normal, but this would require having an estimate of the covariance of the two parameter estimates for each fit, not just their standard error. – Glen_b Apr 07 '14 at 22:15
  • Note that what we're discussing reads very differently from what you seemed to be asking originally. Analysis of variance is a comparison of *means*, not of parameter vectors, and my original post discusses how to do that. If your question is actually about comparing means, this discussion is a sidetrack. If it's about comparing distributional fits (both parameters simultaneously), then your question should reflect that. – Glen_b Apr 07 '14 at 22:19
  • Thank you very much for your help. It is comparison of mean estimates that I am interested in. Thanks again, – OtagoHarbour Apr 08 '14 at 14:17
  • In that case, I'll add some discussion to my answer to cover some points I didn't before. – Glen_b Apr 08 '14 at 21:13
  • I meant to say that the 0.00001 was because I had exact zeros although there was not a spike at zero. Thanks, – OtagoHarbour Apr 09 '14 at 12:38
  • Exact zeroes would preclude both the gamma and the Weibull if they're not just due to rounding or being below detection limits (though in either case those might be better handled explicitly) – Glen_b Apr 21 '14 at 17:14
  • If the mode of the histogram is at exactly zero, would an exponential function be the best function to fit (assuming the histogram falls off exponentially)? Thanks, – OtagoHarbour Apr 28 '14 at 17:28
  • There are an infinite number of continuous functions with a mode at 0. Hell, there's an infinite number of different *gamma* distributions with a mode at zero. The exponential is the least skew of those. If, as you say, the density actually falls off exponentially, that's the definition of an exponential! But [beware judging density from a single histogram](http://stats.stackexchange.com/questions/51718/assessing-approximate-distribution-of-data-based-on-a-histogram). Appearances can be misleading! – Glen_b Apr 28 '14 at 23:14
  • ... (ctd) If you're reasonably satisfied that it's approximately exponential, an exponential may be a reasonable choice. What is best would depend on as yet unstated criteria. – Glen_b Apr 28 '14 at 23:19
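
For the two-sample situation described in the comments (two vectors of very different sizes, interest in the means), the GLM approach from the answer could be sketched roughly as follows. The data here are simulated stand-ins, since the real data were not shared; only the sample sizes 34 and 2465 come from the thread, and the shape and rate values are invented. It also carries the equal-shape assumption mentioned in the answer.

```r
set.seed(42)  # arbitrary seed

# Hypothetical stand-ins for the two real samples (parameters are made up).
small <- rgamma(34,   shape = 2, rate = 0.5)
large <- rgamma(2465, shape = 2, rate = 0.4)

# Stack the two samples with a grouping factor.
d <- data.frame(
  value = c(small, large),
  group = factor(rep(c("small", "large"), times = c(34, 2465)))
)

# Gamma GLM with a log link; the group term tests for a difference in means.
fit <- glm(value ~ group, family = Gamma(link = "log"), data = d)
tab <- anova(fit, test = "F")
p_value <- tab[["Pr(>F)"]][2]
print(tab)
```

Unequal sample sizes are not a problem for the model itself; the smaller group simply contributes less information about its mean.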