
My question relates to fitting custom distributions in R, but I feel it has enough of a probability element to remain on CV.

I have an interesting set of data which has the following characteristics:

  • Large mass at zero
  • Sizeable mass below a threshold that fits a right-skewed parametric distribution very well
  • Small amount of mass at extreme values
  • A number of covariates that should drive the variable of interest

I was hoping to model this using a zero-inflated distribution approach, which is widely explored in the literature. Essentially, the density is:

$$f_{Y}(y)=\begin{cases} \pi, & y=0 \\ (1-\pi)\,f_X(y), & y>0 \end{cases}$$

This is easy enough to fit as is. However, I would like the mixing parameter $\pi$ to be dependent on the covariates $Z$ via a logistic regression:

$$\operatorname{logit}(\pi)=\beta^\top Z,$$ where $\beta$ is a vector of coefficients for the covariates. Furthermore, because of the extreme-tail nature of my data, my distribution $f_{X}(y)$ fits best with an extreme-value approach:

$$f_{X}(y)=\begin{cases} f_{A}(y;a,b), & y\leq \mu \\ \left(1-F_{A}(\mu)\right)\cdot\text{GPD}\!\left(y;\,\mu,\ \sigma=\dfrac{1-F_{A}(\mu)}{f_{A}(\mu)},\ \xi\right), & y>\mu \end{cases}$$ where $\text{GPD}(y;\mu,\sigma,\xi)$ denotes the generalized Pareto distribution modelling the excess above the threshold $\mu$, $F_A$ is the CDF corresponding to $f_A$, and $f_{A}(y;a,b)$ is a given right-skewed distribution with scale and shape parameters $a$ and $b$, respectively. This characterization ensures that the density is continuous at $y=\mu$ (though not differentiable there) and that $f_{X}(y)$ integrates to 1.
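For concreteness, here is a minimal R sketch of this spliced density, taking $f_A$ to be Weibull for illustration and using `evd::dgpd` for the GPD piece (one of several possible implementations); it checks continuity at $\mu$ and that the density integrates to 1:

library(evd)  # dgpd(): one choice of GPD density implementation

# Spliced density: Weibull body below mu, GPD tail above, with the GPD
# scale tied to the body so the density is continuous at mu
f_X <- function(y, a, b, mu, xi) {
  f_mu  <- dweibull(mu, shape = b, scale = a)                      # f_A(mu)
  S_mu  <- pweibull(mu, shape = b, scale = a, lower.tail = FALSE)  # 1 - F_A(mu)
  sigma <- S_mu / f_mu                                             # continuity constraint
  ifelse(y <= mu,
         dweibull(y, shape = b, scale = a),
         S_mu * dgpd(y, loc = mu, scale = sigma, shape = xi))
}

# Sanity checks: continuity at the threshold, and total mass of 1
f_X(2 - 1e-9, a = 1, b = 1.5, mu = 2, xi = 0.2)
f_X(2 + 1e-9, a = 1, b = 1.5, mu = 2, xi = 0.2)            # should match the line above
integrate(f_X, 0, Inf, a = 1, b = 1.5, mu = 2, xi = 0.2)   # should be ~1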

In addition, I would ideally want the parameters of the above distributions to also depend on covariates:

$$f_{A}(y;a,b,\beta Z)$$ $$\text{GPD}\bigg(y;\mu,\sigma=\frac{(1-F_{A}(\mu))}{f_{A}(\mu)},\xi,\beta Z\bigg)$$

I realize that the above setup is quite complex, but I was wondering whether there is a way to derive the maximum likelihood estimates of each of the desired parameters by maximizing the likelihood function, i.e. to obtain:

$$\hat{\xi}, \hat{a}, \hat{b}, \hat{\beta}$$

Is there a feasible/ideal way to go about this in R, both for my specific problem and for fitting custom distributions more generally?

epp
  • The way you construct $f_X(y)$ by "cut and paste" means that $f_X(y)$ does not integrate to one. You need to reformulate your model somehow to fix this; otherwise any estimates you might obtain will be meaningless. You also need to think about which parameters should depend on the covariates and which should remain constant, perhaps after reparameterizing in terms of means, variances, skews, etc. of the different model parts. Apart from this, with good starting values you may be able to fit something like this using `optim` in R if you have enough data. – Jarle Tufto Aug 25 '17 at 13:53
  • I agree with @JarleTufto: the $f_A(y)$ density, as specified in the post, is not guaranteed to sum/integrate to 1. The only way to ensure that happens is to choose $\mu$ accordingly, and I don't think that is what the OP intended. – Lucas Roberts Aug 25 '17 at 17:32
  • @StatsPlease -> do you want the distribution $f_A(y)$ to be continuous? The question doesn't specify, but I assume so because the GPD is continuous. Please confirm. – Lucas Roberts Aug 25 '17 at 17:33
  • @LucasRoberts Yes $f_A(y)$ would be continuous. – epp Aug 25 '17 at 23:35
  • Is there a particular reason that you want a maximum-likelihood estimate rather than a Bayesian solution? – Jacob Socolar Aug 27 '17 at 22:05
  • @JacobSocolar MLE was just my first attempt at the problem. I have no problem with other approaches if they're appropriate. – epp Aug 27 '17 at 22:46
  • @JarleTufto I believe I have remedied the issue with the density. – epp Sep 01 '17 at 13:08
  • @StatsPlease -> do you want an algorithm that works for ANY generic $f_A(y)$ or one that works for a specific distributional family, say, for example, Gamma$(a, b)$? – Lucas Roberts Sep 01 '17 at 13:13
  • @LucasRoberts Depends on what is achievable. My first attempt would have been a Weibull or Pareto for $f_A(y)$ as these seem to fit well. – epp Sep 01 '17 at 13:16
  • @StatsPlease -> No continuous distribution without a point mass at 0 will work with your zero-inflation specification. The reason is that you will introduce a latent variable for the zero-inflation, and some of those latent variables will need to be 0 and others 1: a 0 indicates an observed zero from the underlying distribution, and a 1 indicates a structural zero. The distributions discussed in the comments (Gamma, Weibull, Pareto) all have $\Pr(Y=0)=0$, so you will have a perfect-separation problem in the logit model fitting that you've asked for in the post. – Lucas Roberts Sep 02 '17 at 12:59
  • @LucasRoberts Thanks for taking the time to look at this. I'm not sure I follow what you're saying? I've clarified how I want to model the mixing parameter a little more, does this help? – epp Sep 02 '17 at 13:26
  • @StatsPlease -> typically you'd introduce a latent variable (call it $W$) where $W=1$ denotes the $0$ point mass (in the post, $\pi = \Pr(W=1)$). The overall probability of observing a zero would then be $\pi + (1-\pi)\Pr(Y=0)$, and in your setting you would write $\left(\frac{\pi}{\pi + (1-\pi)\Pr(Y=0)}\right)^W\left(\frac{(1-\pi)\Pr(Y=0)}{\pi + (1-\pi)\Pr(Y=0)}\right)^{1-W}$ as the Bernoulli portion of your complete-data likelihood. This is useful because you can then use the logit to get a formula for this part of the likelihood. – Lucas Roberts Sep 03 '17 at 00:20
  • @StatsPlease (cont from last comment) However, $\Pr(Y=0)=0$ in the distributions you specify in the comments, so you cannot get a likelihood for this if you use this framework. In order to have this work you need a non-zero $\Pr(Y=0)$. – Lucas Roberts Sep 03 '17 at 00:22
  • @StatsPlease -> thinking about this a bit more: you could get at what you want via a Tweedie model (https://en.wikipedia.org/wiki/Tweedie_distribution), which gives you the point mass at 0 as well as a continuous distribution on the positive values; you could then take the tail-mixture approach you want based on a GPD mixture. You could also add a zero-inflation component on top of the Tweedie, but that may not be necessary once you've used the Tweedie. – Lucas Roberts Sep 20 '17 at 23:21
  • Can you post (a link to (part of)) your data, or a mockup, so we can experiment? – kjetil b halvorsen Dec 20 '18 at 13:04
  • I think the most common approach would be to do something along the lines of a Tobit model. Of course this can generalize further to have categorical outcomes above zero as well. The main idea is just to specify a continuous latent variable, which you can model with any distribution you like, leading to easy maximum likelihood estimates; the observations are a function of this latent variable rather than the variable itself. – user1587692 Mar 30 '19 at 09:49
  • Notation is unclear here: what is $\beta$? A matrix of coefficients that feeds into all the parameters at once $(\pi, a, b, \xi)$? And what is $F_A$? Please clarify. – Firebug Apr 02 '21 at 22:47
  • If you're looking to have a model with predictors (a GLM-like model), then the marginal distribution of the response is not what you're fitting; it will be a mixture which depends on the distribution in x-space and the conditional distribution of the response. The relevant choice for your model will be the conditional distribution of the response (so the 0-inflation will be important, but the remaining features you discuss might not be relevant). – Glen_b Aug 10 '21 at 04:16

2 Answers


This answer assumes $\mu$ is known.

One very flexible way to get MLEs in R is to use Stan via rstan. Stan has a reputation for being an MCMC tool, but it can also estimate parameters by variational inference or MAP, and you're free not to specify any priors, in which case the MAP estimate is the MLE.

In this case, what you're doing is very similar to their hurdle-model example. Here is the Stan code for that example.

data {
  int<lower=0> N;        // number of observations
  int<lower=0> y[N];     // non-negative counts
}
parameters {
  real<lower=0, upper=1> theta;  // probability of a structural zero
  real<lower=0> lambda;          // Poisson rate for the non-zero part
}
model {
  for (n in 1:N) {
    if (y[n] == 0)
      target += bernoulli_lpmf(1 | theta);       // zero: from the point mass
    else
      target += bernoulli_lpmf(0 | theta)        // non-zero: not a structural zero,
                  + poisson_lpmf(y[n] | lambda); // then the count model
  }
}

To adapt this for your own use, you could:

  • Replace poisson_lpmf with the log-density for your $f_A$.
  • Add a third case to the if-else so that it checks for exceeding $\mu$, not just 0. As the meat of that third case, use the log density of your extreme-value distribution of choice (a sketch follows this list).
  • Replace bernoulli_lpmf with categorical_lpmf and make the mixture probability parameter into a vector.
  • To incorporate covariates, you can add regression parameters, and make all your other parameters functions of them. It may help to use categorical_logit_lpmf in place of categorical_lpmf.
  • Truncate one mixture component at $\mu$ from above and the other at $\mu$ from below, depending on your perspective on the dilemma raised by Jarle Tufto in the comments. It seems like you could get VERY different estimates depending on how exactly you decide to handle this. A nice sanity check: generate a fake dataset from the fitted parameters and make sure it has the right amount at 0, amount above $\mu$, etc.
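Putting those pieces together, here is a minimal sketch (not a drop-in solution) of what the adapted program might look like, assuming a Weibull body with scale $a$ and shape $b$, a known threshold $\mu$, covariates entering only the mixing probability, and a hand-coded GPD log density (Stan has no built-in generalized Pareto); all names are illustrative:

functions {
  // Log density of the generalized Pareto distribution for y > mu
  real gpd_lpdf(real y, real mu, real sigma, real xi) {
    real z = (y - mu) / sigma;
    if (fabs(xi) > 1e-12)
      return -log(sigma) - (1 / xi + 1) * log1p(xi * z);
    else
      return -log(sigma) - z;  // exponential limit as xi -> 0
  }
}
data {
  int<lower=0> N;
  vector<lower=0>[N] y;
  int<lower=1> K;           // number of covariates
  matrix[N, K] Z;           // covariate matrix
  real<lower=0> mu;         // known tail threshold
}
parameters {
  vector[K] beta;           // logit-scale coefficients for pi
  real<lower=0> a;          // Weibull scale
  real<lower=0> b;          // Weibull shape
  real xi;                  // GPD shape (constrain to > 0 for heavy tails)
}
model {
  real lF_mu = weibull_lcdf(mu | b, a);       // log F_A(mu)
  // Continuity constraint: sigma = (1 - F_A(mu)) / f_A(mu)
  real sigma = exp(log1m_exp(lF_mu) - weibull_lpdf(mu | b, a));
  vector[N] logit_pi = Z * beta;
  for (n in 1:N) {
    if (y[n] == 0)
      target += log_inv_logit(logit_pi[n]);            // point mass at zero
    else if (y[n] <= mu)
      target += log1m_inv_logit(logit_pi[n])
                  + weibull_lpdf(y[n] | b, a);         // body
    else
      target += log1m_inv_logit(logit_pi[n])
                  + log1m_exp(lF_mu)                   // times (1 - F_A(mu))
                  + gpd_lpdf(y[n] | mu, sigma, xi);    // tail
  }
}

Covariate dependence for $a$, $b$, or $\xi$ would follow the same pattern: replace the scalar parameter with a coefficient vector and apply a link that respects the parameter's support (e.g. exp for positive parameters).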

Once you have a file with the right Stan code, you can use Stan with lots of different toolchains. To use it with R, check out these examples. I simplified one to get an MLE, using rstan::optimizing instead of sampling:

install.packages("rstan")
library("rstan")
model = stan_model("Example1.stan")        # compile the Stan program
fit = optimizing(model, data = stan_data)  # point estimation instead of MCMC;
                                           # stan_data: named list matching the
                                           # model's data block (illustrative name)
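`optimizing` returns a named list: the point estimates are in `fit$par` and the attained log density in `fit$value`.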

There are also some tricks for faster/better optimization that could help in practice.

eric_kernfeld

My Stan answer above is so complex that it's just begging for something to go wrong. Here's a simpler way: do all of your inference conditional on the (known) facts of whether each datum exceeds 0 and whether each datum exceeds $\mu$. In other words, reduce the data to:

  • The set of observations $S_1 \equiv \{y_i : 0 < y_i \leq \mu\}$.
  • The set of observations $S_2 \equiv \{y_i : y_i > \mu\}$.
  • The number of zeroes $N_0$.
  • Let $N_1 \equiv |S_1|$ and $N_2 \equiv |S_2|$.

Then maximize, separately (a minimal R sketch follows this list):

  • $(1-F_A(\mu))^{N_2} \prod_{y \in S_1} f_A(y)$ w.r.t. the parameters of $f_A$ (each of the $N_2$ exceedances contributes a factor of $1-F_A(\mu)$).
  • $\prod_{y \in S_2} \text{GPD}(y)$ w.r.t. the parameters of your generalized Pareto distribution (GPD).
  • $\pi^{N_0}(1-\pi)^{N_1 + N_2}$ w.r.t. $\pi$ (with covariates, this becomes a logistic regression of the zero indicator on $Z$).
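A minimal R sketch of these three separate maximizations, again assuming a Weibull body and using `evd::dgpd` for the GPD density; object names and starting values are illustrative:

library(evd)  # dgpd(): one choice of GPD density implementation

split_fit <- function(y, mu) {
  s1 <- y[y > 0 & y <= mu]   # body observations
  s2 <- y[y > mu]            # exceedances
  n0 <- sum(y == 0); n2 <- length(s2)

  # 1. Body: sum of Weibull log densities over S1 plus N2 * log(1 - F_A(mu));
  #    log-parameterized so optim can search unconstrained
  nll_body <- function(p) {
    a <- exp(p[1]); b <- exp(p[2])
    -(sum(dweibull(s1, shape = b, scale = a, log = TRUE)) +
        n2 * pweibull(mu, shape = b, scale = a,
                      lower.tail = FALSE, log.p = TRUE))
  }
  body <- optim(c(0, 0), nll_body)

  # 2. Tail: GPD log likelihood for the exceedances
  nll_tail <- function(p) {
    -sum(dgpd(s2, loc = mu, scale = exp(p[1]), shape = p[2], log = TRUE))
  }
  tail <- optim(c(0, 0.1), nll_tail)

  # 3. Mixing probability: closed form
  #    (or glm(I(y == 0) ~ Z, family = binomial) with covariates)
  pi_hat <- n0 / length(y)

  list(body = exp(body$par),
       tail = c(scale = exp(tail$par[1]), shape = tail$par[2]),
       pi = pi_hat)
}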

This doesn't seem to do quite what you ask because:

  • $\mu$ must be user-specified.
  • the GPD scale parameter is not fixed to $\frac{1-F_A(\mu)}{f_A(\mu)}$. Hopefully that part is not essential for interpretability. If it is, maybe it would be good enough to fix $\sigma$ based on the results of bullet point 1 and then optimize the remaining parameters. It's no longer a joint MLE then; if you are really dead set on the joint constrained fit, there's no way to decouple the optimization.
eric_kernfeld