Fitting a model to a variable with many zeros and few but large values in right tail

Question

I would like to fit a model to a dependent variable distributed like the one below (see picture).

The variable is a count of people (with specific characteristics) in various districts. This means there are no negative numbers; in the great majority of districts the variable is 0, but there are also very large values (up to 80,000) that occur with very low frequency.

Following Moti Nisenson's advice, I have added some graphs to this post that make the distribution clearer. If I drop all zeros, the graph looks the same because there are a lot of 1's, 2's, etc.

If I drop all < 100, it looks like this:

If I drop all < 1000, it looks like this:

If I drop all < 5000 it looks like this:

My goal is to find a regression that does well in predicting the zeros and, more importantly, the extreme values in the right tail of the distribution.

I understand that Ordinary Least Squares is not ideal here. I have looked into Poisson regression, which seems considerably more suitable for my purposes.

Is there a regression model that is even more appropriate? What other options might be helpful?

Additional edit: These are the summary stats. The variance is much (much) higher than the mean, which according to this source is a sign that Poisson is not appropriate.
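As a rough illustration of that check (under a Poisson model the variance equals the mean), here is a minimal Python sketch that computes the variance-to-mean ratio and the share of zeros. The helper name `dispersion_summary` and the toy data are hypothetical and made up for illustration; they are not the actual district counts.

```python
import statistics

def dispersion_summary(counts):
    """Return mean, sample variance, variance-to-mean ratio and share of zeros."""
    mean = statistics.fmean(counts)
    var = statistics.variance(counts)  # sample variance (n - 1 denominator)
    zero_share = sum(1 for c in counts if c == 0) / len(counts)
    return {"mean": mean, "variance": var, "ratio": var / mean, "zero_share": zero_share}

# Made-up toy data (an assumption, not the real data):
# mostly zeros, a few small counts, and one extreme district.
toy_counts = [0] * 90 + [1, 2, 3, 5, 8, 40, 150, 900, 5000, 80000]

summary = dispersion_summary(toy_counts)
print(summary)
```

A ratio far above 1 indicates overdispersion relative to Poisson, and a large zero share points towards models that treat the zeros separately.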

Additional edit 2: Here is the distribution of the variable in log as requested.

Can you look at a log-version of the above graph or can you just show the graph without the zeros? You can't see what goes on with the tail with the current image. – MotiNK, May 22 '17 at 10:41
Based on your updated histograms I don't think my answer is useful. Also, I assumed that you want to develop a model for prediction, but you want to model the distribution itself. Can you please also add the log version of the histogram as @MotiNisenson suggests? – Hooman, May 22 '17 at 12:11
I have added the tag [zero-inflation]. You may find some of the Q&A there helpful. – mdewey, May 22 '17 at 12:15
@Hooman I included the log version, although in this case it seems not to be particularly helpful. If there is any other information I should include, please let me know. And thank you, mdewey, for the tip. I will look into it right away. – eigenvector, May 22 '17 at 12:21
This Q&A https://stats.stackexchange.com/questions/279273/zero-inflated-distributions-what-are-they-really looks as though it would provide a useful starting point. – mdewey, May 22 '17 at 14:14

Hooman · Answer 1 · 2017-05-22T10:49:29.167

6

If the number of observations where Count $\neq$ 0 is small, you can simply handle this as a classification problem. Otherwise, you can first separate the data into two groups based on the target variable:

1. Count == 0

2. Count $\neq$ 0

You can use a classification method (for example, logistic regression) to predict which of the two groups an observation belongs to. Then, within the group where Count $\neq$ 0, you can fit a regression model.
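A minimal sketch of this two-step idea, using only the Python standard library and a single predictor. Everything specific here is an assumption for illustration: the simulated data, the helper names `fit_logistic`, `fit_line` and `predict`, and the choice of a linear regression on log counts for the second step (the text above only says "a regression model"). In practice a statistics library would replace the hand-written fitting routines.

```python
import math
import random

def fit_logistic(xs, zs, lr=0.5, steps=2000):
    """Fit P(z = 1 | x) = sigmoid(a + b * x) by plain gradient ascent."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, z in zip(xs, zs):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += z - p
            grad_b += (z - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def fit_line(xs, ys):
    """Ordinary least squares for y = c + d * x (single predictor)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    d = sxy / sxx
    return my - d * mx, d

# Simulated data (an assumption): many zeros, skewed positive counts.
random.seed(0)
xs = [random.uniform(-2, 2) for _ in range(400)]
ys = []
for x in xs:
    nonzero = random.random() < 1 / (1 + math.exp(-(-1.0 + 1.5 * x)))
    count = max(1, round(math.exp(2.0 + 1.0 * x + random.gauss(0, 0.5))))
    ys.append(count if nonzero else 0)

# Step 1: classify Count == 0 versus Count != 0.
a, b = fit_logistic(xs, [1 if y > 0 else 0 for y in ys])

# Step 2: regress log(Count) on x using only the non-zero observations.
pos = [(x, math.log(y)) for x, y in zip(xs, ys) if y > 0]
c, d = fit_line([x for x, _ in pos], [ly for _, ly in pos])

def predict(x):
    """P(Count != 0 | x) times the back-transformed step-2 prediction.

    exp(c + d * x) is the geometric-mean level of the non-zero counts,
    so it tends to understate their arithmetic mean.
    """
    p_nonzero = 1 / (1 + math.exp(-(a + b * x)))
    return p_nonzero * math.exp(c + d * x)

print(predict(-2.0), predict(0.0), predict(2.0))
```

Keeping the two steps separate means the many zeros influence only the classifier, while the regression is estimated from the non-zero counts alone.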

Why this would help:

Balancing the data: If you fit a regression model with least squares, it will be heavily biased towards 0, as most of your data is located at Count == 0. When you separate your data into two groups, all the data with Count $\neq$ 0 are put into one bin, and they will have more weight against Count == 0.
Based on the distribution, Count == 0 seems to be a case completely distinct from the other counts. Hence it can help to treat it differently by first separating out the data where Count == 0.

edited May 22 '17 at 10:49

answered May 22 '17 at 10:36


Good idea to separate the data into two prediction problems. I guess this would work very well if the data != 0 was normally distributed. You can see from the edited graphs, though, that there are a lot of small values (1-100) other than 0. So I guess, the approach would not help here. Please correct me if I am wrong. – eigenvector May 22 '17 at 12:11
I think you are right. Based on your initial histogram I assumed that data!=0 is uniformly distributed. – Hooman May 22 '17 at 12:16
Yes, not your mistake. I was a bit late with the additional information. And thanks anyway. – eigenvector May 22 '17 at 12:16
