11

I would like to fit a model to a dependent variable distributed like the one below (see picture).

[Figure: histogram of the dependent variable]

The variable is a count of people (with specific characteristics) in various districts. This means that there are no negative numbers; in the great majority of districts the variable is 0, but there are also very large values (up to 80,000) with very low frequency.

Following Moti Nisenson's advice, I have added some graphs to this post that make the distribution clearer. If I drop all zeros, the graph looks the same because there are a lot of 1s, 2s, etc.

If I drop all < 100, it looks like this:

[Figure: histogram after dropping values below 100]

If I drop all < 1000, it looks like this:

[Figure: histogram after dropping values below 1000]

If I drop all < 5000 it looks like this:

[Figure: histogram after dropping values below 5000]

My goal is to find a regression model that does well at predicting the zeros and, more importantly, the extreme values in the right tail of the distribution.

I understand that ordinary least squares is not ideal here. I have looked into Poisson regression, which seems much better suited to my purposes.

Is there a regression model that is even more appropriate? What other options might be helpful?

Additional edit: These are the summary stats. The variance is much (much) higher than the mean, which according to this source is a sign that Poisson is not appropriate.

[Figure: summary statistics]
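The variance-versus-mean comparison mentioned here can be checked directly. The following is a minimal sketch: the helper name `dispersion_index` and the toy counts are made up for illustration and are not the data from the question. A Poisson variable has variance equal to its mean, so a ratio far above 1 indicates overdispersion.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return sample variance divided by sample mean.

    For Poisson-distributed data this ratio is close to 1;
    values far above 1 indicate overdispersion.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero; the ratio is undefined")
    return variance(counts) / m


# Hypothetical counts: mostly zeros, a few small values, one extreme value.
toy_counts = [0, 0, 0, 0, 0, 0, 1, 2, 3, 40, 900, 80000]
print("variance / mean =", dispersion_index(toy_counts))
```

With data shaped like the histograms above (many zeros plus a long right tail), the ratio would be expected to be very large.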

Additional edit 2: Here is the distribution of the variable in log as requested.

[Figure: histogram of the log-transformed variable]

eigenvector
  • 2
    Can you look at a log-version of the above graph or can you just show the graph without the zeros? You can't see what goes on with the tail with the current image. – MotiNK May 22 '17 at 10:41
  • 1
    @Moti Nisenson: Good point. Now included. – eigenvector May 22 '17 at 10:56
  • 1
    Based on your updated histograms, I don't think my answer is useful. I also assumed that you want to develop a model for prediction, but you want to model the distribution itself. Can you please also add the log version of the histogram, as @MotiNisenson suggests? – Hooman May 22 '17 at 12:11
  • 2
    I have added the tag [zero-inflation]. You may find some of the Q&A there helpful. – mdewey May 22 '17 at 12:15
  • @Hooman I included the log version, although in this case it does not seem particularly helpful. If there is any other information I should include, please let me know. And thank you, mdewey, for the tip. I will look into it right away. – eigenvector May 22 '17 at 12:21
  • This Q&A https://stats.stackexchange.com/questions/279273/zero-inflated-distributions-what-are-they-really looks as though it would provide a useful starting point. – mdewey May 22 '17 at 14:14

1 Answer

6

If the number of observations where Count $\neq$ 0 is small, you can just handle this as a classification problem. Otherwise, you can first separate the data into two groups based on the target variable:

1. Count == 0

2. Count $\neq$ 0

You can use a classification method (for example, logistic regression) to predict which of the two groups an observation falls into. Then, within the group where Count $\neq$ 0, you can fit a regression model.

Why this would help:

  • Balancing the data: if you fit a regression model with least squares, it will be heavily biased towards 0, because most of your data sits at Count == 0. When you separate your data into two groups, all the observations with Count $\neq$ 0 are put into one bin and carry more weight relative to Count == 0.

  • Based on the distribution, Count == 0 seems to be a case completely distinct from the other counts. Hence it can help to treat it differently by first separating out the data where Count == 0.
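The two-step idea described in this answer can be sketched as follows. This is only an illustration under stated assumptions, not a definitive implementation: the single predictor `x`, the simulated data, and the function names (`fit_logistic`, `fit_line`, `fit_two_part`, `predict`) are all hypothetical. The answer only says "a regression model" for the non-zero group; least squares on the log of the positive counts is one possible choice, assumed here.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=1000):
    """Step 1: fit P(y = 1 | x) = sigmoid(a + b*x) by gradient ascent
    on the mean log-likelihood."""
    a = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(a + b * x)
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def fit_line(xs, ys):
    """Step 2 helper: ordinary least squares for y = c + d*x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    d = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - d * mx, d


def fit_two_part(xs, counts):
    """Classify zero vs. non-zero, then regress log(count) on x
    using only the non-zero observations."""
    nonzero = [1 if c > 0 else 0 for c in counts]
    a, b = fit_logistic(xs, nonzero)
    pos_x = [x for x, c in zip(xs, counts) if c > 0]
    pos_log = [math.log(c) for c in counts if c > 0]
    c0, d = fit_line(pos_x, pos_log)
    return a, b, c0, d


def predict(params, x):
    """Return (P(count > 0), typical size of a non-zero count).
    exp() of the log-scale fit is a geometric-mean-type value,
    not the arithmetic mean."""
    a, b, c0, d = params
    return sigmoid(a + b * x), math.exp(c0 + d * x)


# Simulated data (assumption, for illustration only).
rng = random.Random(0)
xs, counts = [], []
for _ in range(500):
    x = rng.uniform(-2.0, 2.0)
    if rng.random() < sigmoid(-1.0 + 1.5 * x):
        c = max(1, round(math.exp(3.0 + 1.2 * x + rng.gauss(0.0, 0.5))))
    else:
        c = 0
    xs.append(x)
    counts.append(c)

params = fit_two_part(xs, counts)
p_nonzero, typical_size = predict(params, 1.5)
print("P(count > 0):", p_nonzero, "typical non-zero size:", typical_size)
```

Keeping the two parts separate means the zeros cannot pull the size regression towards 0, which is the balancing argument made in the first bullet.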

Hooman
  • Good idea to separate the data into two prediction problems. I guess this would work very well if the data != 0 was normally distributed. You can see from the edited graphs, though, that there are a lot of small values (1-100) other than 0. So I guess, the approach would not help here. Please correct me if I am wrong. – eigenvector May 22 '17 at 12:11
  • I think you are right. Based on your initial histogram I assumed that data!=0 is uniformly distributed. – Hooman May 22 '17 at 12:16
  • Yes, not your mistake. I was a bit late with the additional information. And thanks anyway. – eigenvector May 22 '17 at 12:16