
I have a dataset of continuous values that is about 30-50% zeros and has a large range (10^3 to 10^10). I believe these zeros are not missing data but a result of the sensitivity of the machine taking the measurements. I would like to log10-transform this data so I can look at the distribution, but I'm not sure how to handle the zeros.

I've done a lot of searching and found the following:

  1. Add a small constant such as 0.5 to the data and then log-transform
  2. Something called a Box-Cox transformation

I looked up the Box-Cox transformation, but I only found it discussed in the context of building a regression model. I just want to visualize the data and see how it is distributed.

Currently, when I plot a histogram of the data, it looks like this:

[image: histogram of the raw data]

When I add a small constant (0.5) and log10-transform, it looks like this:

[image: histogram of log10(data + 0.5)]

Is there a better way to visualize the distribution of this data? I'm just trying to get a handle on what the data looks like in order to figure out what kinds of tests are appropriate for it.

Brett Phinney
    Some closely related threads provide several good answers to all your questions: http://stats.stackexchange.com/questions/30728, http://stats.stackexchange.com/questions/1601, http://stats.stackexchange.com/questions/24227, and http://stats.stackexchange.com/questions/41361. More can be found by searching for [log transform](http://stats.stackexchange.com/search?tab=relevance&q=%20log%20transform). If you wish to follow up, just edit your question to focus on any remaining issues that need to be addressed. – whuber Jan 30 '14 at 21:09

2 Answers


Do you know what the sensitivity of the machine is? If it cannot reliably record any values less than 100 (and therefore reports them as 0), then all your zeros are really values between 0 (or negative infinity) and 100. Adding 0.5 would underestimate them; 50 would be a more reasonable value, or possibly 100. It would make the most sense to choose the added value based on the machine precision (and maybe add it only to the zeros, not to all the values).
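As a rough illustration of that idea, here is a minimal sketch that replaces only the zeros with half of a detection limit before taking log10. The function name `log10_with_lod`, the sample values, and the limit of 100 are all made up for the example; a real limit would have to come from knowledge of the instrument.

```python
import numpy as np

def log10_with_lod(values, lod):
    """Replace zeros with lod / 2, then take log10.

    `lod` is an assumed, known detection limit (hypothetical here).
    Only the zeros are changed; positive values are left untouched.
    """
    values = np.asarray(values, dtype=float)
    filled = np.where(values == 0, lod / 2.0, values)
    return np.log10(filled)

# Made-up measurements: zeros stand for "below the detection limit".
data = np.array([0.0, 0.0, 1.0e3, 5.0e4, 2.0e7, 1.0e10])
logged = log10_with_lod(data, lod=100.0)
print(logged)
```

With a limit of 100 the zeros land at log10(50), well below the smallest real value but far closer to it than log10(0.5) would be.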

There are also ways to estimate the added value that gives the "best" normal approximation to the data (I think there was some of this in the original Box-Cox paper). Alternatively, a logspline fit can be used to estimate the distribution, with your zeros treated as interval-censored values.
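To make the censoring idea concrete, here is a sketch that is not a logspline fit but a simpler parametric stand-in: it assumes the log10 values are normally distributed, assumes a single known detection limit, and treats every zero as "somewhere below that limit" in the likelihood. The function name `censored_lognormal_fit`, the grid search, and the simulated data are all choices made for this example.

```python
import math
import numpy as np

def censored_lognormal_fit(values, lod, mu_grid, sigma_grid):
    """Grid-search maximum likelihood for (mu, sigma) on the log10 scale.

    Zeros are treated as left-censored at `lod` (assumed known);
    positive values contribute the normal density of log10(value).
    """
    values = np.asarray(values, dtype=float)
    y = np.log10(values[values > 0])
    n_cens = int(np.sum(values == 0))
    log_lod = math.log10(lod)

    best_ll, best_mu, best_sigma = -math.inf, None, None
    for mu in mu_grid:
        for sigma in sigma_grid:
            # Probability that a measurement falls below the detection limit
            z = (log_lod - mu) / sigma
            p_cens = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
            if n_cens > 0 and p_cens <= 0.0:
                continue
            # Log-likelihood of the observed values (constants dropped)
            ll = -0.5 * np.sum(((y - mu) / sigma) ** 2) - y.size * math.log(sigma)
            # Each censored zero contributes log P(value < lod)
            if n_cens > 0:
                ll += n_cens * math.log(p_cens)
            if ll > best_ll:
                best_ll, best_mu, best_sigma = ll, mu, sigma
    return best_mu, best_sigma

# Simulated data: log10 values ~ Normal(3, 1); anything below lod is reported as 0.
rng = np.random.default_rng(0)
lod = 10 ** 2.5
true_values = 10 ** rng.normal(3.0, 1.0, size=2000)
observed = np.where(true_values < lod, 0.0, true_values)

mu_hat, sigma_hat = censored_lognormal_fit(
    observed, lod, np.arange(1.0, 5.0, 0.05), np.arange(0.3, 2.0, 0.05)
)
print(mu_hat, sigma_hat)
```

The fitted values should land near the simulated mean and standard deviation even though roughly a third of the sample was recorded as zero. If the detection limit differs from row to row, or is unknown, this sketch does not apply as written.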

Greg Snow
    Thanks for the info. Unfortunately the sensitivity depends on what is being measured, and the machine measures thousands of different things during the analysis. So essentially each row has a different LOD, which is unknown. To make matters worse, I'm not even sure all the zeros are really below the limit of detection. There is a chance they are really missing values, because the machine does not sample fast enough to catch everything. – Brett Phinney Jan 30 '14 at 23:29

In your case, I would treat the zeros separately from the other data points. You can work out a model for the nonzero elements. Adding a small value $\epsilon$ at least works for data visualization purposes. By the way, there was a very similar discussion here:

How should I transform non-negative data including zeros?
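For the visualization side of treating the zeros separately, a minimal sketch might report the fraction of zeros as a single number and histogram only the nonzero values on the log10 scale. The function name `summarize_with_zeros_apart`, the bin count, and the sample values are invented for this example.

```python
import numpy as np

def summarize_with_zeros_apart(values, bins=20):
    """Return the fraction of zeros plus a log10 histogram of the nonzero values."""
    values = np.asarray(values, dtype=float)
    nonzero = values[values > 0]
    zero_fraction = 1.0 - nonzero.size / values.size
    counts, edges = np.histogram(np.log10(nonzero), bins=bins)
    return zero_fraction, counts, edges

# Made-up measurements with 40% zeros.
data = np.array([0.0, 0.0, 0.0, 1.0e3, 4.0e3, 2.0e5, 7.0e6, 3.0e8, 1.0e10, 0.0])
zero_fraction, counts, edges = summarize_with_zeros_apart(data, bins=7)

print(f"fraction of zeros: {zero_fraction:.2f}")
# Crude text histogram of the nonzero values, one row per log10 bin.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"10^{left:.1f} to 10^{right:.1f}: {'#' * int(count)}")
```

Keeping the zeros out of the histogram avoids the artificial spike that an added constant creates, at the cost of having to report the zero fraction alongside the plot.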

omidi