I have data with non-negative values ranging from 0 to 21 (min = 0, 1st quartile = 0, median = 2, mean = 3.1, 3rd quartile = 4, max = 21). The distribution, plotted with ggplot2::geom_density(), is strongly right-skewed with its peak at 0 (the original image is not available). I know for a fact (based on the scientific literature for my research) that there is a substantial proportion of negative values, but these values cannot be collected.

Since my actual data is constrained to positive values, how can I get an estimate of the distribution allowing for negative values?

Could adding a constant to each observation help find the shape of the distribution and then be used to model the negative values? (example data below)

library(tidyverse)
# Example data 
a <- rep(0, 59)
b <- rep(1, 31)
c <- rep(2, 23)
d <- rep(3, 20)
e <- rep(4, 10)
f <- rep(5, 9)
g <- rep(6, 6)
h <- rep(7, 6)
i <- 8:21  # one observation each for the values 8 through 21


df <- data.frame(config1 = c(a,b,c,d,e,f,g,h,i), 
                   config2 = c(a+2,b+2,c+2,d+2,e+2,f+2,g+2,h+2,i+2)) %>% 
  pivot_longer(cols= c(config1, config2) ,names_to = "config", values_to= "values")

# my actual distribution is "config1", adding a constant gives "config2"
p1<-df %>% 
  ggplot() +
  aes(x = values, fill = config) +
  geom_density(alpha = 0.4)
p1
CyG
  • Absolutely nothing can be said about the distribution of the negative values from this information alone. – whuber May 14 '21 at 11:57

1 Answer

You can't with a nonparametric approach like kernel density estimation. KDE is a data-based estimate of the probability density function: the shape of the estimated distribution depends entirely on the data and on hyperparameters such as the choice of kernel and bandwidth. It therefore cannot tell you anything about a region where no observations exist.
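
As a small illustration of why adding a constant does not help, the sketch below rebuilds the example data from the question and compares the kernel density estimate before and after a shift of 2. It uses base R's `density()` with its default settings rather than `geom_density()`, which is an assumption made to keep the example self-contained.

```r
# Example data from the question: counts for 0..7, then one each of 8..21
x <- c(rep(0:7, times = c(59, 31, 23, 20, 10, 9, 6, 6)), 8:21)

d1 <- density(x)       # KDE of the observed data
d2 <- density(x + 2)   # KDE after adding a constant

# The bandwidth and the curve heights are unchanged; only the
# evaluation grid moves by the constant that was added.
same_bw    <- isTRUE(all.equal(d1$bw, d2$bw))
same_shape <- isTRUE(all.equal(d1$y, d2$y))
moved_by_2 <- isTRUE(all.equal(d1$x + 2, d2$x))

c(same_bw = same_bw, same_shape = same_shape, moved_by_2 = moved_by_2)
```

The shifted estimate is the same curve moved to the right, so it carries no extra information about values below the original minimum.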

What you should use instead is probably a parametric model: assume a specific functional form for the distribution and treat the observations as coming from a truncated version of it, i.e. take into account the fact that some of the values are not observed.
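
A minimal sketch of the truncated-distribution idea follows, under loud assumptions: the data are simulated (not the question's data), the underlying distribution is assumed to be normal, and the truncation point is assumed to be 0. The parameter values and the helper `negloglik` are hypothetical and chosen only for illustration. The fit maximises the likelihood of a normal density renormalised over the observable region.

```r
set.seed(1)

# Hypothetical "full" data; in practice the negative part is never seen
full <- rnorm(5000, mean = 1, sd = 2)
obs  <- full[full >= 0]          # only non-negative values are observed

# Negative log-likelihood of a normal distribution left-truncated at 0
negloglik <- function(par, x) {
  mu    <- par[1]
  sigma <- exp(par[2])           # optimise log(sigma) so sigma stays positive
  log_dens <- dnorm(x, mu, sigma, log = TRUE)
  # log P(X >= 0): the mass of the observable region, used to renormalise
  log_mass <- pnorm(0, mu, sigma, lower.tail = FALSE, log.p = TRUE)
  -sum(log_dens - log_mass)
}

fit <- optim(c(mean(obs), log(sd(obs))), negloglik, x = obs)

mu_hat    <- fit$par[1]
sigma_hat <- exp(fit$par[2])
p_neg_hat <- pnorm(0, mu_hat, sigma_hat)  # implied share of unobserved negatives

c(mu = mu_hat, sigma = sigma_hat, p_negative = p_neg_hat)
```

The estimated share of negative values comes entirely from the assumed family: a different choice (for example a skewed or heavier-tailed distribution) can fit the observed part similarly well and imply a very different negative tail, so the choice needs to be justified from outside the data.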

Tim
  • Would it be fair to paraphrase this as "if you make a huge, unverifiable assumption, you can produce any answer you want"? ;-) – whuber May 14 '21 at 12:17
  • @whuber yes, it is a good paraphrase. If you don't have grounds to make a reasonable choice about the distribution to fit, the result would be... – Tim May 14 '21 at 12:20