
I am working on a project with a data set in which people may have a certain quality of interest, say "red hair." My goal is to estimate the probability that an individual has red hair. Normally, we'd fit some model (logistic regression, random forest, etc.) that regresses the outcome on the different covariates.

My data has two values: 1 and NA. A value of 1 indicates that the individual has red hair. An NA, however, only indicates that we have no data; it does not necessarily mean that the individual does not have red hair. (If it makes a difference, an NA most likely corresponds to an individual without red hair, since red hair is a relatively rare trait, but otherwise we have no information.) I also have standard demographic information (sex, age, etc.) available as covariates that I'd ideally use to predict red-headedness.

Understanding that most (all?) models aren't equipped to handle a response variable with no variability (every observed value here is 1), I wanted to ask a question: How would you try to estimate the probability that a given individual has red hair with the data mentioned above?

Any thoughts, discussions, articles, or creative approaches are welcome!

Matt Brems
  • Do you have population values? I.e. $\mu \pm 2\sigma = 0.05 \pm 0.025$ proportion of individuals in the population at large have red hair? – Chris C Oct 29 '15 at 15:47
  • "red hair is a relatively rare trait, but otherwise we don't have any information" - this sounds like a textbook example for a Bayesian analysis, which *can* deal with all-zero or all-one data. – Stephan Kolassa Oct 29 '15 at 15:53
  • Stephan, is there a technique in particular that you might suggest? My training in Bayesian analysis is pretty patchy, so any direction or reference would be helpful! – Matt Brems Oct 29 '15 at 15:57
  • You can calculate a one-sided confidence interval of the form $[p_l,1]$, that is, a lower bound. I'm sure this has been discussed on this site, but I cannot find it. See: http://www.lexjansen.com/nesug/nesug13/41_Final_Paper.pdf http://stats.stackexchange.com/questions/82720/confidence-interval-around-binomial-estimate-of-0-or-1 – kjetil b halvorsen Oct 29 '15 at 16:14
  • I *think* (my Bayes is pretty spotty, too) that there is an example about Bernoulli trials with beta priors in Gelman et al., [*Bayesian Data Analysis*](http://www.stat.columbia.edu/~gelman/book/), but I don't have it at hand. – Stephan Kolassa Oct 29 '15 at 20:19
  • Ha! Luckily for me, that's the book I have. Thanks, Stephan! – Matt Brems Oct 29 '15 at 21:53
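
The two suggestions in the comments above (a beta prior for Bernoulli data, and a lower confidence bound of the form $[p_l,1]$) can be sketched in a few lines of Python. This is only a rough illustration, not anyone's worked answer. The function names are hypothetical, the prior reuses the illustrative figures from the first comment (mean 0.05, with $2\sigma = 0.025$), and the count of 10 observed 1s is made up. It also assumes the observed 1s behave like a random sample of Bernoulli trials, which the 1/NA data in the question do not guarantee.

```python
def beta_from_mean_sd(mean, sd):
    """Moment-match a Beta(a, b) prior to a given mean and standard deviation."""
    var = sd ** 2
    if not 0 < mean < 1 or var >= mean * (1 - mean):
        raise ValueError("no Beta distribution has this mean and sd")
    total = mean * (1 - mean) / var - 1  # this is a + b
    return mean * total, (1 - mean) * total


def beta_posterior(a, b, successes, trials):
    """Conjugate update: Beta(a, b) prior plus binomial data gives a Beta posterior."""
    return a + successes, b + (trials - successes)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def all_ones_lower_bound(n, alpha=0.05):
    """Exact one-sided lower confidence bound p_l when all n trials equal 1.

    Solves p_l ** n = alpha, so the interval is [p_l, 1].
    """
    if n < 1:
        raise ValueError("need at least one trial")
    return alpha ** (1.0 / n)


if __name__ == "__main__":
    # Illustrative prior from the comment thread: mean 0.05, sd 0.0125 (assumed values).
    a, b = beta_from_mean_sd(0.05, 0.0125)
    # Hypothetical data: 10 observations, all equal to 1.
    a_post, b_post = beta_posterior(a, b, successes=10, trials=10)
    print("prior mean:", beta_mean(a, b))
    print("posterior mean:", beta_mean(a_post, b_post))
    print("one-sided 95% lower bound:", all_ones_lower_bound(10))
```

The contrast between the two outputs is the point: with an informative prior the all-ones data only nudge the estimate upward, while the frequentist bound on its own pushes $p_l$ close to 1. Neither approach accounts for the NA rows, so how those are treated remains a separate modelling decision.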
