I have a dataset containing some hundreds of thousands of observations, out of which some small number contain an event of interest x. Let's say that my total dataset is large enough that I have a decent confidence in the overall frequency of x.
But what I'm really interested in is the frequency of x together with some other condition y, and the frequency of y in the dataset is much lower. The total number of observations of y doesn't give me enough data to make a confident prediction about how well it correlates with x, and the actual number of observations of x+y is often zero, even though the theoretical frequency of x+y must be something larger than zero.
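To make the problem concrete, here is a minimal Python sketch with made-up counts (none of these numbers come from my real data, and the function names are just for illustration). It reads "frequency of x+y" as the rate of x among the observations that have y, and shows why a raw count of zero tells me so little: with zero events in n trials, the exact one-sided 95% upper bound on the rate is 1 - 0.05^(1/n), which can still sit far above the overall rate of x.

```python
def naive_rate(events, trials):
    """Plain observed frequency: events divided by trials."""
    return events / trials


def zero_event_upper_bound(trials, confidence=0.95):
    """Exact one-sided upper bound on a binomial rate when 0 events
    were seen in `trials` trials.

    Solves (1 - p) ** trials = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


# Hypothetical counts, chosen only to illustrate the situation.
n_total = 300_000  # all observations
n_x = 600          # observations containing event x
n_y = 150          # observations satisfying condition y
n_xy = 0           # observations containing both x and y

overall_x_rate = naive_rate(n_x, n_total)        # rate of x in the whole dataset
observed_x_given_y = naive_rate(n_xy, n_y)       # rate of x among the y observations
upper_x_given_y = zero_event_upper_bound(n_y)    # what the zero count cannot rule out

print(f"overall rate of x:            {overall_x_rate:.4f}")
print(f"observed rate of x within y:  {observed_x_given_y:.4f}")
print(f"95% upper bound within y:     {upper_x_given_y:.4f}")
```

With these invented counts the observed rate within y is zero, but the upper bound is roughly 2%, about ten times the overall rate of x, so the zero by itself is compatible with either a negative or a strongly positive relationship.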
So how can I estimate the true probability of x+y, given the overall frequency of x in the dataset and the small-ish number of instances of y that I have?
Edit: I know that x and y are not independent, but at the outset I don't know anything about the nature of the relationship between them. The entire point of the exercise is to determine whether they have a positive or negative correlation.
Sorry, I know next to nothing about statistics and I don't know what the proper terminology is to describe this situation.