
I have a bivariate data set where x and y values, both continuous variables, correlate well (visually speaking) when x and y are small. As the values of x and y increase, the correlation decreases to an eventual random state when x and y are large. My question is, given a y value, I would like to know the probability of the value of a given x. Should I generate a probability distribution for this data?

I would very much appreciate any practical advice specifically for Python.

chl
user1728853
  • For any continuous distributions, the probability of any specific value is infinitesimal. – Peter Flom Oct 10 '12 at 21:14
  • Perhaps my question is not framed correctly. Imagine if for all x values where x=1, all y values were 10. Based on this, I'd like to say, that given the distribution of the data, if y=1 the probability of x=10 is 100%. – user1728853 Oct 10 '12 at 21:20
  • Then the data are not continuous. – Peter Flom Oct 10 '12 at 21:21
  • Can I not draw a value of 1 from continuous data? – user1728853 Oct 10 '12 at 21:23
  • Not 1 exactly, not for truly continuous data; or, rather, the chance of getting 1 exactly is infinitesimal, but not 0. It gets into a question of integrals vs. summation signs. You can say things about "1 or less", if you like. – Peter Flom Oct 10 '12 at 21:25
  • In addition to @PeterFlom's points, I wonder if you are thinking that estimation of X values given Y & estimation of Y values given X are symmetric. Intuitively, it seems like they should be, but they aren't. You may find this interesting: [What is the difference between doing linear regression on y with x versus x with y?](http://stats.stackexchange.com/questions/22718//22721#22721). – gung - Reinstate Monica Oct 10 '12 at 22:18

1 Answer


You might first explore fitting the data to a bivariate normal model. Of course, you'll want to evaluate whether a bivariate normal model is sensible, given your note about the correlation decreasing at larger values. That's hard to interpret without more detail: if you have a lot of data in those higher regions, then a bivariate normal model might not be a good idea, but if the data there is sparse, you might remove those points to get a cleaner model.

I suggest trying a bivariate normal model because it has several simplifications which will support your goal:

  • You can fit the model by calculating just 5 simple parameters: $\mu_X$, $\mu_Y$, $\sigma_X$, $\sigma_Y$, and the correlation coefficient $\rho$.
  • And for a bivariate normal distribution, the conditional distribution is itself a normal distribution. Your core question (how to state that if $y = \text{<some value>}$ then $x$ falls in a given range of values with some probability) requires conditional distributions, and a normal conditional distribution is easily calculated from a bivariate normal model:
    $$ \begin{aligned} E[X | Y = y] &= \mu_X + \rho \, \frac{\sigma_X}{\sigma_Y} \, (y - \mu_Y) \\ \text{Var}[X | Y = y] &= \sigma_X^2 \, (1 - \rho^2) \end{aligned} $$
  • With the conditional expectation and the conditional variance, you now have the parameters for the (conditional) normal distribution: your new $\mu$ is $E[X | Y=y]$ and your new stdev is $\sqrt{\text{Var}[X | Y=y]}$.
  • And Peter Flom's point is spot on: it's more intuitive to talk about the probability of $X$ being in a range of values, so specifically you want to use the cumulative form (the CDF) of the normal distribution for your conditional distribution.
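As a rough illustration of the steps listed above, here is a minimal sketch in Python using numpy. The function names `fit_bivariate_normal` and `conditional_x_given_y` are made up for this example, and the sketch simply assumes the bivariate normal model is adequate for the data; it does not check that.

```python
import numpy as np

def fit_bivariate_normal(x, y):
    """Estimate the 5 parameters (hypothetical helper, not a library function)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mu_x, mu_y = x.mean(), y.mean()
    # ddof=1 gives the sample standard deviation
    sigma_x, sigma_y = x.std(ddof=1), y.std(ddof=1)
    # Pearson correlation coefficient between x and y
    rho = np.corrcoef(x, y)[0, 1]
    return mu_x, mu_y, sigma_x, sigma_y, rho

def conditional_x_given_y(y0, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Mean and standard deviation of X given Y = y0 under a bivariate normal."""
    cond_mean = mu_x + rho * (sigma_x / sigma_y) * (y0 - mu_y)
    cond_sd = sigma_x * np.sqrt(1.0 - rho**2)
    return cond_mean, cond_sd
```

Calling `conditional_x_given_y(y0, *fit_bivariate_normal(x, y))` would then give the two parameters of the conditional normal distribution for a chosen `y0`.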

Regarding the Python, that's harder to say without knowing what you're using. But I suspect that if you're able to plot and examine your data well enough to draw those conclusions, then you also have basic statistical commands available (via scipy / numpy) and can easily calculate the 5 parameters mentioned above. With those, it's just a matter of evaluating the two formulas above.

From there use something like scipy.stats to calculate the cumulative normal distribution values for given $\mu$ and stdev. Don't try to program that function manually - it's doable but requires some rather arcane formulas!
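For instance, one way to do this is with `scipy.stats.norm.cdf`, which takes the mean as `loc` and the standard deviation as `scale`. The helper name `prob_x_in_range` below is made up for this sketch, and `cond_mean` and `cond_sd` are assumed to be the conditional mean and standard deviation of $X$ given $Y = y$ computed from the formulas above.

```python
from scipy.stats import norm

def prob_x_in_range(lo, hi, cond_mean, cond_sd):
    """P(lo <= X <= hi | Y = y) for a normal conditional distribution.

    Hypothetical helper: cond_mean and cond_sd are the conditional
    mean and standard deviation of X given the chosen y.
    """
    upper = norm.cdf(hi, loc=cond_mean, scale=cond_sd)
    lower = norm.cdf(lo, loc=cond_mean, scale=cond_sd)
    return upper - lower
```

A one-sided statement such as "x is at most some value" is just `norm.cdf(value, loc=cond_mean, scale=cond_sd)`.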

gung - Reinstate Monica
thomas
  • +1, nice answer! Welcome to the site, @tabSF. You'll be happy to know that CV supports $\LaTeX$ via mathjax. I took the liberty of tweaking your answer for greater readability; make sure it still says what you want. Since you're new here, you may want to read our [FAQ](http://stats.stackexchange.com/faq) as well, which covers the markup options available (among other things). – gung - Reinstate Monica Oct 11 '12 at 02:34
  • @gung, thanks for the tip and applying the latex! I will definitely check out the FAQ -- am very excited to discover this site & community. – thomas Oct 11 '12 at 02:51
  • Thanks so much! Very helpful. I do have a lot of data in the higher regions. I'll look into the CDF suggestion. – user1728853 Oct 11 '12 at 12:07