I am a programmer, not a statistician, so pardon any botched use of terms. My basic problem is this: I want to calculate $R^2$ between a known concentration (which can be any non-negative value) and a discrete measurement (where the values are all integers). There are 92 possible observations, and each of these 92 has a known actual concentration. We wish to measure the accuracy of the measurements by looking at the $R^2$.

The different cases (each a different molecule) have concentrations which vary across orders of magnitude, so we use the log (base 2) of the values when calculating the $R^2$. This is an industry standard convention for this case.
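
For concreteness, the calculation described above might look like the following minimal sketch. The function name `log2_r_squared` and the example numbers are hypothetical, and the sketch assumes every value is strictly positive; how zero counts are treated before taking the log is not stated here, so the function simply refuses them.

```python
import numpy as np
from scipy import stats


def log2_r_squared(known_conc, counts):
    """R^2 of a straight-line fit between log2(known_conc) and log2(counts)."""
    known_conc = np.asarray(known_conc, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # log2 is undefined at 0, so zero counts need a separate decision
    # (that decision is an open question, not something settled here).
    if np.any(known_conc <= 0) or np.any(counts <= 0):
        raise ValueError("log2 is undefined for zero or negative values")
    fit = stats.linregress(np.log2(known_conc), np.log2(counts))
    return fit.rvalue ** 2


# Hypothetical concentrations spanning several orders of magnitude.
example_conc = np.array([0.5, 2.0, 8.0, 32.0, 128.0])
example_counts = np.array([1, 4, 15, 70, 250])
example_r2 = log2_r_squared(example_conc, example_counts)
print(f"R^2 on the log2 scale: {example_r2:.4f}")
```

Squaring `rvalue` from `scipy.stats.linregress` gives the $R^2$ of a simple linear regression, which is the same as the squared Pearson correlation of the two log-transformed variables.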

In most cases, this works well. However, in cases where the measurements are relatively low, I suspect that the discreteness causes an artificially low $R^2$. For example, if the known concentration should result in a measurement of 0.1, the nearest possible measurements are 0 and 1, so either outcome will be interpreted as error. Even if there are around 10 cases with concentrations that should each give a measurement around 0.1, and we get one detection for one of them and 0 for the other 9 (which is about what we would expect), this will still be interpreted as error.
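
To put rough numbers on this example, here is a small sketch that assumes the counts follow a Poisson distribution. That assumption is not established in the question itself (it is only suggested in the comments below), and the variable names are illustrative.

```python
from scipy import stats

expected_count = 0.1   # expected measurement per molecule, from the example above
n_molecules = 10       # "around 10 cases" with similar concentrations

# Probability that a single molecule is reported as 0, or as 1 or more.
p_zero = stats.poisson.pmf(0, expected_count)
p_one_or_more = stats.poisson.sf(0, expected_count)

# A sum of independent Poisson counts is Poisson with the summed mean,
# so the total over the 10 molecules has mean 10 * 0.1 = 1.
p_total_is_one = stats.poisson.pmf(1, expected_count * n_molecules)

print(f"P(count = 0)  for one molecule: {p_zero:.3f}")
print(f"P(count >= 1) for one molecule: {p_one_or_more:.3f}")
print(f"P(exactly one detection across all {n_molecules}): {p_total_is_one:.3f}")
```

Under that assumption each molecule reads 0 about 90% of the time, and a single detection across all ten is the single most likely non-empty outcome (probability about 0.37), so the pattern described above is unremarkable even for a perfectly working instrument.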

For higher concentrations, this is obviously less of an issue (if the detections should have been 118.5 and we got either 118 or 119, it will correctly interpret this as not much error). However, I don't want to just make up my own correction for this.

My guess is there is some standard way of handling the calculation of $R^2$ between a continuous and a discrete variable. Can you point me at it?

I'm doing my calculations in Python using the scipy.stats module, but if you just know the name of the proper calculation method and don't know the Python code, that's perfectly fine.

P.S. To be clearer, there are 92 molecules of known concentration, and we are measuring their concentration using a technique. We want to know if a given measurement run went OK, so the $R^2$ of the measurement run (which is discrete, i.e. how many counts you have for that molecule) vs. the known true concentration (which is continuous) is being used to determine how this run's accuracy compares to other runs (for the exact same set of molecules). Hopefully the fact that it is always the same 92 x-axis values (where y is the measurement), and that we are only comparing one measurement run of this type to another, makes $R^2$ not too bad a metric to use here.

ttnphns
rossdavidh
  • Unfortunately, R-squared does not measure accuracy! (Many threads on this site address this question, such as http://stats.stackexchange.com/questions/13314/.) For an introduction to some of the issues in calibration one faces even before dealing with the discreteness issue, see http://www.ltrr.arizona.edu/~jburns/Articles%20-Read/masslav.pdf. The regression setting described there can be generalized to discrete responses using a generalized linear model (perhaps for a Poisson distribution, depending on how your instrument works). – whuber Jul 03 '12 at 20:47
  • Oh dear. As a non-statistician, I am going to have a hard sell convincing folks that the standard way of measuring how well a measurement correlates to the known values, is not what they want. If I am forced to use some variant of R-squared for this, is there a way to address the discreteness issue? – rossdavidh Jul 03 '12 at 21:03
  • It's a gene sequencer. It is producing counts of how many times the molecule in question was found, and this should correlate to the molecule's concentration in the original sample. So not explicitly rounding. – rossdavidh Jul 03 '12 at 21:13
  • Re the edit: Yes, it's reasonable to hope that R-squared can be used to compare runs based on the same 92 $X$ values. However, it must be understood that R-squared is an average over the entire calibration curve. It is possible, for instance, to really screw up the curve at the low end (which could hugely change detection and quantification limits) while improving it in the middle (such as making it more linear there) and in the balance the R-squared might not detect that. – whuber Jul 03 '12 at 21:14
  • Good to know, thanks! Any ideas on the discreteness issue? – rossdavidh Jul 03 '12 at 21:29
  • I think a Poisson GLM with identity link should work well--but note that R-squared is not appropriate for comparing two such models! I would like to hold back on elaborating this answer, in the hope that someone with experience analyzing gene sequencer data will contribute a reply. Depending on the data, the discreteness may be a non-issue. – whuber Jul 03 '12 at 22:23
  • @whuber, just out of curiosity, why do you recommend an identity link instead of a log link here? – gung - Reinstate Monica Jul 03 '12 at 23:00
  • @gung For a true concentration $X$ the instrument's count $Y$ should be directly proportional to $X$ and have a Poisson distribution, whence our model is $Y\sim\mathrm{Poisson}(\beta_0+\beta_1X)$: that's a Poisson GLM with identity link. The "intercept" $\beta_0$ may be needed to model a low-level "background" response: the "noise" that emerges when $X=0$. – whuber Jul 04 '12 at 17:53
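
The model in the last comment, $Y\sim\mathrm{Poisson}(\beta_0+\beta_1X)$, can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: it maximizes the Poisson log-likelihood directly with `scipy.optimize.minimize` instead of using a dedicated GLM routine, the helper name `fit_poisson_identity` is made up, the data are simulated with hypothetical parameter values, and both coefficients are constrained to be positive so that the mean stays positive for non-negative concentrations.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln


def fit_poisson_identity(x, y):
    """Maximum-likelihood fit of Y ~ Poisson(beta0 + beta1 * X)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def neg_log_lik_and_grad(beta):
        # Identity link: the Poisson mean is linear in the concentration.
        mu = np.maximum(beta[0] + beta[1] * x, 1e-12)
        nll = -np.sum(y * np.log(mu) - mu - gammaln(y + 1.0))
        resid = y / mu - 1.0
        grad = -np.array([resid.sum(), (resid * x).sum()])
        return nll, grad

    # Start near a purely proportional response; the bounds keep the mean
    # positive (this assumes x >= 0, as concentrations are).
    start = np.array([0.5, max(y.sum() / x.sum(), 1e-6)])
    result = minimize(neg_log_lik_and_grad, start, jac=True,
                      method="L-BFGS-B",
                      bounds=[(1e-9, None), (1e-9, None)])
    return result.x, -result.fun


# Simulated stand-in for one run: 92 concentrations across orders of magnitude.
rng = np.random.default_rng(0)
true_beta0, true_beta1 = 0.2, 1.5          # hypothetical background and slope
conc = 2.0 ** rng.uniform(-3, 10, size=92)
counts = rng.poisson(true_beta0 + true_beta1 * conc)

(beta0_hat, beta1_hat), log_lik = fit_poisson_identity(conc, counts)
print(f"beta0 = {beta0_hat:.3f}, beta1 = {beta1_hat:.3f}, log-likelihood = {log_lik:.1f}")
```

Here `beta0_hat` plays the role of the background response at $X=0$ and `beta1_hat` the proportionality constant. Because the fit works on the raw counts, zero counts at low concentrations need no special handling and no log transform.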

0 Answers