2

I am reading the source code of MAGeCK 0.5.9.5 which is a popular software for analyzing CRISPR knockout screen data. In its mageckMathFunc.py, it calculates "mean" and "variance" as follows:

def mmedian(lst):
  sortedLst = sorted(lst)
  lstLen = len(lst)
  if lstLen==0:
    return 0.0
  index = (lstLen - 1) // 2

  if (lstLen % 2):
    return sortedLst[index]
  else:
    return (sortedLst[index] + sortedLst[index + 1])/2.0

def getMeans(matt):
  meanvalue=[mmedian(v) for v in matt]
  return meanvalue

def getVars(matt):
  matt=list(matt)
  meanvalue=getMeans(matt)
  varvalue=[ sum([ (kj-meanvalue[i])*(kj-meanvalue[i])  for kj in matt[i]  ]  )/max(float(len(matt[i]))-1,1.0)    for i in range(len(meanvalue))]
  return varvalue

Then it uses their return values to use as sample mean and sample variance extensively throughout the code. Apparently, they are not the conventional mean and variance based on simple arithmetic mean. Is there a name for this type of "sample mean" and "sample variance" based on median? Under what circumstances should we use them instead of the conventional ones? Thank you very much in advance.

ymc
  • 21
  • 1
  • 2
    This looks weird. If the median is used as a location parameter, it is more common to use the inter quartile range (IQR) as a measure for dispersion. – cdalitz Dec 23 '21 at 06:00
  • It is indeed weird. I try to write R code to reproduce the MAGeCK numbers. I must use this definition of variance to get the numbers right. – ymc Dec 23 '21 at 08:26
  • 2
    This would work with large datasets having no outliers. It makes no statistical sense, though, because (a) one uses the median to obtain a location estimate that is resistant to outlying values but (b) using the squared deviations from *anything* is even *more* sensitive to outliers than the arithmetic mean that was avoided in the first place! One way to understand this calculation is that it returns the sum of the (usual) variance and the squared difference between the mean and median. As such, it estimates a somewhat complicated quantity describing both *spread* and *skewness.* – whuber Dec 23 '21 at 16:11

1 Answers1

2

Comment: I am not sure of the formula for your variance estimate or the purpose in using something other than the usual sample variance $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2$, where $\bar X$ is the sample mean.

Perhaps this page linked in the margin is relevant.

Also, perhaps consider the function mad for Median Absolute Deviation in R. Roughly, it is the median of the absolute deviations from the sample median.

MAD has been proposed as a robust measure of variability.

For a normal sample, when mad is multiplied by $1.4826.$ (the default in R), its expected value is the normal standard deviation $\sigma.$

set.seed(1234)
x = rnorm(10000, 100, 15)
mad(x)
[1] 14.79603  # aprx 15
sd(x)
[1] 14.81294
BruceET
  • 47,896
  • 2
  • 28
  • 76
  • 1
    Thanks for your reply. In the code, it uses the median and the "variance" relative to median to find outliers by setting a cutoff of median+4*sqrt("variance" relative to median). It just uses them as if they are the conventional sample mean and sample variance. In order to reproduce the numbers in the output of MAGeCK, I have to use this definition of variance. I am aware of MAD but not this. I am not sure about the rationale behind it, so I try to see if anyone here knows. – ymc Dec 23 '21 at 08:25