
I have a count dataset (num_samples=7, num_attributes=14117) that I want to normalize (for lack of a better word). Each of the samples has a different total number of counts. Ultimately, I want to remove noise, scale everything onto the same level, and then build a correlation-based network. This question focuses on a single vector from that dataset (n=14117). The data contain many zeros (3236 zero values) and low values, so I thought I would drop all values that are $\le 30$.

Many papers looking at similar datasets have used z-score normalization (mean-centered, divided by the standard deviation), but these count datasets are far from normally distributed. I've heard of zero-inflated Poisson models for count data but, even then, this data isn't Poisson-distributed.

Is there a method analogous to z-score normalization for skewed datasets with lots of low values? It wouldn't make sense to transform the data into units of standard deviation if the data isn't normally distributed. Is the answer as simple as cutting the noise and then squeezing the remaining values into a certain interval?

import numpy as np
import matplotlib.pyplot as plt

# DF_data is the samples x attributes count DataFrame (7 x 14117)
u = DF_data.iloc[0, :]   # first sample across all 14,117 attributes
v = u[u >= 30]           # keep only counts >= 30 (boolean indexing; Series.compress is deprecated)
w = np.log(v)            # log-transform the thresholded counts

# Plot kernel density estimates of each version of the data
fig, ax = plt.subplots(figsize=(15, 3), ncols=3)

u.plot(kind="kde", ax=ax[0])
v.plot(kind="kde", ax=ax[1])
w.plot(kind="kde", ax=ax[2])

# Plot means
ax[0].axvline(x=u.mean(), linestyle="-", color="black", linewidth=1)
ax[1].axvline(x=v.mean(), linestyle="-", color="black", linewidth=1)
ax[2].axvline(x=w.mean(), linestyle="-", color="black", linewidth=1)

# Labels
ax[0].set_title("raw data (n=14,117 attributes)")
ax[1].set_title("raw data >= 30 counts (n=3,846 attributes)")
ax[2].set_title("log transformed >= 30 counts (n=3,846 attributes)")

# Start each x-axis at zero
ax[0].set_xlim((0, ax[0].get_xlim()[1]))
ax[1].set_xlim((0, ax[1].get_xlim()[1]))
ax[2].set_xlim((0, ax[2].get_xlim()[1]))
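For reference, this is a minimal sketch of the z-score ("standardization") step the papers apply, shown here on the log-transformed subset w from the code above. Nothing in it assumes normality; it only mean-centers and rescales:

# z-score: subtract the mean and divide by the standard deviation
z = (w - w.mean()) / w.std()
print(z.mean(), z.std())   # approximately 0 and 1 by construction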

Similar questions that did not address this question:

• RNA-Seq data distribution (I don't believe my data follows a negative binomial distribution)

• How best to normalize count data to compare two distributions

• Standardizing or normalizing count data

O.rka
  • Usually the z-transformation is called "standardizing" in the texts I am familiar with. Any i.i.d. variable (normal or not) with real and finite mean and variance will, after transformation, have a mean of 0 and a standard deviation of 1. This includes count data. – Alexis Jan 31 '17 at 02:38
  • Common normalization in biostatistics is to remove outliers (the big counts) in your skewed data. – SmallChess Jan 31 '17 at 02:46
  • Box-Cox transformation should help. Try square root at first since your data is Poisson distributed. If it will not help - use Box-Cox. You can also fit a zero inflated Poisson and perform a quantile mapping of the data to normal distribution (it is called quantile normalization) – German Demidov Jan 31 '17 at 03:24
  • @GermanDemidov The OP says the data is not Poisson. Otherwise he would have many standard RNA-Seq normalization techniques. – SmallChess Jan 31 '17 at 04:33
  • @StudentT sorry, I understood the post wrongly. Then: quantile normalisation, fit the most suitable distribution and map to normal. – German Demidov Jan 31 '17 at 05:42
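A minimal sketch of the two suggestions in the comments above (Box-Cox, and a rank-based quantile mapping to a normal distribution); it assumes scipy is available and reuses the thresholded Series v from the question:

from scipy import stats

# 1) Box-Cox: estimates a power transform that makes the data as close to normal
#    as possible; requires strictly positive values, which the >= 30 cutoff guarantees.
v_boxcox, lam = stats.boxcox(v.values)

# 2) Rank-based quantile mapping to a normal distribution ("quantile normalization"
#    in the sense of the comment): replace each value by the standard-normal
#    quantile of its rank.
ranks = stats.rankdata(v.values)
v_gaussian = stats.norm.ppf(ranks / (len(v) + 1))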
