I have a count dataset (num_samples=7, num_attributes=14117) that I want to normalize (for lack of a better word). Each of the samples has a different number of counts. Ultimately, I want to get rid of noise, scale everything onto the same level, and then create a correlation-based network. This question focuses on a single vector from that dataset (n=14117). This dataset has a lot of zeros (3236 zero values) and low values, so I thought about dropping all values that were $< 30$.
Many papers looking at similar datasets have used z-score normalization (mean-centered, then divided by the standard deviation), but these count datasets are far from normally distributed. I've heard of zero-inflated Poisson models for count data but, even then, this data isn't Poisson distributed.
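To make the concern concrete, here is a minimal sketch of the z-score normalization I mean. The `counts` vector is a made-up toy example (not my real data), and `zscore` and `skewness` are helper names I introduce only for illustration. Because the z-score is a linear transformation, it changes the location and scale of the values but leaves the skew untouched.

```python
import numpy as np

# Hypothetical toy counts standing in for one sample (not my real data)
counts = np.array([0, 0, 0, 2, 5, 31, 48, 120, 950, 14000], dtype=float)

def zscore(x):
    """Mean-center and divide by the (population) standard deviation."""
    return (x - x.mean()) / x.std()

def skewness(x):
    """Third standardized moment, used here as a simple measure of skew."""
    return np.mean(((x - x.mean()) / x.std()) ** 3)

z = zscore(counts)

# The result has mean 0 and standard deviation 1 by construction,
# but the skewness is the same before and after the transformation.
print("mean:", z.mean(), "std:", z.std())
print("skewness before:", skewness(counts), "after:", skewness(z))
```

So the transformed values are in units of standard deviation, yet the distribution is exactly as lopsided as before.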
Is there a method analogous to z-score normalization for skewed datasets with lots of low values? It doesn't seem to make sense to transform the data into units of standard deviation if the data isn't normally distributed. Is the answer as simple as cutting the noise and then pushing all values into a certain interval?
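import numpy as np
import matplotlib.pyplot as plt

# DF_data is my pandas DataFrame of counts (rows = samples, columns = attributes)
u = DF_data.iloc[0, :]  # first sample
v = u[u >= 30]          # keep only counts >= 30 (boolean mask)
w = np.log(v)           # log-transform the remaining counts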
# Plot data
fig, ax = plt.subplots(figsize=(15,3), ncols=3)
u.plot(kind="kde", ax=ax[0])
v.plot(kind="kde", ax=ax[1])
w.plot(kind="kde", ax=ax[2])
# Plot mean
ax[0].axvline(x=u.mean(), linestyle="-", color="black", linewidth=1)
ax[1].axvline(x=v.mean(), linestyle="-", color="black", linewidth=1)
ax[2].axvline(x=w.mean(), linestyle="-", color="black", linewidth=1)
# Labels
ax[0].set_title("raw data (n=14,117 attributes)")
ax[1].set_title("raw data >= 30 counts (n=3,846 attributes)")
ax[2].set_title("log transformed >= 30 counts (n=3,846 attributes)")
# Limits for plot
ax[0].set_xlim((0,ax[0].get_xlim()[1]))
ax[1].set_xlim((0,ax[1].get_xlim()[1]))
ax[2].set_xlim((0,ax[2].get_xlim()[1]))
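The "cut the noise, then push all values into an interval" idea from my question could be sketched roughly as follows. This is only an illustration under my own assumptions: the `counts` vector is a made-up toy example, the target interval [0, 1] is arbitrary, and `rescale` is a hypothetical helper, not a library function.

```python
import numpy as np

# Hypothetical toy counts standing in for one sample (not my real data)
counts = np.array([0, 0, 3, 12, 30, 45, 200, 1500, 9000], dtype=float)

threshold = 30  # same cutoff as in the plotting code above
kept = counts[counts >= threshold]

# Log-transform to compress the long right tail
# (safe here because every kept value is >= 30, so no log(0))
logged = np.log(kept)

def rescale(x, lo=0.0, hi=1.0):
    """Linearly map x onto the interval [lo, hi] (min-max scaling)."""
    return lo + (x - x.min()) * (hi - lo) / (x.max() - x.min())

scaled = rescale(logged)
print(scaled)
```

This keeps the rank order of the retained values and puts every sample on the same [0, 1] range, but I don't know whether it is a statistically sound substitute for a z-score.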
Similar questions that did not address this question:
RNA-Seq data distribution (I don't believe my data follows a negative binomial distribution)
How best to normalize count data to compare two distributions