I have N tumor-normal sample pairs from the same tissue of N subjects. For each sample, I have expression measurements for a fixed gene panel (M genes).
The tumor samples are also contaminated with normal tissue, so the measurements in tumor samples are biased by the presence of molecules from normal tissue. Based on experimental evidence, I know the purity* of each tumor sample, so for each sample S_k I have a purity value P_k between 0 and 1. (Normal samples have no measured purity value, but since they are normal we set P_k = 0.)
For simplicity, I am modeling the observed gene expressions as follows:
g_ik = t_ik * (P_k) + n_ik * (1-P_k)
For tumor samples, g_ik is the observed expression (measurement) of gene i in sample S_k, t_ik is the true but unknown expression value originating from the tumor, and n_ik is the true but unknown expression value originating from the normal contamination. For normal samples, setting P_k = 0 gives g_ik = n_ik, so they fit into the same model.
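For illustration, here is a minimal sketch of the most direct way to invert this model for t_ik. It assumes that the matched normal sample from the same subject can stand in for n_ik in the tumor sample, which is an assumption and may not hold in practice. The function name naive_tumor_expression and the toy data are hypothetical and only there to make the example runnable.

```python
import numpy as np


def naive_tumor_expression(g_tumor, g_normal, purity):
    """Solve g = t * P + n * (1 - P) for t, gene by gene and sample by sample.

    g_tumor  : (M, N) observed expression in the tumor samples (g_ik)
    g_normal : (M, N) observed expression in the matched normals,
               used here as a stand-in for n_ik (ASSUMPTION)
    purity   : (N,)   purity P_k of each tumor sample, in (0, 1]
    """
    g_tumor = np.asarray(g_tumor, dtype=float)
    g_normal = np.asarray(g_normal, dtype=float)
    purity = np.asarray(purity, dtype=float)
    if np.any(purity <= 0) or np.any(purity > 1):
        raise ValueError("purity must lie in (0, 1] for tumor samples")
    # purity has shape (N,), so it broadcasts across the M gene rows.
    return (g_tumor - g_normal * (1.0 - purity)) / purity


# Toy data: M = 4 genes, N = 3 subjects, no measurement noise.
rng = np.random.default_rng(0)
M, N = 4, 3
t_true = rng.uniform(1.0, 10.0, size=(M, N))
n_true = rng.uniform(1.0, 10.0, size=(M, N))
P = np.array([0.9, 0.6, 0.3])

g_obs = t_true * P + n_true * (1.0 - P)   # forward model from above
t_hat = naive_tumor_expression(g_obs, n_true, P)
```

On noise-free toy data this recovers t_true exactly. With real measurements, any noise in g_ik or n_ik is scaled by 1/P_k, so estimates for low-purity samples can become unstable or even negative.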
My question is whether you have concrete suggestions on how to compute the t_ik values (pointers to good reading material would also be fine). Any other suggestions concerning the modeling would be welcome as well.
Thank you in advance.
* Purity here is the fraction of cancerous cells in a tumor sample, which is not necessarily the same as the fraction of molecules originating from cancerous cells (expression profiles of tumor and normal cells can be very different).