
I'm currently working on a statistical modelling problem in biology. We have single-cell protein measurements for every cell in a tissue, and I'm using regression analysis to see whether the level of a given protein in a cell is affected by the protein content of that cell's nearest neighbours.

Overall, I'd like a general linear model of the form $Y = \beta X + B$, where $Y$ is the matrix of measurements of multiple proteins across cells, and $X$ is the matrix of the same proteins averaged over the nearest neighbours of each cell.
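To make the construction of $X$ concrete, here is a minimal sketch of how I picture it. It assumes cell positions are available as coordinates and that "nearest neighbours" means the $k$ closest cells in Euclidean distance; the function name `neighbour_average` and the default `k=6` are hypothetical choices for illustration, not part of the actual pipeline.

```python
import numpy as np

def neighbour_average(coords, Y, k=6):
    """Average each protein over the k nearest neighbours of every cell.

    coords : (n_cells, d) array of cell positions (assumed to be available)
    Y      : (n_cells, n_proteins) array of measurements
    k      : number of neighbours (hypothetical choice)
    """
    coords = np.asarray(coords, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Pairwise squared Euclidean distances; O(n^2) memory, fine for a sketch.
    diff = coords[:, None, :] - coords[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(d2, np.inf)          # a cell is not its own neighbour
    idx = np.argsort(d2, axis=1)[:, :k]   # indices of the k closest cells
    return Y[idx].mean(axis=1)            # shape (n_cells, n_proteins)
```

Here cells are rows, so the result is the transpose of $X$ if proteins are taken as rows in the model above. For a real tissue a spatial index or a graph of touching cells would probably be a better neighbour definition than brute-force distances.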

However, we believe that within each cell there is a network that internally affects protein levels. To account for this, I've used LASSO within each cell to regress each protein on all the others (we believe the underlying coefficient matrix is sparse, and with many proteins there are many noise variables). This leaves a matrix of residuals $\tilde{Y}$, whose covariance between different proteins is naturally close to 0, and gives the linear model $\tilde{Y} = \beta X + B$.
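The residualisation step looks roughly like the sketch below. It is only an illustration: the coordinate-descent solver `lasso_cd`, the helper `within_cell_residuals`, and the fixed penalty `lam=0.1` are hypothetical stand-ins (in practice one would use a library LASSO with a cross-validated penalty for each protein).

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                      # current residual y - X b (b starts at 0)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue              # skip constant-zero columns
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ r / n + col_sq[j] * b[j]
            # Soft-thresholding update for coefficient j.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (b[j] - new)
            b[j] = new
    return b

def within_cell_residuals(Y, lam=0.1):
    """Regress each protein on all the others; return the residual matrix.

    Y : (n_cells, n_proteins) array; columns are centred before fitting.
    """
    Yc = np.asarray(Y, dtype=float)
    Yc = Yc - Yc.mean(axis=0)
    R = np.empty_like(Yc)
    for j in range(Yc.shape[1]):
        others = np.delete(Yc, j, axis=1)
        b = lasso_cd(others, Yc[:, j], lam)
        R[:, j] = Yc[:, j] - others @ b
    return R
```

Calling `within_cell_residuals(Y)` on the cells-by-proteins matrix gives the residuals that play the role of $\tilde{Y}$ (again with cells as rows).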

As we believe the interactions in this model to also be sparse, I've been using LASSO regression with the significance testing found in

Bühlmann, Peter, Markus Kalisch, and Lukas Meier. "High-Dimensional Statistics with a View Toward Applications in Biology." Annual Review of Statistics and Its Application 1 (2014): 255-278.

However, as the covariance $\Sigma$ between the columns of $\tilde{Y}$ is approximately 0, and we want to select different predictor variables for each response, I've been running a separate multiple regression for each response variable, rather than fitting a single general linear model to all the response variables at once.
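My reasoning for why this should be harmless, written out as a sketch: take $\tilde{y}_j$ and $\beta_j$ to be the $j$-th rows of $\tilde{Y}$ and $\beta$, $b_j$ the $j$-th intercept, $n$ the number of cells and $p$ the number of proteins, and assume a separate penalty $\lambda_j$ per response (the $\lambda_j$ and the error variances $\sigma_j^2$ are my notation, not from the reference). The joint $\ell_1$-penalised least-squares problem is then

$$
\min_{\beta, B} \; \sum_{j=1}^{p} \left[ \frac{1}{2n} \left\lVert \tilde{y}_j - \beta_j X - b_j \mathbf{1}^\top \right\rVert_2^2 + \lambda_j \lVert \beta_j \rVert_1 \right],
$$

and since the $j$-th term involves only $(\beta_j, b_j)$, minimising the sum is the same as minimising each term on its own. If the errors are Gaussian with diagonal $\Sigma = \operatorname{diag}(\sigma_1^2, \dots, \sigma_p^2)$, the likelihood only rescales the $j$-th squared-error term by $1/\sigma_j^2$, which can be absorbed into $\lambda_j$, so as far as I can tell the separate fits coincide with the joint fit under these assumptions.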

My question is: is this a valid approach? Are there any obvious flaws in the method? Any other general comments would also be welcome.

Thanks in advance.

kezz_smc