Say I have several bioreactors (systems) and a number of independent variables, for example flux intensity, rotational velocity in the bioreactor, nutrition, etc. I also have a response (dependent) variable: the weight of the microbes I'm growing. How do I figure out which variable(s) affect the weight the most? What is the name of the method?

Many thanks

Lennart
  • What kind of model are you using? Regression? – spdrnl Oct 05 '15 at 09:27
  • That's what I'd normally use to start with, but let's just say that I have a matrix with different values and I'm not limited to any one model. To clarify: I have one response variable and several independent variables. – Lennart Oct 05 '15 at 09:45
  • Is your interest in analyzing data you have already collected, or in designing a set of prospective experiments? As you note in a comment, there are likely to be interactions among the independent variables in terms of outcome, so please be a bit more specific about what you mean when you ask: "what variable(s) is/are affecting the weight the most?" An indication of how many variables you are considering would also help. – EdM Oct 05 '15 at 14:40
  • I want to analyze existing data, so it's not an experimental design, and the number of variables is from 5 to 10. Yes, the variables will probably interact with each other and have some kind of limits as well, i.e. an optimal temperature (not a simple trend), optimal oxygenation and so forth. Is this analysis more complicated than I thought? – Lennart Oct 05 '15 at 19:22

1 Answer

Your question could be dealt with as a type of variable-selection or feature-selection problem. This is a very broad issue, as you can tell by following that tag on this site.

With only 5 to 10 independent variables, starting with a regression approach would make sense. There are tools for examining variable importance in regression models; for example, the anova function of the rms package in R provides useful measures and plots. (These are essentially formalizations of the general leave-one-out approach suggested by @spdrnl.) What's complicated is how to apply these tools intelligently in a way that adequately answers your underlying question.
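
To make the leave-one-out idea concrete, here is a minimal sketch in Python with NumPy (not the rms implementation, just the same general idea): fit the full least-squares model, then drop each predictor in turn and record how much the residual sum of squares increases. The predictor names and the synthetic data are hypothetical and only there so the example runs.

```python
import numpy as np

def fit_rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def drop_one_importance(X, y, names):
    """Increase in RSS when each predictor is removed from the full model."""
    full = fit_rss(X, y)
    importance = {}
    for j, name in enumerate(names):
        reduced = np.delete(X, j, axis=1)  # all predictors except column j
        importance[name] = fit_rss(reduced, y) - full
    return importance

# Synthetic, purely illustrative data (assumption: three standardized predictors).
rng = np.random.default_rng(0)
n = 200
names = ["light", "rotation", "nutrient"]
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

importance = drop_one_importance(X, y, names)
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)
```

A larger increase in RSS means the model loses more when that predictor is left out. This ranking is only as good as the model form; it says nothing about nonlinear effects or interactions that were not put into the model.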

Designing the regression model well at the outset is key. The crucial choices depend a lot on your understanding of the subject matter. Will the first-order linear approximations inherent in linear regressions be adequate, or do you need to consider more complicated relations? Should you be working in the scales that the variables are usually expressed in, or should they be transformed (e.g., logarithmically) to fit linear relations better? Which independent variables do you expect to have effects on outcome more-or-less independent of the levels of other variables, and which may need to be included together via interaction terms in the model? A good deal of preliminary data exploration might be needed to help design the model. And after you've designed and fit a model, you will need to examine the results to see if the assumptions you made were valid.
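
As a rough illustration of why the model form matters, the sketch below (Python with NumPy, synthetic data) compares a first-order linear fit with one that adds a squared term and an interaction term. The variable names, the optimum at 30 degrees, and the coefficients are all assumptions made up for the example, not properties of any real bioreactor.

```python
import numpy as np

def r_squared(A, y):
    """Coefficient of determination for a least-squares fit of y on design matrix A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

# Synthetic data: weight peaks at a hypothetical optimal temperature of 30.
rng = np.random.default_rng(1)
n = 300
temp = rng.uniform(20.0, 40.0, size=n)
oxygen = rng.uniform(0.0, 1.0, size=n)
weight = 10.0 - 0.1 * (temp - 30.0) ** 2 + 2.0 * oxygen + rng.normal(scale=0.5, size=n)

ones = np.ones(n)
# First-order model: intercept, temp, oxygen.
linear = np.column_stack([ones, temp, oxygen])
# Expanded model: adds a squared temperature term and a temp-by-oxygen interaction.
expanded = np.column_stack([ones, temp, oxygen, temp ** 2, temp * oxygen])

r2_linear = r_squared(linear, weight)
r2_expanded = r_squared(expanded, weight)
print(round(r2_linear, 3), round(r2_expanded, 3))
```

With an optimum in the middle of the range, the straight-line model sees almost no temperature effect at all, so temperature would look unimportant even though it drives most of the variation here. That is the kind of thing preliminary plots of the data should reveal before any importance ranking is trusted.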

Another complication is your reliance on a particular set of data. Yes, you can estimate the relative importance of predictor variables on a particular data set--but would they be the same variables chosen based on a different sample of data from the same systems? In many circumstances a model fit manages to match a particular data set but fails miserably when applied to another similar situation. Validation techniques like bootstrapping can be very important to make sure that the "important" variables maintain their importance beyond the data that you have already collected.
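
One simple way to check this is sketched below in Python with NumPy: resample the rows with replacement many times, redo the drop-one-predictor ranking on each resample, and count how often each predictor comes out on top. The data are synthetic and the function names are hypothetical; a full validation would repeat the entire modeling process inside each resample.

```python
import numpy as np

def rss(A, y):
    """Residual sum of squares of a least-squares fit of y on design matrix A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def top_predictor(X, y):
    """Index of the predictor whose removal increases RSS the most."""
    n, p = X.shape
    ones = np.ones(n)
    full = rss(np.column_stack([ones, X]), y)
    gains = [
        rss(np.column_stack([ones, np.delete(X, j, axis=1)]), y) - full
        for j in range(p)
    ]
    return int(np.argmax(gains))

def bootstrap_top_frequency(X, y, n_boot=500, seed=0):
    """Fraction of bootstrap resamples in which each predictor ranks first."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        counts[top_predictor(X[idx], y[idx])] += 1
    return counts / n_boot

# Synthetic, illustrative data: two predictors with similar effects, one with none.
rng = np.random.default_rng(2)
n = 40
X = rng.normal(size=(n, 3))
y = 1.0 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=1.0, size=n)

freq = bootstrap_top_frequency(X, y)
print(freq)
```

If the top spot is split between two predictors across resamples, the data cannot reliably say which of them matters most, even if one happens to win on the original sample.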

In this particular application, it seems that you have the ability to design controlled experiments to get at the underlying issues. Use these estimates of importance of variables, based on existing data, as guides to designing solid experimental tests for optimizing your processes.

EdM
  • So basically, I need to manipulate the data to be linear. I think I know how to cross validate the data, but since I won't have that much data I'll basically need to re-use it with k-fold CV. Anyway, I'll try to look into the rms package and see if I can work some magic. – Lennart Oct 09 '15 at 15:40