Dimensionality reduction for different input data

Question

I'm using random forest regression to predict the value of a variable. The inputs that produce this value can differ, e.g. think of different recipes with different ingredients for making different kinds of bread. The ingredients can be totally different, e.g. wheat flour vs. cornmeal, fresh yeast vs. dry yeast, and the ingredient brands can differ too, so we can have different brands of wheat flour or cornmeal.

I'm trying to predict an output that I've measured previously, e.g. the volume of the dough after one hour of rest. So I split my data set into training and evaluation data, train and evaluate the model, generate new features, etc. until I'm comfortable with its results.

Now I'm planning to apply dimensionality reduction to obtain a model that uses only the important variables, but I don't know whether doing so would cost me the ability to predict for some future inputs. For example, if dimensionality reduction drops all the variables related to some recipes, I won't be able to predict the output for those recipes.
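To make the concern concrete, here is a minimal sketch of the kind of check I have in mind. It assumes the reduction is done by thresholding feature importance scores, which is only one possible way to do it. All feature names, scores, recipe groupings and the two helper functions are hypothetical and made up for illustration; in practice the scores could come from something like a fitted forest's importance values.

```python
# Hypothetical importance scores per feature (made-up values for illustration).
importances = {
    "wheat_flour_g": 0.40,
    "fresh_yeast_g": 0.25,
    "water_ml": 0.20,
    "flour_brand_A": 0.05,
    "cornmeal_g": 0.04,
    "dry_yeast_g": 0.03,
    "flour_brand_B": 0.03,
}

# Hypothetical mapping from each recipe type to the features specific to it.
recipe_features = {
    "wheat_bread": {"wheat_flour_g", "fresh_yeast_g", "water_ml", "flour_brand_A"},
    "corn_bread": {"cornmeal_g", "dry_yeast_g"},
}


def select_features(importances, threshold):
    """Keep the features whose importance is at least `threshold`."""
    return {name for name, score in importances.items() if score >= threshold}


def uncovered_recipes(recipe_features, kept):
    """Return recipes for which none of their specific features survived."""
    return sorted(
        recipe for recipe, feats in recipe_features.items() if not (feats & kept)
    )


kept = select_features(importances, threshold=0.05)
lost = uncovered_recipes(recipe_features, kept)
print("kept features:", sorted(kept))
print("recipes with no remaining features:", lost)
```

With these made-up numbers, a threshold of 0.05 removes every feature specific to the corn bread recipe, which is exactly the situation I'm worried about. A lower threshold keeps them, but then the reduction removes less.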

So my question is: what is the best approach to dimensionality reduction for this kind of problem? Or should I change my approach to the problem altogether?

