Which techniques can I use to select the important variables that I should keep?

Question

In a project that I'm working on, there are 5 identical soil humidity sensors placed at the exact same point but at different depths (20, 40, 60, 80 and 100 cm from the surface, respectively). Each sensor therefore measures the humidity level at its own depth, providing information that is later used in a prediction model (which also includes some other variables). As conditions vary so little from one sensor to the next, there is little difference and a high linear correlation between their readings.

How can I select a subset of these sensors (without having a response variable), or demonstrate that the variables contain redundant information (if they do) so that I can dispense with some of them, considering that the fewer sensors I use, the cheaper the system gets?

?cm I presume. And, BTW, you cannot determine what contributes to a model without having a model. When you include a predicted (dependent) variable, then you can use ANOVA to determine which independent parameters (sensors) can be dispensed with. — Carl, Oct 25 '16 at 04:07

score 1 · Accepted Answer · answered Oct 25 '16 at 10:17

Without examining the relation of the candidate predictors to the response, you can perform a redundancy analysis to determine how well each can be predicted from the others, or from subsets of the others.

The idea is to regress $x_1$ on $x_2, \ldots, x_5$, then $x_2$ on $x_1, x_3, \ldots, x_5$, & so on. A high coefficient of determination for a regression with $x_i$ as response suggests $x_i$ might be considered redundant. You can follow a stepwise procedure of removing the most redundant variable & repeating the analysis—but be sure to continue checking how well excluded variables are predicted by reduced subsets. It would usually be sensible to allow for curvilinear relations, say by using a spline basis function for predictors. Have a look at the redun function in Frank Harrell's Hmisc package for R.
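As a rough illustration of the procedure just described, here is a minimal Python sketch. It is not the `redun` function: it uses plain linear least squares (no spline terms), and the function names, the 0.9 threshold and the simulated sensor data are all assumptions made for the example.

```python
import numpy as np

def r_squared(X, j, keep):
    """R^2 from a linear regression (with intercept) of column j on the other columns in `keep`."""
    others = [k for k in keep if k != j]
    A = np.column_stack([np.ones(len(X))] + [X[:, k] for k in others])
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    total = X[:, j] - X[:, j].mean()
    return 1.0 - (resid @ resid) / (total @ total)

def redundancy_analysis(X, threshold=0.9):
    """Repeatedly drop the variable best predicted by the remaining ones (threshold is an assumed cut-off)."""
    keep = list(range(X.shape[1]))
    dropped = []
    while len(keep) > 1:
        scores = {j: r_squared(X, j, keep) for j in keep}
        most_redundant = max(scores, key=scores.get)
        if scores[most_redundant] < threshold:
            break
        keep.remove(most_redundant)
        dropped.append(most_redundant)
    # Re-check how well every excluded variable is predicted by the final reduced subset.
    recheck = {j: r_squared(X, j, keep + [j]) for j in dropped}
    return keep, recheck

# Simulated data (an assumption): four shallow sensors track a common signal,
# while the deepest sensor (column 4) varies independently of it.
rng = np.random.default_rng(0)
common = rng.normal(size=200)
shallow = common[:, None] + 0.05 * rng.normal(size=(200, 4))
deep = rng.normal(size=(200, 1))
X = np.hstack([shallow, deep])

keep, recheck = redundancy_analysis(X)
print("retained columns:", keep)
print("R^2 of dropped columns given retained ones:", recheck)
```

The final re-check matters: a variable dropped early was judged against a larger set of predictors than the one that finally remains, so its predictability should be confirmed against the reduced subset.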

There aren't any guarantees, of course, that slight variations in the profile of soil humidity by depth aren't highly predictive of whatever it is you want to predict—that's a possibility you need to be confident in rejecting before carrying on with any form of data reduction.

answered Oct 25 '16 at 10:17

Scortchi - Reinstate Monica

2

Because this answer appears not to use the prediction model at all, and it's the model that matters to the OP, it's difficult to see why this procedure would work. As a concrete example of how it could fail dramatically, suppose the prediction model depends on--and is exquisitely sensitive to--the *moisture differences* between the deep ($x_5$) and surface ($x_1$) soils. Suppose that $x_i = 1 - i x_5$ for $i=1,2,3,4$: each with maximal coefficient of determination. Then you would eliminate all the sensors except for $x_5$--from which (alone) the essential information cannot be recovered. – whuber Oct 25 '16 at 16:45
2

@whuber: Thanks! That's the kind of possibility I was trying to get at in my last paragraph, though the caveat was vague & woefully understated. Nick Stauner has compiled some real-life examples in the similar context of principal component regression: [Low variance components in PCA, are they really just noise? Is there any way to test for it?](http://stats.stackexchange.com/a/87231/17230). Furthermore, in this case, even if a relation between the response & humidity gradient is unlikely, it seems a little silly to completely ignore the quantitative information about the depth of each sensor – Scortchi - Reinstate Monica Oct 26 '16 at 08:36
This answer does address the New User's constraints. It is clearly stated, in the second paragraph, that there is no response variable. So how can you do a regression without a response variable? – grldsndrs Oct 27 '16 at 07:32
Thanks for all the comments! But bearing in mind that each sensor is highly linearly correlated with the others, except the one at the deepest position, the results show a high coefficient of determination for whichever sensor I choose to regress. – NewUser Oct 31 '16 at 15:29
Something I should've said in the main question is that the selection aims to recover the info for all the sensors just from the ones that have been chosen, once they are installed in other similar fields. In summary: (1) select fewer sensors from a group of 4 or 5; (2) install the subgroup in other equivalent fields; (3) from the subgroup of sensors, generate the data of the missing sensors. @whuber – NewUser Oct 31 '16 at 15:38
1

(1) I don't understand the "but" in your first comment - what you describe doesn't present any problem. (2) The regressions from the redundancy analysis can be used to predict observations from missing sensors - if you can assume that the same relationships hold in different fields. (3) Do bear @whuber's comment in mind: it's only subject-matter knowledge that can allow you to dismiss residuals from one of these regressions as unimportant. Would a pilot study in which you actually measure the response be possible? [BTW you can edit your q. to include this info.] – Scortchi - Reinstate Monica Oct 31 '16 at 17:38
Thanks for the answers @Scortchi. One more thing: do you know of any papers/articles where this application of regression has been used before? – NewUser Nov 02 '16 at 00:14
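
Point (2) in the comment above, predicting the omitted sensors from the retained ones, might be sketched as follows. This is only an illustration under strong assumptions: the relationship is taken to be linear and identical in the pilot field and the new field, the function names are hypothetical, and all data are simulated.

```python
import numpy as np

def fit_reconstruction(X_pilot, kept, missing):
    """Least-squares fit (with intercept) of the missing sensors on the kept ones, using pilot data."""
    A = np.column_stack([np.ones(len(X_pilot)), X_pilot[:, kept]])
    coef, *_ = np.linalg.lstsq(A, X_pilot[:, missing], rcond=None)
    return coef  # shape: (1 + number kept, number missing)

def reconstruct(X_kept, coef):
    """Predict the missing sensors' readings from the kept sensors' readings."""
    A = np.column_stack([np.ones(len(X_kept)), X_kept])
    return A @ coef

def simulate_field(rng, n=300):
    """Simulated humidity profile (an assumption): linear between surface and deep values, plus noise."""
    surface = rng.uniform(10, 40, size=n)
    deep = rng.uniform(20, 50, size=n)
    w = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # relative depth of each of the 5 sensors
    profile = surface[:, None] * (1 - w) + deep[:, None] * w
    return profile + 0.2 * rng.normal(size=(n, 5))

kept, missing = [0, 4], [1, 2, 3]

# Pilot field: all five sensors installed, so the regressions can be fitted.
pilot = simulate_field(np.random.default_rng(1))
coef = fit_reconstruction(pilot, kept, missing)

# New field: only the kept sensors would be installed; the other columns are
# simulated here purely to check the reconstruction error.
new_field = simulate_field(np.random.default_rng(2))
predicted = reconstruct(new_field[:, kept], coef)
rmse = np.sqrt(np.mean((predicted - new_field[:, missing]) ** 2))
print("reconstruction RMSE:", rmse)
```

Whether residuals of this size can be ignored is a subject-matter question, as point (3) says; the sketch only shows the mechanics.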

grldsndrs · Answer 2 · 2016-10-29T01:45:03.773

-3

Principal Component Analysis:

If you use PCA to solve this, you could treat each sensor as a dimension in the PCA and then use the principal component (sensor/sensors) to infer the depth/depths at which to place the acceptable number of sensors.

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.

This transformation is defined in such a way that

the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and

each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
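
To make the definition above concrete, here is a small sketch that computes principal components with NumPy's SVD on simulated, highly correlated sensor readings (the data and variable names are assumptions for illustration). It prints the share of variance carried by each component and the loadings of the first component, which show how much each original sensor contributes to it.

```python
import numpy as np

# Simulated data (an assumption): five sensors that all track one common signal.
rng = np.random.default_rng(0)
n = 500
common = rng.normal(size=n)
X = common[:, None] + 0.1 * rng.normal(size=(n, 5))

# PCA via singular value decomposition of the column-centred data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

component_variance = S**2 / (n - 1)           # variance along each principal component
explained_ratio = component_variance / component_variance.sum()
loadings = Vt                                 # row i holds the weights of component i on the 5 sensors
scores = Xc @ Vt.T                            # the data expressed in the component basis

print("share of variance per component:", np.round(explained_ratio, 4))
print("loadings of the first component:", np.round(loadings[0], 3))
```

With data like these, the first component carries almost all the variance, yet its loadings are spread roughly evenly over all five sensors, so computing its score still needs a reading from every sensor.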

edited Oct 29 '16 at 01:45

answered Oct 25 '16 at 02:37

grldsndrs

If these factors are then used in a model, it becomes Principal Component Regression (PCR): https://en.wikipedia.org/wiki/Principal_component_regression – T3am5hark Oct 25 '16 at 02:44
I'm not sure why this was downvoted, this seems like a perfectly valid use of PCA. Possibly because it doesn't address using less sensors? – Matthew Drury Oct 25 '16 at 03:58
Actually, Matthew, the answer does address using less sensors. "The first PCA has the largest possible variance accounting for as much of the variability as possible". I left it to the reader to extract the meaning that one could take the first components, leaving those that don't account for significant variance. So I too would like to know why the downvote was cast. – grldsndrs Oct 25 '16 at 05:00
2

Each component is defined in terms of the measurements from *all* sensors, so discarding components doesn't mean you get away with fewer sensors. – Scortchi - Reinstate Monica Oct 25 '16 at 08:33
2

In a plain or vanilla PCA the number of PCs is the same as the number of input variables. Whether you choose to use all of them is a different question. – Nick Cox Oct 25 '16 at 10:20
@Scortchi. I am not sure where you're getting that the components are composites of all sensors. I actually have some domain knowledge here. I have used such soil moisture detection sensors and in my experience each sensor measurement is independent of any other. This is how I read the OP. As such, finding the principal component sensors allows one to rely on those sensors, while disregarding the others. The question is really which depths are most important. New User wants to get rid of sensors, not data. So my approach is from a hardware POV. – grldsndrs Oct 26 '16 at 01:19
1

Surely there's not much point to a principal component analysis of independently distributed variables. In any case it'd be better to explain all this in your answer, which I've just noticed is a cut-&-paste from a Wikipedia article. – Scortchi - Reinstate Monica Oct 26 '16 at 08:55
1

Although the sensor measurements might be *physically* independent, they are not statistically independent: there will be patterns resulting from the physical proximity of the sensors and relatively slow vertical changes in humidity with depth. Indeed, how to identify those patterns and using them to eliminate one or more sensors is the entire point of this thread. – whuber Oct 26 '16 at 13:59
Ok, so before I elaborate on my answer, which I agree should have included the points I make in my comments and which is really a pointer to the deeper explanation on Wikipedia, I would like to ask Scortchi and whuber this question. If the sensors were placed in different boxes of soil, each at their respective depths as in the OP, would you consider their data to be statistically independent then? I suspect that you might, but if my suspicions are wrong, can you say why? I submit to you that such sensors will function exactly the same whether or not there is another above them. – grldsndrs Oct 27 '16 at 07:17
It's not being suggested that the sensors influence each other, & need to be prevented from doing so; but that (as the OP says) the observations made with them are highly correlated - e.g. after rain they all give relatively high readings, while after a dry spell they all give relatively low readings. – Scortchi - Reinstate Monica Oct 27 '16 at 08:41
1

Yes highly correlated and linear, because they are close in proximity. So sensor3's data lags behind sensor2's data which lags behind sensor1's data (assuming sensor1 is closest to the surface). Why can't PCA be used to find out which sensor is redundant? The data are all likely to contain the same information, but with a time shift corresponding to depth. – grldsndrs Oct 27 '16 at 09:03
That time-shift idea's interesting, & yet another reason why my own answer is far too pat. (Though it's not quite clear from the question whether there's one set of five sensors with observations made at different times or many sets of five sensors with observations made at different locations.) Still, if you've thought of a way PCA could help, which isn't a possibility I'm dismissing out of hand, you need to explain/illustrate it explicitly. – Scortchi - Reinstate Monica Oct 27 '16 at 12:11
Ok, I think it makes sense to use each sensor as a dimension in a PCA and use the principal component (sensor/sensors) to infer the depth/depths at which to place the sensor/sensors. – grldsndrs Oct 29 '16 at 01:40
