Examples of Lurking Variable and Influential Observation

Question

I have read possible explanations for Lurking Variables and Influential Observations but I can't seem to construct a good example for myself.

A well-designed experiment includes design features that allow researchers to eliminate extraneous variables as an explanation for the observed relationship between the independent variable(s) and the dependent variable. These extraneous variables are called lurking variables.

It is possible for a single observation to have a great influence on the results of a regression analysis. Such a variable is an influential variable

For IV I created an example: assume a residential building, where the X axis is the number of years spent in education and the Y axis is the earnings. We expect a nicely rising graph (or whatever), but suddenly there is a guy who has studied very little but earns a really high amount. He is an Influential Variable. Correct?

You seem to be mixing up "observation" and "variable"; read the second italicized paragraph, which switches from talking about a "single observation" to describing it as "such a variable". You might want to clear up the terminology so we are all on the same page as to what refers to what. — jbowman, Jul 24 '12 at 19:52
@jbowman, Thanks for pointing that out. But then if I have only 2 variables and say, hundred observations for each; what would be a good example to understand them? — , Jul 24 '12 at 20:48
No, I think what jbowman meant is that the guy in your last paragraph could be an influential *observation*, never a *variable* — user603, Jul 24 '12 at 20:57
Single observations have less influence when the sample size is large. Influential observations depend on the parameter being estimated. I will give you a simple example from my 1982 paper. — Michael R. Chernick, Jul 24 '12 at 21:12

score 1 · Accepted Answer · answered Jul 24 '12 at 22:05

My 1982 paper "The Influence Function and Its Application to Data Validation" in the American Journal of Mathematical and Management Sciences was judged the best theoretical paper in that journal for the year 1982 and as a consequence I was awarded the Jacob Wolfowitz Prize for 1983.

The paper deals with Hampel's influence function and the way it can be used to detect outliers. In my case I was considering multivariate outliers. Data validation was a concern for the Department of Energy's databases at that time, and my argument was that outliers that affect estimates important to the intended users of the database should be emphasized and detected. There are many distance functions that can be used to determine multivariate outliers. I proposed using the influence function for a parameter of interest to the user of the data as the metric.

Hampel's influence function depends on the parameter being estimated and the multivariate data point being considered. I took a simple but important and illustrative case. For bivariate data (X$_1$,X$_2$) consider the correlation between the pair and the influence of a particular single point (x$_1$,x$_2$). Formally, Hampel's influence function is a directional derivative. Informally, as Mallows pointed out, it essentially represents the difference between an estimate of the parameter based on an entire sample that includes (x$_1$,x$_2$) and the estimate based on the sample that contains every other point but leaves (x$_1$,x$_2$) out.
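The informal "with the point versus without the point" reading can be sketched in a few lines of Python. This is only an illustration, not the Fortran program from the paper: the function name `leave_one_out_corr_change` and the synthetic data (35 well-behaved points plus one planted gross outlier) are assumptions made up for the example.

```python
import numpy as np

def leave_one_out_corr_change(x1, x2):
    """Return the full-sample correlation and, for each point i,
    r(full sample) - r(sample with point i left out)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    r_full = np.corrcoef(x1, x2)[0, 1]
    n = len(x1)
    changes = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i          # boolean mask dropping point i
        changes[i] = r_full - np.corrcoef(x1[keep], x2[keep])[0, 1]
    return r_full, changes

# Synthetic data (an assumption, not the FPC Form 4 data):
# 35 positively correlated points plus one planted outlier.
rng = np.random.default_rng(0)
x1 = rng.normal(size=35)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=35)
x1 = np.append(x1, 4.0)    # planted outlier: high x1 ...
x2 = np.append(x2, -4.0)   # ... paired with a very low x2

r_full, changes = leave_one_out_corr_change(x1, x2)
most_influential = int(np.argmax(np.abs(changes)))
print(f"r with all points: {r_full:.2f}")
print(f"most influential index: {most_influential}, change: {changes[most_influential]:.2f}")
```

The point whose removal moves the correlation the most is the one to check first. With a large sample each individual change shrinks, which is why single observations matter less as the sample size grows.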

For the bivariate correlation you can do the formal mathematics and show that the influence function (which is closely related to the influence on the slope parameter of a simple linear regression of, say, X$_2$ on X$_1$) has contours of constant value that are hyperbolae.
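As a rough sketch of why the contours are hyperbolae, the commonly cited closed form of the influence function for the Pearson correlation is $z_1 z_2 - \tfrac{1}{2}\rho\,(z_1^2 + z_2^2)$, where $z_1$ and $z_2$ are the standardized coordinates of the point. Setting this equal to a constant gives a quadratic curve that is a hyperbola whenever $|\rho| < 1$. The Python below plugs sample estimates into that formula; the function name `corr_influence` and the synthetic data are assumptions for illustration, and whether the paper's program uses exactly this form is not stated in this answer.

```python
import numpy as np

def corr_influence(x1, x2, p1, p2):
    """Sample version of the correlation influence function at (p1, p2):
    z1*z2 - 0.5*r*(z1**2 + z2**2), with z the standardized coordinates."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    r = np.corrcoef(x1, x2)[0, 1]
    z1 = (p1 - x1.mean()) / x1.std()
    z2 = (p2 - x2.mean()) / x2.std()
    return z1 * z2 - 0.5 * r * (z1 ** 2 + z2 ** 2)

# Synthetic positively correlated data (an assumption for the demo).
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.5 * x1 + rng.normal(size=200)

m1, m2 = x1.mean(), x2.mean()
s1, s2 = x1.std(), x2.std()

at_mean = corr_influence(x1, x2, m1, m2)                      # centre of the data
along_trend = corr_influence(x1, x2, m1 + 3 * s1, m2 + 3 * s2)  # upper right corner
against_trend = corr_influence(x1, x2, m1 - 3 * s1, m2 + 3 * s2)  # low x1, high x2

print(at_mean, along_trend, against_trend)
```

Evaluating `corr_influence` on a grid (for example one built with `np.meshgrid`) and passing the result to a contour plot over the scatter diagram gives the kind of picture described in this answer.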

Take a scatter diagram of the data and superimpose these contours. You will see them move out from low values to high values, similar to how temperature or pressure contours might look on a map. The direction of greatest increase is the direction to look for the most influential observations, and the contours tell you the value of the influence at any point of interest. I illustrated this using the DOE's FPC Form 4 data, which provides data at power plants comparing energy consumption (possibly of coal) to electricity generation. There is reasonable positive correlation between the two. Estimates from the data I had were about 0.48.

I included two figures (one for each of the two plants), each showing a scatter plot with a high influence contour superimposed. Based on this, each plant contained three outliers (based on high influence). The points of lowest influence are near the mean of the bivariate sample, with 0 influence at the sample mean vector (which always lies on the least squares regression line). The outliers tended to be in the upper right corner of the scatter plot (3 points there) or at very high values of X$_2$ with very low values of X$_1$. Removing one outlier (the one with the highest estimated influence) from a sample of 36 points changed the estimated correlation from 0.47 to 0.77 at plant A, and removing the most influential outlier at plant B changed it from 0.48 to 0.85.

We found that one of the outliers had a consumption value of 93 and a generation of 330 (I did not give the units). This was a very large consumption for a generation of 330. In checking with the plant we discovered an error. The generation was 330 but the consumption should have been only one unit. This could have been a recording slip of two decimal places. There is a lot more detail in the paper, along with a computer program to generate bivariate correlation influence function estimates from a given data set.

This work was actually done in 1979. At that time computers were very slow (relative to today) and we programmed in Fortran, so the code in the paper is in Fortran. I think the paper is accessible over the internet and I will try to find a link for it.
