
Assume the data follow $y_i=f(x_i)+\varepsilon_i$, where the $\varepsilon_i$ are i.i.d. with zero mean and variance $\sigma^2$. The local polynomial regression solves $$ {\min_{\alpha(x_0),\beta_j(x_0),\,j=1,\cdots,d}}{\sum\limits_{i=1}^{N}}{K_\lambda}({x_0},x_i)\left[y_i-\alpha({x_0})-{\sum\limits_{j=1}^{d}}\beta_j({x_0})x_i^j\right]^2. $$ The solution is \begin{eqnarray*} \hat{f}(x_0)&=&\hat{\alpha}(x_0)+\sum\limits_{j=1}^{d}\hat{\beta}_j(x_0)x_0^j \\ &=& b(x_0)^T(B^TW(x_0)B)^{-1}B^TW(x_0)y \\ &=& \sum_{i=1}^{N}l_i(x_0)y_i \end{eqnarray*} where $b(x)^T=(1 \; x \; x^2 \; \cdots \; x^d)$, $B$ is an $N \times (d+1)$ matrix whose $i$th row is $b(x_i)^T$, $W(x_0)$ is an $N \times N$ diagonal matrix with $i$th diagonal element ${K_\lambda}({x_0},x_i)$, and ${K_\lambda}({x_0},x_i)$ is a kernel function.

It's easy to see that $\mathrm{Var}[\hat{f}(x_0)]=\sigma^2\|l(x_0)\|^2$. My question is how to show that $\|l(x_0)\|$ increases with $d$.
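For concreteness, here is a small numerical sketch (synthetic design points and a Gaussian kernel, both chosen purely for illustration) that computes $l(x_0)$ directly from the closed form above and prints $\|l(x_0)\|$ for increasing $d$:

```python
import numpy as np

# Synthetic design points and a Gaussian kernel -- both are assumptions
# made only for illustration; any kernel K_lambda would do.
rng = np.random.default_rng(0)
N = 50
x = np.sort(rng.uniform(-1, 1, N))
x0, lam = 0.1, 0.4

w = np.exp(-0.5 * ((x - x0) / lam) ** 2)   # K_lambda(x0, x_i)
W = np.diag(w)

for d in range(5):
    B = np.vander(x, d + 1, increasing=True)                        # i-th row is b(x_i)^T
    b0 = np.vander(np.array([x0]), d + 1, increasing=True).ravel()  # b(x0)
    # l(x0)^T = b(x0)^T (B^T W B)^{-1} B^T W
    l = b0 @ np.linalg.solve(B.T @ W @ B, B.T @ W)
    print(d, np.linalg.norm(l))
```

The printed norms should be non-decreasing in $d$, but of course a printout is not a proof.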

tankeco
  • Some conclusions may help: $\sum_{i=1}^{N}l_i(x_0)=1$ and $\sum_{i=1}^{N}l_i(x_0)(x_i-x_0)^k=0$ for $k=1,\cdots,d$. – tankeco Dec 15 '13 at 10:15

2 Answers


I'll try to expand on the answer by @tankeco. For simplicity, we consider unweighted least squares regression, i.e., with weight matrix $W=I$. Also, I believe the conclusion below holds for general least squares regression, not just polynomial least squares regression.

Suppose we have an $N\times d$ design matrix $X_d$ and an $N$-vector response $y$, and we fit $\hat{\beta}_d$ by least squares. The variance of the predicted value $\hat{f}_d(x_0)$ at an arbitrary $x_0\in\mathbb{R}^d$ is $\text{var}(\hat{f}_d(x_0))=x_0^T(X_d^TX_d)^{-1}x_0$ (assuming WLOG that $\sigma^2=1$). Now we add an additional predictor $x_{d+1}\in\mathbb{R}^N$, so we have an augmented $N\times(d+1)$ design matrix $X=[X_d,x_{d+1}]$. We fit $\hat{\beta}$ by least squares again, and the variance of the predicted value $\hat{f}(x_0')$ at $x_0'=(x_0^T,w)\in\mathbb{R}^{d+1}$ is $\text{var}(\hat{f}(x_0'))=x_0'^T(X^TX)^{-1}x_0'$. The question is how to show that $\text{var}(\hat{f}(x_0'))\geq\text{var}(\hat{f}_d(x_0))$, i.e., that the variance increases when a predictor is added. Since $\text{var}(\hat{f}(x_0))=\|l(x_0)\|^2$ in the original problem, this shows that $\|l(x_0)\|^2$ increases with the dimension $d$.
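As a quick sanity check of this claim (a random design matrix and a random $x_0$, chosen only for illustration), one can compare the two quadratic forms directly:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 30, 4
X_d = rng.normal(size=(N, d))           # original design matrix
x_new = rng.normal(size=N)              # the added predictor column
X = np.column_stack([X_d, x_new])       # augmented N x (d+1) design matrix

x0 = rng.normal(size=d)
var_d = x0 @ np.linalg.solve(X_d.T @ X_d, x0)          # var of f_d-hat at x0 (sigma^2 = 1)
for w in (-2.0, 0.0, 1.5):                             # any value of the new coordinate
    x0p = np.append(x0, w)
    var_aug = x0p @ np.linalg.solve(X.T @ X, x0p)      # var of f-hat at (x0, w)
    print(var_d <= var_aug + 1e-12)                    # expect True each time
```

The inequality holds for every value of the new coordinate $w$; the argument below shows why this is the case in general.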

First, we look at the two optimization problems $$\min_{\beta}(y-X_d\beta)^T(y-X_d\beta)\quad\text{and}\quad\min_{\beta}(y-X\beta)^T(y-X\beta).$$ Setting $\beta_{d+1}=0$ in the latter recovers the former, so the latter problem in a sense includes the former, and hence its optimized value is smaller. (This is $L_{d+1}\leq L_d$ in @tankeco's answer.) In equations, $$(y-X_d\hat{\beta}_d)^T(y-X_d\hat{\beta}_d)\geq(y-X\hat{\beta})^T(y-X\hat{\beta}).$$

Now using $X\hat{\beta}=Hy$, where $H=X(X^TX)^{-1}X^T$, and similarly for $H_d$, we see that $$y^T(I-H_d)^T(I-H_d)y\geq y^T(I-H)^T(I-H)y.$$ As a projection matrix, $H$ is symmetric and idempotent, and so is $I-H$, hence $$y^T(I-H_d)y\geq y^T(I-H)y,$$ or equivalently, $$y^TH_dy\leq y^THy.$$ If you're familiar with the geometry of least squares projections, another way to view this is through the identity $$\|y\|^2=\|\hat{y}\|^2+\|y-\hat{y}\|^2.$$ With more predictors, $\|y-\hat{y}\|^2$ (the residual sum of squares) gets smaller, so $\|\hat{y}\|^2$ gets larger. Now expressing $H_d$ and $H$ in terms of $X_d$ and $X$, we have $$y^TX_d(X_d^TX_d)^{-1}X_d^Ty\leq y^TX(X^TX)^{-1}X^Ty.$$

Recall that we wish to show, for an arbitrary $x_0'=(x_0^T,w)\in\mathbb{R}^{d+1}$, that $\text{var}(\hat{f}(x_0'))\geq\text{var}(\hat{f}_d(x_0))$, or equivalently, $$x_0'^T(X^TX)^{-1}x_0'\geq x_0^T(X_d^TX_d)^{-1}x_0.$$ It's important to note that the inequality above involving $y$ holds for any $y\in\mathbb{R}^N$, and also that $X$ is assumed to have full rank (with $d+1\leq N$). This means that for any $x_0'\in\mathbb{R}^{d+1}$, there exists a $y\in\mathbb{R}^N$ such that $x_0'=X^Ty$. Now substitute this $y$ in and the proof is complete.
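To make the final substitution explicit: pick $y$ with $X^Ty=x_0'$ (possible since $X^T$ has full row rank $d+1$). Because $X_d$ consists of the first $d$ columns of $X$, the vector $X_d^Ty$ is just the first $d$ components of $X^Ty=x_0'$, namely $x_0$, so the displayed inequality specializes to $$x_0^T(X_d^TX_d)^{-1}x_0 = y^TX_d(X_d^TX_d)^{-1}X_d^Ty \;\leq\; y^TX(X^TX)^{-1}X^Ty = x_0'^T(X^TX)^{-1}x_0',$$ which is exactly the claimed variance inequality.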

I believe the weighted case can be handled similarly using the arguments above.

Jason

\begin{eqnarray*} L_d&=&\min_{\theta} \sum_{i=1}^{N}K_{\lambda}(x_0,x_i)\left[y_i-\alpha(x_0)-\sum_{j=1}^{d}\beta_j(x_0)x_i^j\right]^2 \\ &=&\min_{\theta}\,(Y-B\theta)^TW(x_0)(Y-B\theta) \end{eqnarray*} where $Y=[y_1\,\,y_2\,\,\cdots\,\,y_N]^T$ and $\theta=[\alpha(x_0)\,\,\beta_1(x_0)\,\,\beta_2(x_0)\,\,\cdots\,\,\beta_d(x_0)]^T$.

Using the fact that $L_{d+1}\le L_{d}$ (the degree-$(d+1)$ fit minimizes over a larger parameter set), it's easy to prove the claim.
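For what it's worth, here is a small numerical sketch (synthetic data and a Gaussian kernel, chosen only for illustration) that computes $L_d$ for increasing $d$ and shows it is non-increasing:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50
x = np.sort(rng.uniform(-1, 1, N))
y = np.sin(3 * x) + 0.1 * rng.normal(size=N)   # synthetic response, for illustration only
x0, lam = 0.1, 0.4
w = np.exp(-0.5 * ((x - x0) / lam) ** 2)       # K_lambda(x0, x_i), Gaussian kernel
W = np.diag(w)

for d in range(5):
    B = np.vander(x, d + 1, increasing=True)             # i-th row is b(x_i)^T
    theta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)     # weighted least squares fit
    r = y - B @ theta
    print(d, r @ W @ r)                                   # L_d, the minimized criterion
```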

tankeco