Question:
When we say the correlation is 1 in both Ridge and Elastic Net, does it only mean $x_1 = x_2$?
Story:
Ridge tends to allocate similar coefficients to highly correlated features, which is mentioned here:
Why Lasso or ElasticNet perform better than Ridge when the features are correlated
Recently I saw a related result about Elastic Net in the LASSO article on Wikipedia:
https://en.wikipedia.org/wiki/Lasso_(statistics)#cite_note-Zou_2005-5
The Elastic Net problem can fold its Ridge part into the least-squares term and thereby become an equivalent LASSO. I believe the result above must be related to Ridge.
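For reference, my understanding of that conversion (the data-augmentation trick from Zou and Hastie, 2005) is:
$$\min_\beta \|y - X\beta\|_2^2 + \lambda_2\|\beta\|_2^2 + \lambda_1\|\beta\|_1
= \min_\beta \|y^* - X^*\beta\|_2^2 + \lambda_1\|\beta\|_1,$$
where
$$X^* = \begin{pmatrix} X \\ \sqrt{\lambda_2}\,I \end{pmatrix}, \qquad y^* = \begin{pmatrix} y \\ 0 \end{pmatrix},$$
so the Ridge penalty is absorbed into an ordinary least-squares term and the problem becomes a LASSO on $(X^*, y^*)$.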
Analysis:
Intuitively if $x_1=x_2,$ their correlation is 1. Then $$\beta_1x_1 + \beta_2x_2=(\beta_2+\beta_1)x_1.$$
Assume the sum is constant: $\beta_1+\beta_2 = C.$ To minimize the $L_2$-norm $\beta_1^2+\beta_2^2$ subject to this constraint, the optimal solution is $$\beta_1 = \beta_2$$ (by symmetry, or a Lagrange-multiplier argument). This deduction is also used to partially prove the sparsity behavior of LASSO on correlated features.
However, when $x_2=2x_1,$ their correlation is still 1. In this case $\beta_1x_1 + \beta_2x_2=(\beta_1+2\beta_2)x_1,$ and minimizing $\beta_1^2+\beta_2^2$ subject to $\beta_1+2\beta_2 = C$ gives $$\beta_2 = 2\beta_1.$$
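The constrained minimization above can also be checked numerically. A small sketch (the helper name `min_l2_on_plane` is my own): the minimizer of $\|\beta\|_2^2$ subject to $a^\top\beta = C$ is the projection $\beta = C\,a/\|a\|_2^2$ onto the constraint plane.

```python
import numpy as np

# Minimize b1^2 + b2^2 subject to a1*b1 + a2*b2 = C.
# The closed-form minimizer is b = C * a / ||a||^2
# (orthogonal projection of the origin onto the constraint plane).
def min_l2_on_plane(a, C):
    a = np.asarray(a, dtype=float)
    return C * a / np.dot(a, a)

# Case x1 = x2: constraint is b1 + b2 = C  ->  b1 = b2
print(min_l2_on_plane([1, 1], C=1.0))  # [0.5 0.5]

# Case x2 = 2*x1: constraint is b1 + 2*b2 = C  ->  b2 = 2*b1
print(min_l2_on_plane([1, 2], C=1.0))  # [0.2 0.4]
```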
So, when we say the correlation is 1 in both Ridge and Elastic Net, does it only mean $x_1 = x_2$?
Test:
I tested the example in sklearn. The results are consistent with my conclusion:
```python
from sklearn.linear_model import Ridge
import numpy as np

n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)

# Make column 1 an exact copy of column 0 (correlation 1, equal scale).
X[:, 1] = X[:, 0]
print('X[1] = X[0]:')
for x in range(10):
    clf = Ridge(alpha=x * 0.1, fit_intercept=False)
    clf.fit(X, y)
    print(clf)
    print(clf.coef_[0:2])

# Make column 1 twice column 0 (still correlation 1, different scale).
X[:, 1] = 2 * X[:, 0]
print('X[1] = 2*X[0]:')
for x in range(10):
    clf = Ridge(alpha=x * 0.1, fit_intercept=False)
    clf.fit(X, y)
    print(clf)
    print(clf.coef_[0:2])
```
Here are the results; they seem to confirm the conclusion.
If $x_1=x_0,$ we have $\beta_1 = \beta_0:$
```
Ridge(alpha=0.0, fit_intercept=False)
[0.03862135 0.03862135]
Ridge(alpha=0.1, fit_intercept=False)
[0.02039476 0.02039476]
Ridge(alpha=0.2, fit_intercept=False)
[0.00455329 0.00455329]
Ridge(alpha=0.30000000000000004, fit_intercept=False)
[-0.00931821 -0.00931821]
Ridge(alpha=0.4, fit_intercept=False)
[-0.02154448 -0.02154448]
Ridge(alpha=0.5, fit_intercept=False)
[-0.03238315 -0.03238315]
Ridge(alpha=0.6000000000000001, fit_intercept=False)
[-0.04204119 -0.04204119]
Ridge(alpha=0.7000000000000001, fit_intercept=False)
[-0.05068676 -0.05068676]
Ridge(alpha=0.8, fit_intercept=False)
[-0.0584579 -0.0584579]
Ridge(alpha=0.9, fit_intercept=False)
[-0.06546893 -0.06546893]
```
If $x_1=2x_0,$ we have $\beta_1 = 2\beta_0$ (the $\alpha=0$ row is OLS on exactly collinear columns, so the solution is not unique and those huge numbers are numerical noise):
```
Ridge(alpha=0.0, fit_intercept=False)
[-1.27062478e+15 6.35312388e+14]
Ridge(alpha=0.1, fit_intercept=False)
[0.0082309 0.01646181]
Ridge(alpha=0.2, fit_intercept=False)
[0.00185263 0.00370526]
Ridge(alpha=0.30000000000000004, fit_intercept=False)
[-0.00381988 -0.00763976]
Ridge(alpha=0.4, fit_intercept=False)
[-0.00889342 -0.01778684]
Ridge(alpha=0.5, fit_intercept=False)
[-0.01345436 -0.02690873]
Ridge(alpha=0.6000000000000001, fit_intercept=False)
[-0.01757333 -0.03514665]
Ridge(alpha=0.7000000000000001, fit_intercept=False)
[-0.02130859 -0.04261718]
Ridge(alpha=0.8, fit_intercept=False)
[-0.02470868 -0.04941737]
Ridge(alpha=0.9, fit_intercept=False)
[-0.02781434 -0.05562869]
```
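As a further sanity check (my own extension, not from the linked posts): the same Lagrange argument predicts $\beta_1 = c\,\beta_0$ whenever $x_1 = c\,x_0$, for any scale factor $c$. A quick sketch with $c=3$:

```python
from sklearn.linear_model import Ridge
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 10, 5
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)

c = 3.0
X[:, 1] = c * X[:, 0]  # column 1 is c times column 0

clf = Ridge(alpha=0.5, fit_intercept=False)
clf.fit(X, y)
b0, b1 = clf.coef_[0], clf.coef_[1]
print(b0, b1, b1 / b0)  # ratio should be (numerically) equal to c = 3
```

The ratio is exact in theory: the stationarity conditions give $x_0^\top r = \alpha\beta_0$ and $c\,x_0^\top r = \alpha\beta_1$ for the residual $r$, hence $\beta_1 = c\,\beta_0$.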