I wanted to test some basic OLS behavior in Python with a sample dataset. One thing I wanted to try was getting a bootstrap estimate of the variance of the model's coefficients.
Here is the setup code:
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; this ran on an older version
import pandas as pd
import numpy as np

# Bunch values are (data, target, feature_names, ...) in this sklearn version
X, y, cols = list(load_boston().values())[:3]
data = pd.DataFrame(X)
data.columns = cols
data['label'] = y

# simple ordered 66/34 train/test split
tv_cut = int(.66 * len(data))
data_train = data.iloc[:tv_cut]
data_test = data.iloc[tv_cut:]
X_train, y_train = data_train.drop('label', axis=1), data_train.label
X_test, y_test = data_test.drop('label', axis=1), data_test.label

# standardize (each split with its own mean/std), then add an explicit intercept column
X_train -= X_train.mean()
X_test -= X_test.mean()
X_train /= X_train.std()
X_test /= X_test.std()
X_train['intercept'] = 1
X_test['intercept'] = 1
def fit_model(X_train, y_train):
    # closed-form OLS via the normal equations: beta = inv(X'X) X'y
    XtX = X_train.T.dot(X_train)
    beta = np.linalg.inv(XtX).dot(X_train.T).dot(y_train)
    return pd.Series(dict(zip(X_train.columns, beta)))
beta = fit_model(X_train, y_train)
# beta
# CRIM 0.799692
# ZN 0.324976
# INDUS 0.176654
# CHAS 0.210327
# NOX -0.921285
# RM 6.362226
# AGE -1.357903
# DIS -1.876253
# RAD 0.194442
# TAX -0.973645
# PTRATIO -1.422592
# B 0.667066
# LSTAT -0.552090
# intercept 25.227027
# dtype: float64
These parameters ultimately match what both a sklearn model and a statsmodels model give me.
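For example, a quick cross-check along these lines (a sketch; note fit_intercept=False, since the intercept column is already part of X_train):

from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

# both should reproduce the normal-equations solution above
skl_coef = LinearRegression(fit_intercept=False).fit(X_train, y_train).coef_
sm_coef = sm.OLS(y_train, X_train).fit().params
print(np.allclose(beta, skl_coef), np.allclose(beta, sm_coef))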
I then go and compute what I thought should be the standard deviations of the coefficients, i.e. the square roots of the diagonal of Var(y) * inv(X'X):
betas_std = pd.Series(np.diag(y_train.var() * np.linalg.inv(X_train.T.dot(X_train)))**.5,
                      index=X_train.columns)
# CRIM 0.816758
# ZN 0.711078
# INDUS 0.745527
# CHAS 0.489583
# NOX 1.123602
# RM 0.746347
# AGE 0.791013
# DIS 0.874786
# RAD 0.535846
# TAX 0.591263
# PTRATIO 0.609934
# B 0.556539
# LSTAT 0.821042
# intercept 0.471111
When compared to the statsmodels results below, these figures differ by a roughly constant factor of ~2.77, and I am not sure why.
Further, I attempt the bootstrap estimate with what seem like natural parameters: 1,000 rounds, sampling 50% of the data each time:
betas = []
for _ in range(1000):
    # sample 50% of the training rows; pandas sample() draws without replacement by default
    X_temp = X_train.sample(frac=.5)
    y_temp = y_train.loc[X_temp.index]
    betas.append(fit_model(X_temp, y_temp))
betas = pd.concat(betas, axis=1).T
betas.std()
# CRIM 0.369258
# ZN 0.267626
# INDUS 0.310513
# CHAS 0.231419
# NOX 0.499343
# RM 0.327914
# AGE 0.372655
# DIS 0.356496
# RAD 0.199193
# TAX 0.202993
# PTRATIO 0.231562
# B 0.280280
# LSTAT 0.455236
# intercept 0.186017
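To see how sensitive these numbers are, I can rerun the loop at different sampling fractions (a quick sketch; the frac values and the 200 rounds are arbitrary choices):

for frac in (.25, .5, .75):
    b = []
    for _ in range(200):
        X_temp = X_train.sample(frac=frac)
        b.append(fit_model(X_temp, y_train.loc[X_temp.index]))
    # one summary number per frac: the mean of the per-coefficient standard deviations
    print(frac, pd.concat(b, axis=1).T.std().mean())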
Changing the number of rounds and the sampling percentage changes these results, as the sketch above illustrates. The baseline model, which I've assumed is correct, is:
import statsmodels.api as sm
ols = sm.OLS(y_train, X_train)
ols_result = ols.fit()
ols_result.bse
# CRIM 0.294650
# ZN 0.256525
# INDUS 0.268953
# CHAS 0.176620
# NOX 0.405346
# RM 0.269249
# AGE 0.285362
# DIS 0.315584
# RAD 0.193309
# TAX 0.213301
# PTRATIO 0.220037
# B 0.200775
# LSTAT 0.296195
# intercept 0.169956
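For completeness, the roughly constant factor I mentioned can be checked directly (using betas_std and ols_result from above); if the factor really is constant, the ratios should cluster around 2.77:

# elementwise ratio of my formulaic standard deviations to statsmodels' bse
print((betas_std / ols_result.bse).describe())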
What is the cause of the difference between my formulaic calculation of the standard deviations and the statsmodels result? And what is/are the cause(s) of the different values from the sampling loop?