I wanted to test some basic OLS behavior in Python with a sample dataset. One thing I wanted to try was getting a bootstrap estimate of the variance of the model's coefficients.
Here is the setup code:
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; this ran on an older version
import pandas as pd
import numpy as np

# Bunch values are (data, target, feature_names, ...) in this sklearn version
X, y, cols = list(load_boston().values())[:3]
data = pd.DataFrame(X)
data.columns = cols
data['label'] = y

# simple ordered 66/34 train/test split
tv_cut = int(.66 * len(data))
data_train = data.iloc[:tv_cut]
data_test = data.iloc[tv_cut:]
X_train, y_train = data_train.drop('label', axis=1), data_train.label
X_test, y_test = data_test.drop('label', axis=1), data_test.label

# standardize (each split with its own mean/std), then add an explicit intercept column
X_train -= X_train.mean()
X_test -= X_test.mean()
X_train /= X_train.std()
X_test /= X_test.std()
X_train['intercept'] = 1
X_test['intercept'] = 1
def fit_model(X_train, y_train):
    # closed-form OLS via the normal equations: beta = inv(X'X) X'y
    XtX = X_train.T.dot(X_train)
    beta = np.linalg.inv(XtX).dot(X_train.T).dot(y_train)
    return pd.Series(dict(zip(X_train.columns, beta)))
beta = fit_model(X_train, y_train)
# beta
# CRIM 0.799692
# ZN 0.324976
# INDUS 0.176654
# CHAS 0.210327
# NOX -0.921285
# RM 6.362226
# AGE -1.357903
# DIS -1.876253
# RAD 0.194442
# TAX -0.973645
# PTRATIO -1.422592
# B 0.667066
# LSTAT -0.552090
# intercept 25.227027
# dtype: float64
These parameters ultimately match what both a sklearn model and a statsmodels model give me.
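For example, a quick cross-check along these lines (a sketch; note fit_intercept=False, since the intercept column is already part of X_train):

from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

# both should reproduce the normal-equations solution above
skl_coef = LinearRegression(fit_intercept=False).fit(X_train, y_train).coef_
sm_coef = sm.OLS(y_train, X_train).fit().params
print(np.allclose(beta, skl_coef), np.allclose(beta, sm_coef))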
I then go and compute what I thought should be the standard deviations of the coefficients, i.e. the square roots of the diagonal of Var(y) * inv(X'X):
betas_std = pd.Series(np.diag(y_train.var() * np.linalg.inv(X_train.T.dot(X_train)))**.5,
                      index=X_train.columns)
# CRIM 0.816758
# ZN 0.711078
# INDUS 0.745527
# CHAS 0.489583
# NOX 1.123602
# RM 0.746347
# AGE 0.791013
# DIS 0.874786
# RAD 0.535846
# TAX 0.591263
# PTRATIO 0.609934
# B 0.556539
# LSTAT 0.821042
# intercept 0.471111
When compared to the statsmodels results below, these figures differ by a roughly constant factor of ~2.77, and I am not sure why.
Further, I attempt the bootstrap estimate with what seem like natural parameters: 1,000 rounds, sampling 50% of the data each time:
betas = []
for _ in range(1000):
    # sample 50% of the training rows; pandas sample() draws without replacement by default
    X_temp = X_train.sample(frac=.5)
    y_temp = y_train.loc[X_temp.index]
    betas.append(fit_model(X_temp, y_temp))
betas = pd.concat(betas, axis=1).T
betas.std()
# CRIM 0.369258
# ZN 0.267626
# INDUS 0.310513
# CHAS 0.231419
# NOX 0.499343
# RM 0.327914
# AGE 0.372655
# DIS 0.356496
# RAD 0.199193
# TAX 0.202993
# PTRATIO 0.231562
# B 0.280280
# LSTAT 0.455236
# intercept 0.186017
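To see how sensitive these numbers are, I can rerun the loop at different sampling fractions (a quick sketch; the frac values and the 200 rounds are arbitrary choices):

for frac in (.25, .5, .75):
    b = []
    for _ in range(200):
        X_temp = X_train.sample(frac=frac)
        b.append(fit_model(X_temp, y_train.loc[X_temp.index]))
    # one summary number per frac: the mean of the per-coefficient standard deviations
    print(frac, pd.concat(b, axis=1).T.std().mean())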
Changing the number of rounds and the sampling percentage changes these results, as the sketch above illustrates. The baseline model, which I've assumed is correct, is:
import statsmodels.api as sm
ols = sm.OLS(y_train, X_train)
ols_result = ols.fit()
ols_result.bse
# CRIM 0.294650
# ZN 0.256525
# INDUS 0.268953
# CHAS 0.176620
# NOX 0.405346
# RM 0.269249
# AGE 0.285362
# DIS 0.315584
# RAD 0.193309
# TAX 0.213301
# PTRATIO 0.220037
# B 0.200775
# LSTAT 0.296195
# intercept 0.169956
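For completeness, the roughly constant factor I mentioned can be checked directly (using betas_std and ols_result from above); if the factor really is constant, the ratios should cluster around 2.77:

# elementwise ratio of my formulaic standard deviations to statsmodels' bse
print((betas_std / ols_result.bse).describe())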
What is the cause of the difference between my formulaic calculation of the standard deviations and the statsmodels result? And what is/are the cause(s) of the different values from the sampling loop?