
I was playing around with some data on the 2016 presidential election, and I got a result that doesn't seem to make sense.

I am running a Logit model with the percentage that voted for Trump as the dependent variable; my two independent variables are the average minimum wage and the unemployment rate from 2012 to 2015.

Here is my code:

import statsmodels.api as sm
import pandas as pd

df = pd.read_csv("Data_sets/pres_and_unemp_data.csv", index_col=0)

y = df["pct"]                  # percentage that voted for Trump
X = df[["min_wage", "Rate"]]   # average minimum wage, unemployment rate
result = sm.Logit(y, X).fit()  # note: no intercept term is included
print(result.summary2())
print(df.corr())

And this is the output:

Optimization terminated successfully.
         Current function value: 0.647033
         Iterations 4
                         Results: Logit
================================================================
Model:              Logit            Pseudo R-squared: -0.383
Dependent Variable: pct              AIC:              2123.6788
Date:               2020-03-03 17:08 BIC:              2134.4813
No. Observations:   1638             Log-Likelihood:   -1059.8
Df Model:           1                LL-Null:          -766.38
Df Residuals:       1636             LLR p-value:      1.0000
Converged:          1.0000           Scale:            1.0000
No. Iterations:     4.0000
------------------------------------------------------------------
            Coef.    Std.Err.     z      P>|z|     [0.025   0.975]
------------------------------------------------------------------
min_wage    0.0418     0.0191   2.1860   0.0288    0.0043   0.0794
Rate        0.0182     0.0212   0.8555   0.3923   -0.0234   0.0598
================================================================

              Rate  min_wage       pct
Rate      1.000000  0.233336 -0.131478
min_wage  0.233336  1.000000 -0.310230
pct      -0.131478 -0.310230  1.000000
Karolis Koncevičius
    Please explain how this "doesn't make sense." You aren't looking at statistics that have much hope of being comparable: the coefficients even in an OLS multiple regression will not be simply related to the raw correlations unless the explanatory variables are orthogonal. – whuber Mar 03 '20 at 16:41
  • Why are you using a logit model? Is your dependent variable a one if they voted for Trump and a zero if they did not vote for Trump? – strateeg32 Mar 03 '20 at 17:46
  • If your response is really % voted for Trump then the question is whether your software will treat it correctly. What are the observations? Individual people or areas? – Nick Cox Mar 03 '20 at 18:42
  • That this is logistic reg, instead of OLS (& whether Python handled it correctly) is irrelevant. It's a very general phenomenon. In the dup, the signs are reversed, but it's the same thing & has the same explanation: wage & rate are correlated. I think you will find the information you need in the linked thread. Please read it. If it isn't what you want / you still have a question afterwards, come back here & edit your question to state what you learned & what you still need to know. Then we can provide the information you need w/o duplicating material elsewhere that already didn't help you. – gung - Reinstate Monica Mar 03 '20 at 20:40

1 Answer


I will try to keep this to the basics, since, if I understand you correctly, you want general insight rather than the details.

A logit model is a nonlinear function and can therefore capture nonlinear relationships between variables. Correlation, by contrast, measures only the linear relationship between two variables, so it is quite restrictive.

To see how restrictive it can be, note that even in a linear regression a correlation will not give you many details or nuances. Suppose you have $y = b_0 + b_1 x_1$. Then the correlation between $x_1$ and $y$ will indeed have the same sign as $b_1$; in fact, if both variables are standardized, it equals $b_1$ exactly.
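As a quick check, here is a minimal sketch with simulated data (the numbers are made up for illustration, not taken from your data set): after standardizing both variables, the simple-regression slope reproduces the Pearson correlation exactly.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 + 0.5 * x + rng.normal(size=500)  # true slope 0.5

# Standardize both variables, then regress y on x with an intercept.
xs = (x - x.mean()) / x.std()
ys = (y - y.mean()) / y.std()
fit = sm.OLS(ys, sm.add_constant(xs)).fit()

print(fit.params[1])            # standardized slope ...
print(np.corrcoef(x, y)[0, 1])  # ... equals the Pearson correlation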

Now if you have the model $y = b_0 + b_1 x_1 + b_2 x_2$, the signs of $b_1$ and $b_2$ do not have to match the signs of the correlations between $y$ and $x_1$ and $x_2$ respectively. In the second model, $b_1$ is the effect of $x_1$ given $x_2$. An example would be a regression with a police officer's salary as the dependent variable, $x_1$ being age and $x_2$ being a dummy variable for gender. The raw correlation between age and salary mixes in the effect of gender and other influences, whereas the regression coefficient gives you the effect of age on salary with gender held fixed. Hence 'given'.
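To make the sign flip concrete, here is a minimal simulation (hypothetical data, not your election data set): $x_1$ has a positive effect given $x_2$, but because the two predictors are strongly correlated and $x_2$'s effect is large and negative, the marginal correlation between $x_1$ and $y$ comes out negative while the logit coefficient on $x_1$ stays positive.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)  # corr(x1, x2) ~ 0.9

# True model on the log-odds scale: +1 effect of x1, -3 effect of x2.
p = 1 / (1 + np.exp(-(1.0 * x1 - 3.0 * x2)))
y = rng.binomial(1, p)

print(np.corrcoef(x1, y)[0, 1])  # marginal correlation of x1 with y: negative
X = sm.add_constant(np.column_stack([x1, x2]))
print(sm.Logit(y, X).fit(disp=0).params[1])  # coefficient on x1: positive, near 1

This is the same pattern as in your output: once the predictors are correlated, the raw correlations and the model coefficients answer different questions.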

strateeg32
  • I get that in a multiple linear regression the signs don't necessarily need to match; what I am confused about is that even if I run the Logit on only one independent variable (min_wage) it still has a positive coefficient. I converted the percentage that voted for Trump to a dummy to be more true to the Logit model, but that didn't change much. – Dimitar Dimitrov Mar 04 '20 at 07:47