Is there a way to obtain the coefficient value 0 for the reference categories of categorical variables in a statsmodels GLM?

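    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Gamma GLM with a log link; C(...) marks a predictor as categorical.
    # The formula is split across adjacent string literals so it stays valid Python.
    formula = (
        "cost_tarif_median ~ age + anc_veh + C(formule) + C(veh_usage) "
        "+ C(categorie) + C(groupe_sra) + C(zonier)"
    )
    model = smf.glm(
        formula=formula,
        # Log() is used instead of the lower-case log() alias, which newer releases deprecate
        family=sm.families.Gamma(link=sm.families.links.Log()),
        data=df_train,
    )

    model_fit = model.fit()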
  • Why do you expect reference categories to have a coefficient value of 1? I would expect 0. See https://stats.stackexchange.com/questions/285210/what-to-do-in-a-multinomial-logistic-regression-when-all-levels-of-dv-are-of-int/544656#544656, which gives an R package ... I do not know if there is something similar in python – kjetil b halvorsen Nov 21 '21 at 13:10
  • Right: I made a typing mistake. It is corrected now. – Fabrice BOUCHAREL Nov 21 '21 at 18:29
  • I am not sure it’s helpful to think of them as zero. They are just not separately identified from the intercept term, and likely non-zero. – dimitriy Nov 21 '21 at 18:33
  • Why do you want the coefficients that are by definition zero? What's the use case? – Josef Nov 21 '21 at 22:02
  • In terms of implementation, the main problem is that the formula handling by patsy does not allow overparameterized categorical variables. Otherwise, `fit_constrained` could be used to constrain the reference coefficient to zero. `fit_constrained` returns the results for the full parameter vector, but the covariance matrix of the parameter estimates is singular because of the imposed constraints. – Josef Nov 21 '21 at 22:05
  • In terms of prediction or t_test it is possible to add arbitrary sets of values of the design matrix with the implied standard errors and confidence intervals. An example would be predicted cell means. But that does not affect the parameter estimation. – Josef Nov 21 '21 at 22:09
  • @Josef: once the model is fitted, I want to report coefficients for all categories of every variable, including the reference categories (the ones with coefficient value 0), to avoid questions. – Fabrice BOUCHAREL Nov 22 '21 at 07:34

1 Answer

It should certainly be possible to obtain an output coefficient table with the value 0 for these coefficients, as the corresponding problem in R is solved by the package gtsummary. I do not know whether something similar exists in Python. See "What to do in a multinomial logistic regression when all levels of DV are of interest?" (https://stats.stackexchange.com/questions/285210/what-to-do-in-a-multinomial-logistic-regression-when-all-levels-of-dv-are-of-int/544656#544656).
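In the absence of a ready-made tool, such a table can be assembled by hand after fitting. The following is only a minimal sketch, not a statsmodels feature: the helper `add_reference_levels` and the numeric values are hypothetical, and it assumes patsy's default treatment coding, where fitted parameters are named like `C(formule)[T.level]` and the reference level has no parameter at all.

```python
def add_reference_levels(params, levels_by_term):
    """Return (term, level, coef) rows, filling in 0.0 for dropped reference levels.

    params         : dict mapping parameter names to estimates,
                     e.g. model_fit.params.to_dict()
    levels_by_term : dict mapping a formula term such as "C(formule)"
                     to the list of all levels of that variable
    """
    rows = []
    for term, levels in levels_by_term.items():
        for level in levels:
            # Name that patsy's treatment coding gives a non-reference level
            name = f"{term}[T.{level}]"
            # A level without a fitted parameter is treated as the reference
            rows.append((term, str(level), params.get(name, 0.0)))
    return rows


# Hypothetical estimates standing in for model_fit.params.to_dict()
params = {
    "Intercept": 5.2,
    "C(formule)[T.B]": 0.31,
    "C(formule)[T.C]": -0.12,
    "age": -0.004,
}
levels = {"C(formule)": ["A", "B", "C"]}

table = add_reference_levels(params, levels)
for row in table:
    print(row)
```

With a real fit, the levels could be taken from the training data, for example `sorted(df_train["formule"].unique())`. Note that this sketch also assigns 0 to any level that is misspelled or absent from the training data, so the level lists should be checked against the fitted parameter names.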

As for some of the comments:

Why do you want the coefficients that are by definition zero? What's the use case?

One use case is fewer questions from naive users: there are quite a lot of related questions on this site! So the main use case is improved communication.

I am not sure it’s helpful to think of them as zero. They are just not separately identified from the intercept term, and likely non-zero.

No. As the model is defined (and here I take the categorical encoding to be part of the model; one could of course get an equivalent model with other encodings, but parameter interpretation must be relative to the encoding actually used), these coefficients are zero, period. Since they are zero by definition, there is no sampling variation, so their standard errors are also 0.
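To make the encoding argument concrete, here is a sketch for a single categorical predictor $x$ with levels $1, \dots, K$ and level 1 as the reference. The notation is introduced here for illustration and is not taken from the question. Under treatment coding with the log link, the fitted model is

```latex
\log \mu_i = \beta_0 + \sum_{k=2}^{K} \beta_k \, \mathbf{1}[x_i = k]
```

which is the same as the overparameterized form with the constraint $\beta_1 = 0$ imposed:

```latex
\log \mu_i = \beta_0 + \sum_{k=1}^{K} \beta_k \, \mathbf{1}[x_i = k],
\qquad \beta_1 = 0 .
```

In this encoding $\beta_0$ is the log mean of the reference level and each $\beta_k$ is the log ratio of the mean at level $k$ to the mean at the reference level, so the reference coefficient is the fixed constant $\log 1 = 0$ and not an estimate.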

kjetil b halvorsen