I have a dataset with aneurysm size as a binary variable (above or below a threshold) and location as a categorical variable. I want to know whether any location has statistically smaller or larger aneurysms than the other locations. (I also have risk factors/confounders that I will add to the final model, but to keep it simple I include only size and location in this question.) In other words: if a patient has an aneurysm in location X, is it statistically more likely to be small or large, compared with the mean aneurysm size?

Here's an example of my data:

clear
input float(sizeBinary locationCat)
0 1
1 6
. 7
0 3
0 1
1 5
0 5
. 7
1 5
1 1
. 1
. 1
0 4
1 4
1 7
1 7
1 1
1 1
0 7
0 3
0 1
1 1
1 7
1 5
1 5
1 7
0 1
1 .
1 7
1 2
1 5
1 6
0 6
1 7
1 1
0 4
0 1
. 1
0 7
0 3
1 1
1 1
0 1
. 5
1 7
1 7
0 1
0 1
1 6
0 1
. 7
1 1
1 1
0 1
1 3
0 7
0 1
0 3
0 5
. 1
1 7
1 7
. .
1 3
1 7
1 1
0 7
0 1
0 1
. .
0 3
1 5
1 1
0 6
1 1
1 2
1 .
1 5
0 1
1 7
0 1
0 7
. .
1 2
0 1
0 1
. 7
. 1
. 1
1 1
1 7
1 1
1 .
1 1
0 1
1 6
0 1
0 1
1 7
1 6
0 1
1 7
1 1
1 7
1 6
0 1
1 1
0 1
0 2
1 1
1 3
1 7
0 .
1 1
0 1
1 6
1 5
0 7
1 5
1 6
0 6
0 .
1 7
0 1
1 7
0 7
1 6
0 3
0 1
0 2
1 7
1 7
1 5
0 1
1 7
0 7
0 4
0 3
0 1
0 2
0 7
1 .
1 1
1 6
1 1
0 6
0 1
1 1
1 5
1 7
1 1
0 3
0 7
0 6
1 3
1 .
0 1
. 6
0 1
1 7
0 7
0 .
1 1
. .
1 7
1 1
1 6
1 1
1 6
1 6
0 1
. 5
1 7
0 .
. 1
0 1
end

I've run a logistic regression of size on location, yielding:

. logistic sizeBinary i.locationCat

Logistic regression                             Number of obs     =        149
                                                LR chi2(6)        =      17.61
                                                Prob > chi2       =     0.0073
Log likelihood = -93.258808                     Pseudo R2         =     0.0863

------------------------------------------------------------------------------
  sizeBinary | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 locationCat |
          2  |   1.269231   1.088458     0.28   0.781     .2363596     6.81566
          3  |   .6346154   .4227532    -0.68   0.495     .1719797    2.341769
          4  |   .4230769   .5009663    -0.73   0.468     .0415441    4.308528
          5  |   6.980769   5.669801     2.39   0.017     1.420872    34.29665
          6  |        3.3   1.940242     2.03   0.042     1.042433    10.44671
          7  |          3   1.335371     2.47   0.014     1.253808     7.17813
             |
       _cons |   .7878788   .2066054    -0.91   0.363     .4712473    1.317255
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

From this, I can deduce that locations 5, 6 and 7 have statistically significantly higher odds of a large aneurysm than location 1.

However, I want to know whether ANY location harbors statistically significantly smaller or larger aneurysms than the mean, so I ran a margins command:

. margins i.locationCat

Adjusted predictions                            Number of obs     =        149
Model VCE    : OIM

Expression   : Pr(sizeBinary), predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 locationCat |
          1  |    .440678   .0646347     6.82   0.000     .3139963    .5673596
          2  |         .5   .2041241     2.45   0.014      .099924     .900076
          3  |   .3333333   .1360828     2.45   0.014      .066616    .6000506
          4  |        .25   .2165064     1.15   0.248    -.1743447    .6743447
          5  |   .8461538   .1000683     8.46   0.000     .6500237    1.042284
          6  |   .7222222   .1055718     6.84   0.000     .5153053    .9291391
          7  |   .7027027   .0751416     9.35   0.000     .5554279    .8499775
------------------------------------------------------------------------------

However, it seems ALL locations have significantly larger aneurysms (all coefficients positive)? Or am I misunderstanding something?

Also they are almost all significant?

Surely I'm doing something wrong here.

EDIT: In response to Dimitriy's answer,

margins g.locationCat produces:

. margins g.locationCat

Contrasts of adjusted predictions               Number of obs     =        149
Model VCE    : OIM

Expression   : Pr(sizeBinary), predict()

------------------------------------------------
             |         df        chi2     P>chi2
-------------+----------------------------------
 locationCat |
(1 vs mean)  |          1        1.78     0.1828
(2 vs mean)  |          1        0.05     0.8153
(3 vs mean)  |          1        2.72     0.0992
(4 vs mean)  |          1        2.35     0.1252
(5 vs mean)  |          1        9.27     0.0023
(6 vs mean)  |          1        3.01     0.0828
(7 vs mean)  |          1        3.76     0.0524
      Joint  |          6       21.92     0.0013
------------------------------------------------

--------------------------------------------------------------
             |            Delta-method
             |   Contrast   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
 locationCat |
(1 vs mean)  |  -.1014778   .0761659     -.2507601    .0478046
(2 vs mean)  |  -.0421557   .1804968      -.395923    .3116116
(3 vs mean)  |  -.2088224   .1266678     -.4570866    .0394418
(4 vs mean)  |  -.2921557   .1905239     -.6655757    .0812642
(5 vs mean)  |   .3039981    .099849      .1082977    .4996985
(6 vs mean)  |   .1800665   .1038182     -.0234134    .3835464
(7 vs mean)  |    .160547   .0827662     -.0016719    .3227658
--------------------------------------------------------------

And margins, dydx(locationCat) produces:

. margins, dydx(locationCat)

Conditional marginal effects                    Number of obs     =        149
Model VCE    : OIM

Expression   : Pr(sizeBinary), predict()
dy/dx w.r.t. : 2.locationCat 3.locationCat 4.locationCat 5.locationCat 6.locationCat 7.locationCat

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 locationCat |
          2  |    .059322   .2141128     0.28   0.782    -.3603314    .4789755
          3  |  -.1073446   .1506525    -0.71   0.476     -.402618    .1879287
          4  |   -.190678   .2259483    -0.84   0.399    -.6335285    .2521726
          5  |   .4054759   .1191272     3.40   0.001     .1719908     .638961
          6  |   .2815443   .1237863     2.27   0.023     .0389276    .5241609
          7  |   .2620247   .0991156     2.64   0.008     .0677617    .4562877
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
Paze

1 Answer

Your regression has 149 observations, but your example data has 100, 20 of which are unusable because of missing values. Since your regression output does not match your data, I will not attempt to replicate your regression here.

Your margins command does not produce coefficients, either additive or multiplicative. It produces the conditional expected probability of having an aneurysm above the threshold at each location. Since probabilities fall in [0,1], these will generally be positive. For example, Location 1 has an expected probability of a large aneurysm of 0.44, and Location 7 has 0.70.
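To make the link between the two tables concrete, here is a minimal Python sketch (not Stata) that converts the odds ratios reported by `logistic` into the probabilities that `margins` reports. The numbers are copied from the output in the question; the names `baseline_odds`, `odds_ratios` and `odds_to_prob` are purely illustrative.

```python
# Values copied from the `logistic` output in the question.
baseline_odds = 0.7878788  # _cons: odds of a large aneurysm at location 1
odds_ratios = {2: 1.269231, 3: 0.6346154, 4: 0.4230769,
               5: 6.980769, 6: 3.3, 7: 3.0}  # each relative to location 1


def odds_to_prob(odds):
    """Convert odds p / (1 - p) back into the probability p."""
    return odds / (1.0 + odds)


# Location 1 uses the baseline odds directly; every other location
# multiplies the baseline odds by its odds ratio first.
probs = {1: odds_to_prob(baseline_odds)}
for loc, ratio in odds_ratios.items():
    probs[loc] = odds_to_prob(baseline_odds * ratio)

for loc in sorted(probs):
    print(f"location {loc}: Pr(large) = {probs[loc]:.4f}")
```

With location as the only regressor, the resulting probabilities should agree with the Margin column of the `margins i.locationCat` table up to rounding, which is why every entry there is positive.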

You probably have something like this in mind

margins g.locationCat
margins, dydx(locationCat)

The first compares the expected probability at each location with the global mean probability. The second calculates the change in expected probability relative to location 1. So Location 7 versus 1 should be $\approx .26$.
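As a rough cross-check of what these two commands report, the sketch below (Python, not Stata) recomputes both sets of differences from the predicted probabilities in the question's `margins i.locationCat` table. It assumes the "mean" in the `g.` contrasts is the unweighted average of the seven location-level probabilities, which is consistent with the numbers in the question's output; the variable names are illustrative.

```python
# Predicted probabilities copied from the `margins i.locationCat` table.
margins = {1: 0.440678, 2: 0.5, 3: 0.3333333, 4: 0.25,
           5: 0.8461538, 6: 0.7222222, 7: 0.7027027}

# Assumption: the grand mean is the unweighted mean of the level margins.
grand_mean = sum(margins.values()) / len(margins)

# Analogue of `margins g.locationCat`: each location minus the grand mean.
vs_mean = {loc: p - grand_mean for loc, p in margins.items()}

# Analogue of `margins, dydx(locationCat)`: each location minus location 1.
vs_base = {loc: p - margins[1] for loc, p in margins.items() if loc != 1}

print(f"grand mean = {grand_mean:.7f}")
for loc in sorted(vs_mean):
    print(f"({loc} vs mean) = {vs_mean[loc]:+.7f}")
for loc in sorted(vs_base):
    print(f"({loc} vs 1)    = {vs_base[loc]:+.7f}")
```

This reproduces only the point estimates; the standard errors in the Stata output come from the delta method and are not recomputed here.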

If you have controls, the logic is very similar.

dimitriy
  • Thank you Dimitriy. I have updated my question to include my full dataset in the example and the results from your two commands. To follow up, does this mean that none of the locations have significantly larger or smaller aneurysms compared to the global mean (at alpha 0.05)? And in the dydx output, I have a 26% higher probability of harboring a large (1 in the binary size variable) aneurysm in location 7 than in location 1? Does this mean that for the negative coefficients, I have a 10% higher probability of harboring a smaller aneurysm in loc 3 than in loc 1 (-0.1 coef)? – Paze Sep 01 '20 at 19:20
  • Or do the negative coefficients mean I have a 10% lower probability of harboring a large aneurysm... Maybe this is the same, I'm getting a bit tired. – Paze Sep 01 '20 at 19:24
  • You missed location 5 (p-value of 0.002). Location 7 also comes close, with a p-value slightly above 5%. Location 7 has an increase of 26 percentage points over L1 (not percent, since that would be $\frac{.2620247}{.440678}\approx 60\%$). Negative means a lower probability: $\Pr(L3)-\Pr(L1) = .3333333 - .440678 = -.1073447$, or 11 pps lower. – dimitriy Sep 01 '20 at 19:32
  • Right, hopefully my last followup: So looking back at the margins contrasts against the global mean, does this mean location 5 has a 30 pps higher probability of harboring a large aneurysm than the global mean, and this is statistically significant? – Paze Sep 01 '20 at 19:36
  • Yes, you can either look at the 95% CI (which does not include zero), or the p-value 0.0023 in the table above (P>chi2 column). – dimitriy Sep 01 '20 at 19:38
  • Thank you. So to sum up this question. I have answered my research question of "Does any location harbor significantly smaller or larger aneurysms in general?" The answer to this question is location 5, harboring larger aneurysms in general. Correct? – Paze Sep 01 '20 at 19:43
  • Yes, that sounds right, though "in general" does not have a well-defined meaning for me. – dimitriy Sep 01 '20 at 19:48
  • Thank you Dimitriy, I will mull this over and post any thoughts I may have afterwards in a thread of their own, so as not to extend this discussion in the comments. Again, thanks a lot for your help, I learned a lot. – Paze Sep 01 '20 at 19:51
  • Hi Dimitriy, I'm not sure we answered the question of why almost all probabilities in the margins command are significant. What does it mean that L1 has a 44% probability with a p=0.000? – Paze Sep 03 '20 at 10:57
  • It means you would be very unlikely to find an expected probability of a big aneurysm in L1 as big as this or larger, had you actually sampled from a population where no big aneurysms ever happen in L1 (the zero point null). Since big aneurysms are not rare in your data in any location, everything will be individually significantly different from zero (and collectively as well, I would wager). I will admit that this is not a very interesting null to test in this setting, but software has to have some defaults, so this won't be the last time you encounter something like this. – dimitriy Sep 03 '20 at 14:22
  • [This](https://stats.stackexchange.com/a/72583/7071) is the clearest way I know to explain hypothesis testing without math. Spend some time with this parable, and try to analogize your own examples to it. – dimitriy Sep 03 '20 at 14:28
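The test statistics discussed in the comments can be reproduced by hand from the estimates and standard errors in the question's output. The following is a small Python sketch (not Stata), assuming the usual Wald form: estimate divided by standard error, compared against a standard normal. The function names are illustrative, not part of any Stata or statistics library.

```python
import math


def wald_z(estimate, std_err):
    """Wald statistic for the null hypothesis that the true value is zero."""
    return estimate / std_err


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Location 1 in `margins i.locationCat`: the null is Pr(large) = 0,
# which is why a probability of 0.44 comes out "highly significant".
z_loc1 = wald_z(0.440678, 0.0646347)

# Location 5 in `margins g.locationCat`: the null is "no difference from
# the mean". With 1 df, the chi2 statistic is the squared Wald z.
z_loc5 = wald_z(0.3039981, 0.099849)
chi2_loc5 = z_loc5 ** 2
p_loc5 = two_sided_p(z_loc5)

# Percentage points versus percent for location 7 relative to location 1.
abs_change = 0.2620247             # difference in probabilities (pps / 100)
rel_change = abs_change / 0.440678  # change relative to location 1's level

print(f"z (loc 1 vs 0)       = {z_loc1:.2f}")
print(f"chi2 (loc 5 vs mean) = {chi2_loc5:.2f}, p = {p_loc5:.4f}")
print(f"loc 7 vs loc 1: {abs_change * 100:.0f} pps, {rel_change * 100:.0f}% relative")
```

The first statistic corresponds to the z column of the plain `margins` table and the second to the chi2 and P>chi2 columns of the contrast table, so the two tables answer different null hypotheses even though they are built from the same predicted probabilities.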