
The test statistic for the Hosmer-Lemeshow test (HLT) for goodness of fit (GOF) of a logistic regression model is defined as follows:

The sample is split into $d=10$ deciles, $D_1, D_2, \dots , D_{d}$; for each decile one computes the following quantities:

  • $O_{1d}=\displaystyle \sum_{i \in D_d} y_i$, i.e. the observed number of positive cases in decile $D_d$;
  • $O_{0d}=\displaystyle \sum_{i \in D_d} (1-y_i)$, i.e. the observed number of negative cases in decile $D_d$;
  • $E_{1d}=\displaystyle \sum_{i \in D_d} \hat{\pi}_i$, i.e. the estimated number of positive cases in decile $D_d$;
  • $E_{0d}= \displaystyle \sum_{i \in D_d} (1-\hat{\pi}_i)$, i.e. the estimated number of negative cases in decile $D_d$;

where $y_i$ is the observed binary outcome for the $i$-th observation and $\hat{\pi}_i$ the estimated probability for that observation.

The test statistic is then defined as:

$X^2 = \displaystyle \sum_{h=0}^{1} \sum_{g=1}^d \left( \frac{(O_{hg}-E_{hg})^2}{E_{hg}} \right)= \sum_{g=1}^d \left( \frac{ O_{1g} - n_g \hat{\pi}_g}{\sqrt{n_g (1-\hat{\pi}_g) \hat{\pi}_g}} \right)^2,$

where $\hat{\pi}_g$ is the average estimated probability in decile $g$ and $n_g$ is the number of observations in that decile.
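For concreteness, here is a minimal sketch of this computation in Python (assuming `y` is a 0/1 array of outcomes and `pi_hat` the fitted probabilities $\hat{\pi}_i$; forming the deciles by splitting the observations sorted on $\hat{\pi}_i$ is one of several possible conventions):

```python
import numpy as np

def hosmer_lemeshow_statistic(y, pi_hat, d=10):
    """Hosmer-Lemeshow X^2 with groups formed as deciles of the
    predicted probabilities (one possible grouping convention)."""
    y = np.asarray(y, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    order = np.argsort(pi_hat)            # sort observations by predicted probability
    groups = np.array_split(order, d)     # d roughly equal-sized deciles
    X2 = 0.0
    for g in groups:
        n_g = len(g)
        O1 = y[g].sum()                   # observed positives in the decile
        pi_g = pi_hat[g].mean()           # average estimated probability in the decile
        X2 += (O1 - n_g * pi_g) ** 2 / (n_g * pi_g * (1 - pi_g))
    return X2
```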

According to Hosmer and Lemeshow (see this link) this statistic has (under certain assumptions) a $\chi^2$ distribution with $d-2$ degrees of freedom.

On the other hand, if I were to define a contingency table with $d$ rows (corresponding to the deciles) and 2 columns (corresponding to the true/false binary outcome), then the test statistic for the $\chi^2$ test on this contingency table would be the same as the $X^2$ defined above. However, in the case of the contingency table, this test statistic is $\chi^2$ with $(d-1)(2-1)=d-1$ degrees of freedom. So one degree of freedom more!

How can one explain this difference in the number of degrees of freedom?
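One way to see which reference distribution fits is to simulate the null distribution of $X^2$ when the logistic model is true. Below is a hedged sketch (it assumes `statsmodels` for the maximum-likelihood fit and reuses the hypothetical `hosmer_lemeshow_statistic` helper above; the one-covariate model, sample size and number of replications are arbitrary choices):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
d, n, n_sim = 10, 500, 2000
stats = []
for _ in range(n_sim):
    x = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x)))      # data generated from a true logistic model
    y = rng.binomial(1, p)
    X = sm.add_constant(x)
    pi_hat = sm.Logit(y, X).fit(disp=0).predict(X)   # ML fit on the ungrouped data
    stats.append(hosmer_lemeshow_statistic(y, pi_hat, d))
stats = np.array(stats)

# Rejection rate at the nominal 5% level under the two candidate df choices;
# with d-2 df it should be close to 0.05, with d-1 df noticeably below it.
for df in (d - 2, d - 1):
    print(df, np.mean(stats > chi2.ppf(0.95, df)))
```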

EDIT: additions after reading comments:

@whuber

They say (see Hosmer D.W., Lemeshow S. (1980), A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics, A10, 1043-1069) that there is a theorem, proved by Moore and Spruill, from which it follows that if (1) the parameters are estimated using likelihood functions for ungrouped data and (2) the frequencies in the 2xg table depend on the estimated parameters, i.e. the cells are random, not fixed, then, under appropriate regularity conditions, the goodness-of-fit statistic under (1) and (2) is that of a central chi-square with the usual reduction of degrees of freedom due to estimated parameters, plus a sum of weighted chi-square variables.

Then, if I understand their paper well, they try to find an approximation for this 'correction term', which, if I understand it correctly, is this weighted sum of chi-square random variables, and they do so by running simulations. I must admit that I do not fully understand what they say there, hence my question: why are these cells random, and how does that influence the degrees of freedom? Would it be different if I fixed the borders of the cells and then classified the observations into these fixed cells based on the estimated score? In that case the cells are not random, though the 'content' of the cells is.

@Frank Harrell: couldn't it be that the 'shortcomings' of the Hosmer-Lemeshow test that you mention in your comments below are just a consequence of the approximation of the weighted sum of chi-squares?

  • The book contains a detailed description of this test and the basis for it. Your question is fully answered on pp 145-149. Determining degrees of freedom in $\chi^2$ tests is a subtle thing, because most of these tests are approximations (in the first place) and those approximations are good only when seemingly minor technical conditions apply. For some discussion of all this, see http://stats.stackexchange.com/a/17148. H&L took a purely practical route: they base their recommendation of $d-2$ DF on "an extensive set of simulations." – whuber Aug 17 '15 at 14:45
  • @whuber: thanks, but I have been reading the article they refer to on these pages in the book (Hosmer and Lemeshow, 1980) and there it shows that it has something to do with the fact that the deciles are constructed using an estimated score. They refer to some theorem that gives a formula for the correction to the degrees of freedom, and they find an empirical (i.e. simulated) approximation for the correction formula. I would like to understand the formula or at least the 'intuitive' explanation for the reason for the correction. –  Aug 17 '15 at 15:10
  • It would help to reproduce that theorem here in your question, then. (I have only read the book; I haven't consulted the original papers they cite.) – whuber Aug 17 '15 at 15:18
  • @whuber: I added it at the bottom of my question –  Aug 17 '15 at 18:01
  • This test is now considered obsolete due to (1) lack of power, (2) binning of continuous probabilities, and (3) arbitrariness in the choice of binning and in the definition of deciles. The Hosmer-le Cessie 1 d.f. test or the Spiegelhalter test are recommended. See for example the R `rms` package `residuals.lrm` and `val.prob` functions. – Frank Harrell Aug 18 '15 at 16:23
  • @Frank Harrell: (a) even if the Hosmer-Lemeshow test is obsolete, I think it is still interesting to understand the difference with $\chi^2$ and (b) do you have a reference that shows that the Spiegelhalter test has more power than the Hosmer-Lemeshow test? –  Aug 18 '15 at 18:52
  • Far greater power. So does the Hosmer-le Cessie test (another single d.f. test). The Spiegelhalter test, though, is available only for independent sample validation. – Frank Harrell Aug 18 '15 at 19:44
  • @Frank Harrell: for my question (a) in my comment: I still would find it interesting to understand the difference with $\chi^2$; for (b) can you give a reference that shows that the Spiegelhalter test has far greater power than the Hosmer-Lemeshow test, albeit in the case of independent sample validation? –  Aug 18 '15 at 20:19
  • I have not seen power simulations; I've just seen many examples with much greater statistical significance from Spiegelhalter's. More is known about the one d.f. Hosmer-le Cessie sum of squared errors goodness-of-fit test vs. the Hosmer-Lemeshow test - see http://www.citeulike.org/user/harrelfe/article/13264327 . See also: http://www.citeulike.org/user/harrelfe/article/13265727 – Frank Harrell Aug 18 '15 at 20:44
  • @Frank Harrell: (part 1) I don't think that **'greater power'** and **'greater statistical significance'** are the same thing? I'm not sure what you mean by 'greater statistical significance', but if you mean lower *p-values* then I have two remarks: (1) if I remember well, the Spiegelhalter test makes a strong assumption of normality; if that assumption is violated, then, as the p-values are computed under that assumption, there are doubts about these p-values, and –  Aug 19 '15 at 04:09
  • @Frank Harrell: (part 2) (2) lower p-values do not imply higher significance, see my answer at http://stats.stackexchange.com/questions/166323/misunderstanding-a-p-value/166327#166327. I will read the paper on the Hosmer-le Cessie test, but I still **think it is interesting to understand the reason for the lower degrees of freedom.** –  Aug 19 '15 at 04:09
  • Correct, although the two are correlated. Still I believe that power simulations would strongly confirm my statement above. *But* a normality assumption is not important for the Spiegelhalter test. – Frank Harrell Aug 19 '15 at 11:37
  • @Frank Harrell: I will look up the paper by Spiegelhalter, but under equation (4.16) they talk about 'asymptotically normal' by the central limit theorem https://esc.fnwi.uva.nl/thesis/centraal/files/f1668180724.pdf –  Aug 19 '15 at 11:48
  • True; it's just that proportions are very well-behaved so convergence is faster than for non-limited-range statistical quantities. – Frank Harrell Aug 19 '15 at 13:17
  • @Frank Harrell: Is that based on a theorem? Because that would imply that sums of Bernoulli variables (0 or 1, with the same or different success probabilities) would converge faster to normal than a sum of normal variables (any real value)? (Moreover, the Spiegelhalter test is about sums of squares.) –  Aug 19 '15 at 13:32
  • These issues are IMHO very small in comparison with the original question. – Frank Harrell Aug 19 '15 at 13:43
  • @Frank Harrell: You are right, but that does not - in my opinion - mean that the original question I asked is not worth an investigation; the fact that Hosmer-Lemeshow approximated a correction term by doing simulations might (it remains to be investigated) be a partial cause of some 'underperformance'? It is not because it is obsolete that this question is irrelevant, I think? –  Aug 19 '15 at 14:00
  • Related to the edited posting that asks a new question, the $\chi^2$ approximation and which approximation to use for the d.f. have nothing to do with the major deficiencies of the H-L test. – Frank Harrell May 28 '16 at 12:50
  • @Frank Harrell: can you be more precise about these deficiencies? –  May 28 '16 at 13:18
  • I think details appear elsewhere on this site. Briefly, (1) Hosmer showed the test is arbitrary - it is very sensitive to exactly how deciles are computed; (2) it lacks power. You can see that it is based on imprecise quantities by plotting the binned calibration curve (as opposed to a smooth calibration curve) and noting the jumps. Also, it does not properly penalize for extreme overfitting. – Frank Harrell May 28 '16 at 13:28
  • I see that you have left the question as 'unanswered'. What is the main question that you still have? (Can't comment, posting as answer) – Math321 Aug 09 '17 at 18:15
  • @Math321: I think there are several questions (and they are clear questions, if I look at the votes)? –  Aug 10 '17 at 06:36

2 Answers


The theorem that you refer to (the "usual reduction of degrees of freedom due to estimated parameters" part) has been mostly advocated by R.A. Fisher. In 'On the Interpretation of Chi Square from Contingency Tables, and the Calculation of P' (1922) he argued for the $(R-1)(C-1)$ rule, and in 'The goodness of fit of regression formulae' (1922) he argued for reducing the degrees of freedom by the number of parameters used in the regression to obtain the expected values from the data. (It is interesting to note that people misused the chi-square test, with the wrong degrees of freedom, for more than twenty years after its introduction in 1900.)

Your case is of the second kind (regression) and not of the first kind (contingency table), although the two are related in that they are linear restrictions on the parameters.

Because you model the expected values based on your observed values, and you do this with a model that has two parameters, the 'usual' reduction in degrees of freedom is two plus one (an extra one because the $O_i$ need to sum up to a fixed total, which is another linear restriction). You end up effectively with a reduction of two, instead of three, because of the 'inefficiency' of the modelled expected values.


The chi-square test uses a $\chi^2$ statistic as a distance measure to express how close a result is to the expected data. In the many versions of the chi-square test the distribution of this 'distance' is related to the sum of squared deviations of normally distributed variables (which is true in the limit only, and is an approximation if you deal with non-normally distributed data).

For the multivariate normal distribution the density function is related to the $\chi^2$ by

$f(x_1,...,x_k) = \frac{e^{- \frac{1}{2}\chi^2} }{\sqrt{(2\pi)^k \vert \mathbf{\Sigma}\vert}}$

with $\vert \mathbf{\Sigma}\vert$ the determinant of the covariance matrix of $\mathbf{x}$

and $\chi^2 = (\mathbf{x}-\mathbf{\mu})^T \mathbf{\Sigma}^{-1}(\mathbf{x}-\mathbf{\mu})$ is the squared Mahalanobis distance, which reduces to the squared Euclidean distance if $\mathbf{\Sigma}=\mathbf{I}$.

In his 1900 article Pearson argued that the $\chi^2$ level surfaces are spheroids and that he could transform to spherical coordinates in order to integrate a value such as $P(\chi^2 > a)$, which then becomes a single integral.
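This connection can be checked numerically. The sketch below (with an arbitrarily chosen covariance matrix) verifies that the squared Mahalanobis distance of multivariate normal draws has the tail probabilities of a $\chi^2$ distribution with $k$ degrees of freedom:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
k = 3
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])                  # arbitrary (positive definite) covariance
mu = np.zeros(k)
x = rng.multivariate_normal(mu, Sigma, size=100_000)

# squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu) for every draw
d2 = np.einsum('ij,jk,ik->i', x - mu, np.linalg.inv(Sigma), x - mu)

# compare a few empirical tail probabilities with the chi^2(k) tail
for a in (1, 2, 6):
    print(a, np.mean(d2 > a), chi2.sf(a, df=k))
```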


It is this geometrical representation, $\chi^2$ as a distance and also a term in density function, that can help to understand the reduction of degrees of freedom when linear restrictions are present.

First, the case of a 2x2 contingency table. You should notice that the four values $\frac{O_i-E_i}{\sqrt{E_i}}$ are not four independent normally distributed variables. They are instead related to each other and boil down to a single variable.

Let's use the table

$O_{ij} = \begin{array}{cc} o_{11} & o_{12} \\ o_{21} & o_{22} \end{array}$

then if the expected values

$E_{ij} = \begin{array}{cc} e_{11} & e_{12} \\ e_{21} & e_{22} \end{array}$

were fixed, then $\sum_{ij} \frac{(o_{ij}-e_{ij})^2}{e_{ij}}$ would be distributed as a chi-square with four degrees of freedom. But often we estimate the $e_{ij}$ based on the $o_{ij}$, and then the variation is not like that of four independent variables. Instead we get that all the differences between $o$ and $e$ are the same:

$(o_{11}-e_{11}) = (o_{22}-e_{22}) = -(o_{21}-e_{21}) = -(o_{12}-e_{12}) = o_{11} - \frac{(o_{11}+o_{12})(o_{11}+o_{21})}{o_{11}+o_{12}+o_{21}+o_{22}}$

and they are effectively a single variable rather than four. Geometrically you can see this as the $\chi^2$ value being integrated not over a four-dimensional sphere but over a single line.
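A quick numerical sketch of this identity (the expected counts are the usual margin-based estimates under independence; the table entries are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
O = rng.integers(5, 50, size=(2, 2)).astype(float)   # an arbitrary 2x2 table of counts
n = O.sum()
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / n       # expected counts estimated from the margins

print(O - E)
# All four entries of O - E are equal up to sign: the deviations span a
# single dimension, so the statistic behaves like one squared normal (1 df),
# not four.
```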

Note that this contingency table test is not the test performed for the contingency table in the Hosmer-Lemeshow case (it uses a different null hypothesis!). See also section 2.1, 'the case when $\beta_0$ and $\underline\beta$ are known', in the article by Hosmer and Lemeshow. In their case you get $2g-1$ degrees of freedom and not $g-1$ degrees of freedom as in the $(R-1)(C-1)$ rule. That $(R-1)(C-1)$ rule is specifically for the null hypothesis that the row and column variables are independent (which creates $R+C-1$ constraints on the $o_i-e_i$ values). The Hosmer-Lemeshow test relates to the hypothesis that the cells are filled according to the probabilities of a logistic regression model based on *four* parameters in the case of distributional assumption A and $p+1$ parameters in the case of distributional assumption B.

Second, the case of a regression. A regression does something similar to the difference $o-e$ as the contingency table does, and reduces the dimensionality of the variation. There is a nice geometrical representation for this, as the value $y_i$ can be represented as the sum of a model term $\beta x_i$ and a residual (not error) term $\epsilon_i$. The model term and the residual term each live in subspaces that are perpendicular to each other. That means the residual terms $\epsilon_i$ cannot take just any possible value! They are reduced by the part that projects onto the model, and more particularly by one dimension for each parameter in the model.


Maybe the following images can help a bit

Below are 400 draws of three (uncorrelated) variables from the binomial distributions $B(n=60,p=1/6,\,2/6,\,3/6)$. They relate to normally distributed variables $N(\mu=np,\sigma^2=np(1-p))$. In the same image we draw the iso-surfaces for $\chi^2=1, 2, 6$. Integrating over this space using spherical coordinates, so that we only need a single integration over $\chi$ (because changing the angle does not change the density), results in $\int_0^a e^{-\frac{1}{2} \chi^2 }\chi^{d-1} d\chi$, in which the $\chi^{d-1}$ part represents the area of the $d$-dimensional sphere. If we were to limit the variables $\chi$ in some way, then the integration would not be over a $d$-dimensional sphere but over something of lower dimension.

[figure: graphical representation of $\chi^2$]
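A short sketch of what the figure shows: standardize the three binomial counts with their normal approximations and check that the squared 'distance' from the centre behaves like a $\chi^2$ variable with three degrees of freedom (the sample size and probabilities are those quoted above):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
n, p = 60, np.array([1/6, 2/6, 3/6])
counts = rng.binomial(n, p, size=(400, 3))   # 400 draws of the three binomial variables

# standardize with the normal approximation N(np, np(1-p))
z = (counts - n * p) / np.sqrt(n * p * (1 - p))
d2 = (z ** 2).sum(axis=1)                    # squared distance from the centre

# fraction of points inside the iso-surfaces chi^2 = 1, 2, 6 vs the chi^2(3) cdf
for a in (1, 2, 6):
    print(a, np.mean(d2 <= a), chi2.cdf(a, df=3))
```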

The image below can be used to get an idea of the dimensional reduction in the residual terms. It explains the least squares fitting method in geometric terms.

In blue you have measurements. In red you have what the model allows. The measurement is often not exactly equal to the model and has some deviation. You can regard this, geometrically, as the distance from the measured point to the red surface.

The red arrows $\mu_1$ and $\mu_2$ have values $(1,1,1)$ and $(0,1,2)$ and could be related to some linear model $x = a + b \cdot z + \epsilon$, or

$\begin{bmatrix}x_{1}\\x_{2}\\x_{3}\end{bmatrix} = a \begin{bmatrix}1\\1\\1\end{bmatrix} + b \begin{bmatrix}0\\1\\2\end{bmatrix} + \begin{bmatrix}\epsilon_1\\\epsilon_2\\\epsilon_3\end{bmatrix} $

so the span of those two vectors $(1,1,1)$ and $(0,1,2)$ (the red plane) is the set of values for $x$ that are possible in the regression model, and $\epsilon$ is the vector that is the difference between the observed value and the regression/modelled value. In the least squares method this vector is perpendicular (least distance means least sum of squares) to the red surface (and the modelled value is the projection of the observed value onto the red surface).

So this difference between the observed and the (modelled) expected values is a vector that is perpendicular to the model vectors (and this residual space has the dimension of the total space minus the number of model vectors).

In our simple example case the total dimension is 3. The model has 2 dimensions. And the error has dimension 1: no matter which of those blue points you take (the green arrows show a single example), the error terms always have the same ratio and follow a single vector.

[figure: graphical representation of the regression dimension reduction]
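The geometry of this example can be verified directly; the short sketch below uses the design columns $(1,1,1)$ and $(0,1,2)$ from above and an arbitrary observation vector:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])                   # columns (1,1,1) and (0,1,2): the red plane
x = np.array([2.0, 1.0, 4.0])                # an arbitrary observed vector (a blue point)

coef, *_ = np.linalg.lstsq(A, x, rcond=None) # least squares estimates of a and b
fitted = A @ coef                            # projection of x onto the red plane
resid = x - fitted                           # the residual vector (green arrow)

print(A.T @ resid)                 # ~ [0, 0]: the residual is perpendicular to both model vectors
print(resid / np.linalg.norm(resid))
# Whatever x is, the normalized residual always lies along the same single
# direction: 3 dimensions minus 2 model dimensions leaves 1 residual dimension.
```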


I hope this explanation helps. It is in no way a rigorous proof, and there are some algebraic details that still need to be worked out in these geometric representations. But anyway, I like these two geometrical representations: the one for Pearson's trick of integrating the $\chi^2$ using spherical coordinates, and the other for viewing the least squares method as a projection onto a plane (or larger span).

I am always amazed that we end up with $\frac{(o-e)^2}{e}$. This is, in my point of view, not trivial, since the normal approximation of a binomial involves a division not by $e$ but by $np(1-p)$. In the case of contingency tables you can work it out easily, but in the case of the regression or other linear restrictions it does not work out so easily, while the literature is often very casual in arguing that 'it works out the same for other linear restrictions'. (An interesting example of the problem: if you perform the following experiment many times, 'throw a coin 2 × 10 times and only register the cases in which the sum is 10', then you do not get the typical chi-square distribution for this "simple" linear restriction.)

Sextus Empiricus
  • In my honest opinion this answer has very nice figures and arguments that are related to the $\chi^2$ test, but it has not so much to do with the question, which is about the Hosmer-Lemeshow test for a logistic regression. You are arguing something with a regression where 1 parameter is estimated, but the Hosmer-Lemeshow test is about a logistic regression where $p>1$ parameters are estimated. See also https://stats.stackexchange.com/questions/296312/hosmer-lemeshow-1980-paper-theorem-2/296326?noredirect=1#comment563703_296326 –  Sep 07 '17 at 14:00
  • ... and, as you say, you end up with an $e$ in the denominator and not with $np(1-p)$, so this does not answer this question. Hence I have to downvote, sorry (but the graphs are very nice :-) ). –  Sep 07 '17 at 14:07
  • In a comment you asked "to understand the formula or at least the 'intuitive' explanation". So that is what you get with these geometrical interpretations. To calculate exactly how these $np(1-p)$ cancel out if you add both the positive and negative cases is far from intuitive and does not help you understand the dimensions. – Sextus Empiricus Sep 07 '17 at 14:37
  • In my answer I used the typical $(d - 1 - p)$ degrees of freedom and assumed that the regression was performed with one parameter (p=1), which was a mistake. The parameters in your references are two, a $\beta_0$ and $\beta$. These *two* parameters would have reduced the dimensionality to d-3 if only the proper conditions (efficient estimate) would have been met (see for instance again a nice article from Fisher 'The conditions under which the chi square measures the discrepancy between observation and hypothesis').... – Sextus Empiricus Sep 07 '17 at 14:57
  • ....anyway, I explained why we don't get dimension d-1 (and should instead expect something like d-3, if you put two parameters in the regression) and how the dimensional reduction by an efficient estimate can be imagined. It is the Moore-Spruill article that works out the extra terms (potentially increasing the effective degrees of freedom) due to that inefficiency and it is the Hosmer-Lemeshow simulation that shows that d-2 works best. That theoretical work is far from intuitive and the simulation is far from exact. My answer is just the requested explanation for the difference with d-1. – Sextus Empiricus Sep 07 '17 at 14:59
  • Your answer is about the $\chi^2$ test and not about the Hosmer-Lemeshow test. I think the best way to proceed is that you read the HL paper, and I am sure that you will conclude that what you write in your answer does not provide an answer to my question. –  Sep 07 '17 at 15:13
  • You have to help me a bit here. The HL-paper introduction reads: *"Several test statistics are proposed for the purpose of assessing the goodness of fit of the multiple logistic regression model. The test statistics are obtained **by applying the chi-square test** for a contingency table in which the expected frequencies are determined using two different grouping strategies and two different sets of distributional assumptions"* Yes that part after 'in which the...' is the HL-test part, but it is under the hood a chi-square test. – Sextus Empiricus Sep 07 '17 at 15:19
  • Ah, you have the paper. Can you write down what is under Theorem 2 and explain it? Because being a chi-square test 'under the hood' is far from saying that the statistic is chi-square distributed, isn't it? But we come closer: can you explain their Theorem 2 and use that to answer my question? –  Sep 07 '17 at 15:59
  • I am not sure what culprit you currently want to tackle. A) The issue why this statistic does not follow simply $g-1$ like a typical gx2 contingency table. B) Or, the issue with 'the usual reduction in degrees of freedom' (which leads to $\chi^2(2g-g-(p+1))$, g for the correlations between positive and negative cases, p+1 for the fitting parameters) C) Or the issue with this weighted sum of random variables term $\sum \lambda_i \chi_i^2(1)$ related to the fitted parameters becoming a $\chi^2(p-1)$ contribution. How this can be interpreted intuitively. – Sextus Empiricus Sep 07 '17 at 16:13
  • Well if I look at the votes for the question it seems that it is clear to many of the visitors. Maybe you can tackle them all ? But I think you can see when you look at that paper that your answer is not an answer to this question. –  Sep 07 '17 at 16:35

Hosmer D.W., Lemeshow S. (1980), 'A goodness-of-fit test for the multiple logistic regression model', Communications in Statistics, A10, 1043-1069, show that:

If the model is a logistic regression model and the $p$ parameters are estimated by maximum likelihood and the $G$ groups are defined on the estimated probabilities then it holds that $X^2$ is asymptotically $\chi^2(G-p-1)+\sum_{i=1}^{p+1} \lambda_i \chi_i^2(1)$ (Hosmer,Lemeshow, 1980, p.1052, Theorem 2).

(Note: the necessary conditions are not stated explicitly in Theorem 2 on page 1052, but if one reads the paper and the proof attentively, they become apparent.)

The second term $\sum_{i=1}^{p+1} \lambda_i \chi_i^2(1)$ results from the fact that the grouping is based on estimated - i.e. random - quantities (Hosmer, Lemeshow, 1980, p.1051).

Using simulations they showed that the second term can (in the cases used in the simulation) be approximated by a $\chi^2(p-1)$ (Hosmer, Lemeshow, 1980, p.1060).

Combining these two facts gives a sum of two $\chi^2$ variables, one with $G-p-1$ degrees of freedom and one with $p-1$ degrees of freedom, so that $X^2 \sim \chi^2\big((G-p-1)+(p-1)\big)=\chi^2(G-2)$.

So the answer to the question lies in the occurrence of the 'weighted chi-square term' or in the fact that the groups are defined using estimated probabilities that are themselves random variables.

See also Hosmer Lemeshow (1980) Paper - Theorem 2

  • 'So the answer to the question lies in the occurrence of the 'weighted chi-square term' *and* in the fact that the groups are defined using estimated probabilities that are themselves random variables.' **A**) The estimated probabilities mean that you get an *extra* reduction of p+1, which makes the main difference with the case of the contingency table (in which only g terms are estimated). **B**) The weighted chi-square term occurs as a correction because the estimate is not a likelihood estimate or equally efficient, and this makes the effect of the reduction *less* than the full (p+1). – Sextus Empiricus Sep 08 '17 at 08:01
  • @Martijn Weterings: Am I right to conclude that what you say in this comment is not exactly the same explanation (not to say completely different) as what you say in your answer? Does your comment lead to the conclusion that the df are $G-2$? –  Sep 08 '17 at 08:09
  • My answer explains the intuition behind the difference in degrees of freedom compared to the reasoning based on "the test-statistic for the $\chi^2$ test for this contingency table", it explains *why* they are different (case estimating fixed cells). It focuses on the 'usual reduction' from which you would conclude that the df would be G-3. However, certain conditions for the 'usual reduction' are not met. For this reason (random cells) you get the more complicated terms with the weighted chi-square term as a correction and you effectively end up with G-2. It is far from completely different. – Sextus Empiricus Sep 08 '17 at 08:17
  • @Martijn Weterings: sorry, but I can't upvote because I don't see any notion like 'random cells' in your answer at all. Do you mean that all your nice pictures (and I mean this, they are very nice) explain something about 'random cells', or did you come up with that notion after reading my answer? –  Sep 08 '17 at 08:40
  • Don't be sorry. I agree that my answer is not an exact answer to show exactly the degrees of freedom in the HL test. **I** am sorry for that. What you have is a Chernoff-Lehmann statistic (with also random cells) that follows a $\sum_{i=1}^{k-s-1} \chi^2(1) + \sum_{i=k-s}^{k-1} \lambda_i \chi_i^2(1) $ distribution. It is currently unclear to me what part is troubling you; I hope you can be more constructive in this. If you want it all explained, you already have the articles for that. My answer just tackled the $\sum_{i=1}^{k-s-1} \chi^2(1)$ part, explaining the main difference with the contingency table test. – Sextus Empiricus Sep 08 '17 at 09:58
  • @Martijn Weterings: if you can develop this 'Chernoff-Lehmann' statistic in the context of the HLT test, i.e. for a logistic regression with $p$ parameters estimated and the deciles defined on the predicted probabilities, giving the $G-2$ degrees of freedom, then I will accept your answer because I am really interested in that. –  Sep 08 '17 at 13:14