$\chi^2$-test for good calibration of a rating system; degrees of freedom

Question

Assume that I have a rating system with $R$ rating classes, the classes have a probability of default $p_{i0}, i=1,2..,R$. The probability of default is the probability that a company in that class has a payment problem within one year.

I have a set of companies that I have to partition over these classes, using one or other model.

Let's say that I classified $N$ companies last year and each rating class contains $N_i$ companies where $\sum_{i=1}^R N_i=N$. Today, one year later, I count the observed number of defaults in each class and I find $d_i$.

I want to test whether my system is well calibrated, i.e. whether these observed numbers of defaults are in line with my a priori fixed probabilities $p_{i0}$.

I define my test statistic as $X^2=\sum_{i=1}^R \frac{(d_i-N_ip_{i0})^2}{N_i p_{i0}}$. If the expected cell counts $N_ip_{i0}$ are not too low and if the defaults in different rating classes are independent, then this test statistic is approximately $\chi^2$-distributed. But what is the number of degrees of freedom: is it $R$ (because I have a sum of $R$ squared standard normal random variables that are independent) or is it $R-1$, as in Pearson's $\chi^2$ test? If the latter is the case, where do I lose one degree of freedom? (A reference is fine also.)

Wouldn't it be easier to simulate it? You have a model of the data generating process (DGP). You have what you think is a realization of this DGP. Use the probabilities that were fixed a priori to draw a random sample of the same size as the one that you're interested in. Compute the statistic. And repeat this 1,000,000 times. Is your statistic measuring the discrepancy between the model and the realization larger or smaller than the threshold value beyond which you say that it isn't well-calibrated? — , May 23 '17 at 12:11
I believe my account of the chi-squared test at https://stats.stackexchange.com/a/17148/919 completely answers your question. It also provides a standard reference (Kendall & Stuart). — whuber, May 24 '17 at 18:14
@whuber: It is a very good answer and well written, but I don't see in that answer whether it is $R$ or $R-1$ degrees of freedom ? — , May 25 '17 at 08:34
If I may be so bold as to quote from my own post, and substitute your numbers and variable names for the names used there, it reads "I know in advance that the sum of the counts must equal $N$. That's one relationship. I estimated zero parameters from the data. That's zero additional relationships, giving $0+1=1$ total relationships. Presuming they (the parameters) are all (functionally) independent, that leaves only $R-0-1=R-1$ (functionally) independent 'degrees of freedom': that's the value to use for $\nu$." It remains to you to verify the conditions I enumerated after that passage. — whuber, May 25 '17 at 21:38
@whuber: the sum of the counts is not known in advance, I think? In each class you have counted $d_i$ defaults and each $d_i$ is an outcome of a Binomial random variable $D_i\sim Bin(p_{i0},N_i)$, so I think that the sum $\sum d_{i}$ is random and not known? I don't think that this should equal $N$? — , Jun 01 '17 at 18:15
I cannot see where your question specifies the counts are not known. Could you please state that explicitly? As it currently reads, you tell us there are $N_i$ companies in each rating class and $d_i$ corresponding defaults: all counts appear to be given and known. *Of course* the sum of the $d_i$ is a random variable, because all the $d_i$ are: that's a basic part of any contingency or frequency table situation like this one. — whuber, Jun 01 '17 at 18:28
@whuber: well, let's put it another way: you say that I have one functional relationship; you say that you know in advance that the sum of the counts equals $N$. Which counts do you mean? Because the counts in the $\chi^2$ test are definitely the $d_i$: the statistic is the sum of 'observed counts' ($d_i$) minus 'expected counts' ($N_ip_{i0}$), squared, divided by the expected counts. So the counts are the $d_i$, or aren't they? So the counts are known, but that does not imply a functional relationship, or do I miss something here? — , Jun 01 '17 at 19:19
I think I see what you're getting at. Let me remark that since you appear to have $R$ independent Binomial observations, then provided all $p_i$ are small, you may view each term $(d_i-N_ip_i)/\sqrt{N_ip_i}$ as a standardized value--a "z score"--and then your statistic looks like a sum of squares of $R$ independent z-scores: it ought to have a $\chi^2(R)$ distribution, approximately. These considerations also suggest modifying your statistic to be the sum of $(d_i-N_ip_i)^2/(N_ip_i(1-p_i))$, which looks more appropriate for this Binomial model, since the variance of $d_i$ is $N_ip_i(1-p_i)$. — whuber, Jun 01 '17 at 20:25
@whuber: well I think we are getting closer, I fully agree with what you say, but my question was about the degrees of freedom, so if $X^2=\sum_i ^R\frac{(d_i - N_i p_i)^2}{N_i p_i (1-p_i)}$ has a $\chi^2$ distribution, then the degrees of freedom is $R$ not $R-1$ as you said before ? Is that right ? — , Jun 01 '17 at 20:48

score 0 · Accepted Answer · answered May 24 '17 at 15:47

This is an elaboration on the comment made earlier.

If I've understood the problem correctly, you have a model and you want to know whether it is a good model of reality. The model is a model of company defaults. There are $R$ different kinds or classes of companies. Class $i$ consists of $N_i$ companies, and each company in class $i$ has a probability $p_i$ of defaulting within a year. You know the number of companies in each class and you have a guess as to what $p_i$ is. Asking whether it's a good model amounts to asking whether these probabilities are good guesses.

Now, to measure whether these guesses are good, you quantify the discrepancy between the model and reality: for each class you take the squared difference between what was expected under the model and what actually happened, divide it by what was expected under the model, and sum over the classes. This statistic approximately follows a $\chi^2$ distribution. You don't know, however, which one.
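
A rough moment check hints at which one. It assumes the independent binomial model $D_i\sim \mathrm{Bin}(N_i,p_i)$ discussed in the comments, so that $\operatorname{Var}(D_i)=N_ip_i(1-p_i)$:

$$E\left[X^2\right]=\sum_{i=1}^R \frac{\operatorname{Var}(D_i)}{N_ip_i}=\sum_{i=1}^R \frac{N_ip_i(1-p_i)}{N_ip_i}=R-\sum_{i=1}^R p_i .$$

A $\chi^2(\nu)$ variable has mean $\nu$, so for small $p_i$ the mean of the statistic sits close to $R$ rather than $R-1$. This only matches the first moment and is not a proof of the limiting distribution.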

What I am suggesting is to treat your model as the null hypothesis. Assume that the model is true and that for some reason you've managed to guess the right probabilities. What would the distribution of this statistic look like if the model were true? To find out you:

1. Run $R$ binomial experiments, one per class, with $N_i$ trials and probability of success $p_i$, $i=1,2,\dots,R$. These are your classes and defaults;
2. For each class separately, take the simulated number of successes as the default count $d_i$; the expected count $N_i p_i$ in the statistic stays as given by the model;
3. Compute the statistic by comparing this simulation to the model and store it;
4. Repeat steps 1 - 3 a large number of times, e.g. 1,000,000;
5. See what proportion of these simulated statistics are larger than the statistic you've computed by comparing your model to reality. If this proportion is smaller than, say, 5% or some other threshold value, then you have a statistically significant result.
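
The five steps could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the portfolio (`sizes`, `probs`, `observed`) is made up, the function names are hypothetical, and the number of repetitions is kept small so that the pure standard-library version runs quickly.

```python
import random

def calibration_statistic(defaults, sizes, probs):
    """Sum over classes of (observed - expected)^2 / expected."""
    return sum((d - n * p) ** 2 / (n * p)
               for d, n, p in zip(defaults, sizes, probs))

def simulate_defaults(sizes, probs, rng):
    """Steps 1-2: one binomial draw per class, built from Bernoulli trials."""
    return [sum(rng.random() < p for _ in range(n))
            for n, p in zip(sizes, probs)]

def simulated_p_value(observed, sizes, probs, n_sim=2000, seed=0):
    rng = random.Random(seed)
    obs_stat = calibration_statistic(observed, sizes, probs)
    # Steps 3-4: compute and store the statistic for each simulated portfolio.
    sim_stats = [
        calibration_statistic(simulate_defaults(sizes, probs, rng), sizes, probs)
        for _ in range(n_sim)
    ]
    # Step 5: share of simulated statistics at least as large as the observed one.
    p_value = sum(s >= obs_stat for s in sim_stats) / n_sim
    return obs_stat, sim_stats, p_value

# Hypothetical portfolio with three rating classes (not from the question).
sizes = [500, 300, 200]        # N_i
probs = [0.01, 0.03, 0.08]     # p_i fixed a priori
observed = [7, 8, 20]          # d_i counted after one year

obs_stat, sim_stats, p_value = simulated_p_value(observed, sizes, probs)
print(f"observed statistic: {obs_stat:.3f}")
print(f"simulated p-value:  {p_value:.3f}")
print(f"mean of simulated statistics: {sum(sim_stats) / len(sim_stats):.3f}")
```

Because the reference distribution is built directly from the model, no choice of degrees of freedom is needed for the test itself. Comparing the stored `sim_stats` with the quantiles of $\chi^2(R)$ and $\chi^2(R-1)$ is one way to see empirically which of the two fits better.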

On a side note: why are you using this statistic? Why not quantify how large the losses are from acting on this model as if it were true, or how much money is lost through defaults, and do steps 1-5 for that? This would seem more meaningful to me, as this is what you're trying to avoid, correct?
