
Is it a well-known fact in basic econometrics that dividing a sample based on the value of Y (the dependent variable) creates a number of problems?

My dependent variable is a financial rating score, ranging from 0 to 17, and I would like to split my sample into 1) a safer group with scores greater than 10 and 2) a riskier group with scores less than 10. I believe this partitioning doesn't create any econometric problems. However, I've been told that it is common knowledge that dividing a sample based on the value of Y (not X, the independent variable) creates a number of difficulties, but I can't make sense of it.

Would you please kindly explain to me:

  1. Whether it creates any problems.
  2. If so, what econometric problems do we face?
  3. Any references to support the arguments?
Hyoung Lim
  • For an example see http://stats.stackexchange.com/questions/237503/degrees-of-freedom-of-the-hosmer-lemeshow-test-statistic-g-or-g-2 –  Sep 30 '16 at 08:41

2 Answers


Your question is quite broad and vague, but here is what you should consider: This depends entirely on how you are calculating the dependent score $Y$. If $Y(X)$ is a rule that you had even before you got your sample of financial scores, e.g. $Y = X^2$ or $Y = aX+b$ where you already know $a$ and $b$, then dividing based on the value of $Y$ is the same as dividing based on the value of $X$. It's the same information, just by a different name.
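To see this concretely, here is a small sketch (in Python rather than R, purely for illustration; the coefficients are made up): when $Y$ is a fixed, monotone, deterministic rule of $X$, splitting on $Y$ selects exactly the same observations as splitting on $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

# Y is a fixed, known rule of X (no fitting, no noise term):
a, b = 0.5, 0.1                  # hypothetical known coefficients
y = a * x + b

# Splitting on y above/below its median selects exactly the same
# observations as splitting on x above/below its median, because
# y is a monotone deterministic function of x.
split_y = y > np.median(y)
split_x = x > np.median(x)
assert (split_y == split_x).all()  # identical partitions
```

The assertion holds for any monotone increasing rule, which is why the split carries the same information under either name.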

On the other hand, if you take your data sample and perform a statistical analysis to create a way of calculating $Y$, e.g. by performing a fit using your data and then using the predicted values from the fit, you could have problems. This is the situation described in @fcop's answer, where the specific case of logistic regression is considered, and perhaps this is the context in which you've been warned of possible danger.

Here are two small simulations contrasting the two scenarios, although it's doubtful that they apply directly to your case. In the first, you have an independent variable $x$ and a dependent variable $y$ which is either true or false. Previous research has shown that the probability that $y$ is true is $\pi = 1/(1+\exp(-[0.1+0.5x]))$. Assuming this is exactly true, the following code simulates 10,000 cross-checks where this prediction is applied to new data and checked against the known values $y$, at each step calculating the Hosmer-Lemeshow statistic. The result follows a chi-square distribution with 10 degrees of freedom -- one for each partition used in the test:

library(ResourceSelection)  # provides hoslem.test
n <- 500
chisqs <- numeric(10000)
for (i in 1:10000) {
    x <- rnorm(n)
    p <- plogis(0.1 + 0.5*x)  # known model -- no fitting involved
    y <- rbinom(n, 1, p)
    chisqs[i] <- hoslem.test(y, p)$statistic
}
h <- hist(chisqs, 50)
xs <- seq(0, 25, 0.01)
# overlay the chi-square(10) density, scaled to the histogram counts
lines(xs, dchisq(xs, df = 10) * length(chisqs) * diff(h$breaks)[1])

[histogram of the simulated HL statistics with the scaled $\chi^2_{10}$ density overlaid]

On the other hand, if you use the values of $y$ to make a fit and create the predictions, you'll clearly do better because you've peeked at the answers. In this case the degrees of freedom in the distribution of the HL statistic are 8:

library(ResourceSelection)
n <- 500
chisqs <- numeric(10000)
for (i in 1:10000) {
    x <- rnorm(n)
    p <- plogis(0.1 + 0.5*x)
    y <- rbinom(n, 1, p)
    fit <- glm(y ~ x, family = binomial)  # fit to the same data
    chisqs[i] <- hoslem.test(y, fitted(fit))$statistic
}
h <- hist(chisqs, 50)
xs <- seq(0, 25, 0.01)
lines(xs, dchisq(xs, df = 8) * length(chisqs) * diff(h$breaks)[1])

[histogram of the simulated HL statistics with the scaled $\chi^2_{8}$ density overlaid]

This illustrates one scenario in which something like the danger you've described occurs. You'll need to decide what category your case falls into.

jwimberley
  • I start with an example on linear regression in my answer; the point is not whether it is logistic regression or something else. The point is that when $y$ is random (e.g. because of $\epsilon$), you partition on a random outcome, so the partition is also random –  Sep 26 '16 at 05:37
  • @fcop But you don't partition on a random outcome in the HL test. You partition on quantiles of $\hat y = a + bx$, which is fixed given $x$. – jwimberley Sep 26 '16 at 15:04
  • I think you should write $\hat{y}=\hat{a}+\hat{b}x$ where $\hat{a}$ and $\hat{b}$ are estimated from a sample. If you draw another sample you will get other values for $\hat{a}$ and $\hat{b}$ or these quantities are random. To say it in other words: given $x$, the value of $\hat{y}$ is still random because $\hat{a}$ and $\hat{b}$ are... I think it's time to give the reference to the HL paper where you found that the degrees of freedom are different from $g-2$. In your comments to my answer you say that you have read such a paper, please give the reference –  Sep 26 '16 at 20:25
  • @fcop I don't recall if this is discussed in any of HL's papers, but it doesn't matter because it's a basic principle. You are laser-focused on performing the HL goodness of fit test to a logistic regression, studying the relationship between `y` and `x` in a dataset where the outcome `y` is known. You're ignoring that the entire point of the regression is to make *predictions* in new datasets where `x'` is known but `y` is not. My point, demonstrated above, is that the effects you describe are relevant only to the GOF test (irrelevant to the question) and not when making predictions. – jwimberley Sep 27 '16 at 18:00
  • @fcop Regarding the use of the hat and that y, a, and b are random: You're right, and my notation wasn't clear. But they are random in different senses. `a` and `b` have true, but unknown, fixed values, while y is a random variable sampled many times. You could say that `a` and `b` are random in a Bayesian sense and `y` in a frequentist sense. I did gloss over this. You can modify my code to fit for a and b and make predictions at each loop iteration, and you'll get a distribution with neither $g-2$ nor $g$ d.o.f. but instead somewhat more than $g$, depending on the precision of the fits. – jwimberley Sep 27 '16 at 18:07
  • @jwimberley: you don't seem to remember the reference of that paper, do you? You dance around an easy question: you say you read a paper, so just give me the reference (and please don't say I am rude because I ask it more than once). And your simulations are nice R code but don't have anything to do with the question. I am sorry to have to say that. –  Sep 27 '16 at 18:33
  • @jwimberley: just as an example, in a linear regression, under 'usual' assumptions, the estimated coefficients are (normal) random variables, also in a 'frequentist' framework. So if you partition on $\hat{y}$ then you partition on a random outcome, so your partition will be random. If you think that is not the case then I think we fundamentally disagree and discussing makes no sense. Nevertheless I am still interested in the reference on HL, so please stop being mysterious about that and give it. –  Sep 27 '16 at 19:07
  • @fcop I'm confused; I never said that I read this in a paper. I might have read about this in Agresti, possibly, but it doesn't matter because you can verify it yourself. Performing the HL test to a cross-validation sample isn't much discussed in the literature - I brought it up only to make the point that the # d.o.f. difference is due to overfitting in the training, not the nature of predictions. This is the same as how the MSE in linear regression is different between the training sample and a validation sample (hence adjusted R2). I am frustrated by this conversation and I'm done with it. – jwimberley Sep 27 '16 at 19:51
  • if you really want to know where the $g-2$ degrees of freedom come from then there is only one way to find out: take the proof given by Hosmer and Lemeshow and analyse it to see where the $g-2$ comes in. I did that and I strongly advise you to do the same, because then you will see that it is due to the fact that the partition is based on the predicted probabilities. All other things, like your explanation with training sets, only exist in your own mind; this is what you confirm in your previous comment: no paper on that. So the only way to –  Sep 28 '16 at 03:08
  • find out is to read the HL paper. If you are convinced that the proof by HL should lead to other degrees of freedom then my proposal is that you yourself write a paper on that; you seem to be convinced about your idea, so publish it. I fear that your dream may not come true, but there is nothing wrong in going on dreaming. However, I don't think it is a good idea to base your own research on an unconfirmed creation that exists only in your own dream world. –  Sep 28 '16 at 03:12
  • So look at the Hosmer-Lemeshow paper and discover a whole new world where they say that the distribution of their test statistic is $\chi^2(g-p-1)+\sum_{i=1}^p \lambda_i \chi^2(1)$, where the first term is similar to a Pearson chi-square goodness of fit and the second term is a consequence of using predicted probabilities that are random variables. Using simulations they show that the second term can be approximated by $\chi^2(p-1)$. Then the $g-2$ df come from $g-p-1+p-1$, which is obviously $g-2$ –  Sep 28 '16 at 03:21
  • You seem to believe that, for one reason or another (not in a published paper), $g-p-1+p-1$ depends on whether you use a training sample or not. Well, everyone is free to believe what he/she wants, but without further arguments I think it is equal to $g-2$, and I also think that one should stick to what Hosmer and Lemeshow said and have proven. –  Sep 28 '16 at 03:30
  • It would be good to insert the simulations from our other discussion. It will show the effect of partitioning on y –  Sep 30 '16 at 14:18

If your regression is like e.g. $y=\beta_0 + \beta_1 x + \epsilon$ where $\epsilon$ has the usual assumptions, then you can partition on $x$ because it is known, but $y$ is a random quantity (because of the error term $\epsilon$) so if you partition on $y$ it could be that sometimes a subject changes class because of randomness (i.e. the realisation of the random $\epsilon$).

The model $y=\beta_0 + \beta_1 x + \epsilon$ tells you that $y$ is a random quantity; that is, for a given value of $x$ you know the distribution of $y$: $y|_x \sim N(\beta_0 + \beta_1 x, \sigma^2)$, so $y$ can change from one experiment to another (because $y$ 'is a distribution, not a value'). If you partition on a value that can change from one experiment to another, obviously your partition also becomes random.
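A quick simulation makes the point (sketched in Python for illustration; the 'true' coefficients are invented): redraw $\epsilon$ for the same fixed $x$ values, and some subjects switch groups when you partition on $y$, while a partition on $x$ never changes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
beta0, beta1, sigma = 0.0, 1.0, 1.0   # invented 'true' model

# two independent experiments: same x, fresh draws of epsilon
y1 = beta0 + beta1 * x + rng.normal(scale=sigma, size=n)
y2 = beta0 + beta1 * x + rng.normal(scale=sigma, size=n)

# partitioning on x is deterministic: identical in both experiments
changed_x = ((x > 0) != (x > 0)).sum()    # always 0

# partitioning on y is random: subjects near the cutoff switch groups
changed_y = ((y1 > 0) != (y2 > 0)).sum()  # almost surely > 0 here
```

Here the cutoff 0 plays the role of the score threshold: the subjects whose class membership flips are exactly those whose realisation of $\epsilon$ pushed them across it.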

An example of the consequences can be seen when analysing the Hosmer-Lemeshow test for a logistic regression model. This is a $\chi^2$-test, but if you look at its degrees of freedom you will see that they are unusual. This is because, to compute the test statistic, you define 10 groups based on the predicted (and thus random) probabilities from a logistic regression model.
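The grouping step that makes the partition prediction-based can be sketched as follows (Python, for illustration only; `hosmer_lemeshow` is my own minimal version, not a library function):

```python
import numpy as np

def hosmer_lemeshow(y, p, g=10):
    """Minimal sketch of the Hosmer-Lemeshow statistic.

    The g groups are deciles of the *predicted* probabilities p --
    so the partition itself inherits whatever randomness p has.
    """
    order = np.argsort(p)
    stat = 0.0
    for idx in np.array_split(order, g):  # g near-equal-size groups
        obs = y[idx].sum()                # observed events in the group
        exp = p[idx].sum()                # expected events in the group
        n_k = len(idx)
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n_k))
    return stat

# toy check with known (non-fitted) probabilities
rng = np.random.default_rng(0)
p = rng.uniform(0.1, 0.9, size=500)
y = rng.binomial(1, p)
stat = hosmer_lemeshow(y, p)              # a positive chi-square-like value
```

If `p` were instead fitted values $\hat{\pi}$ from a regression on this same sample, the group boundaries would vary from sample to sample, which is where the non-standard degrees of freedom come from.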

Other examples can be found in Greene, Econometric Analysis, where the author analyses the consequences of choice based sampling for logistic regression.

  • Regarding the Hosmer-Lemeshow test: what do you mean that the degrees of freedom in the test are unusual? For $g$ groups the d.o.f. are $g-2$, but this has to do with the 2 parameters (intercept and slope) fixed to the data in the logistic regression. If you perform the HL test of the fit to a different dataset, for cross-validation, the d.o.f. are $g$. And if $x$ is 1D, quantiles of $y$ are identical to quantiles of $x$. – jwimberley Sep 25 '16 at 12:19
  • @jwimberley: I don't agree, and I strongly advise you to read the paper of Hosmer and Lemeshow –  Sep 25 '16 at 15:21
  • I have read the paper, and several others by those authors. I've just run a quick R simulation confirming my above claims: `library(ResourceSelections); n – jwimberley Sep 25 '16 at 15:39
  • To elaborate, the problem is not that *predicted* random probabilities are used in the grouping but instead that when the regression is applied to the dataset used to produce it, it is expected to be "overfit" and the probabilities are slightly *better* than random. – jwimberley Sep 25 '16 at 15:44
  • @jwimberley: you really should read the paper that Hosmer and Lemeshow wrote, because what you say is simply wrong. Have you read the paper? –  Sep 25 '16 at 17:38
  • I answered that question already. Perhaps my original comment was unclear; if you tell me how you interpret it I can clarify. If you haven't already please take a look at the R simulation I provided: it conclusively shows that the decrease in the number of degrees of freedom from $g$ to $g-2$ is true *only* on the dataset to which the regression is performed. Thus the fact that $y$ is a predicted variable is not the deciding factor in the number of degrees of freedom. – jwimberley Sep 25 '16 at 18:34
  • @jwimberley: well if you have read the paper then you know that the test statistic is $\chi^2(g-p-1) +\sum_{i=1}^p \lambda_i \chi^2(1)$, and then they show using simulations that $\sum_i \lambda_i \chi^2(1)$ is approximately $\chi^2(p-1)$, and most importantly $\sum_i \lambda_i \chi^2(1)$ comes from the fact that groups are defined using predicted probabilities. You really should read it –  Sep 25 '16 at 19:16
  • @jwimberley: and in your simulation (I hope you explain that simulation in detail in an answer) you have nothing that might come close to *predicted* probabilities; for a prediction you should first estimate something. It is not because you compute p with a logistic function that they are predicted. –  Sep 25 '16 at 19:35
  • I very well might be mistaken but please do not be rude to me -- I told you that I read the paper after you asked the first of three times. And after your most recent comment, I think we are using the words "predicted" in different senses. As I stated, my simulation models a case where the regression has been performed on a *different* sample (yielding intercept 0.1 and slope 0.5) and is being used to make predictions on a *new* sample. I would have called the values of $y(x)$ for the fit sample retrodictions. I admit my usage may be non-standard and the confusion on my part. – jwimberley Sep 25 '16 at 20:39
  • @jwimberley I told you where exactly in the paper the $g-2$ can be found. In the book Applied Logistic Regression, by the same authors (the PDF can be found on the internet), they say the degrees of freedom are $g-2$; you can check that. So could you please give me a reference where Hosmer and Lemeshow state that it is not $g-2$, and under what conditions? And I am not rude, I am just asking you to read that paper. I have it, and if you think that it's not due to the predicted probabilities then you must have read a different paper, so please give the reference of that one –  Sep 26 '16 at 03:42
  • see also http://stats.stackexchange.com/questions/237503/degrees-of-freedom-of-the-hosmer-lemeshow-test-statistic-g-or-g-2 –  Sep 30 '16 at 08:41