
We know that chi-square can be used with categorical data (such as Male/Female, Republican/Democrat, etc.). However, I converted my original continuous data to categorical variables and then ran a chi-square analysis on them. Is this approach correct? It was pointed out to me, at "continuous vs categorical logistic regression for marks and admission", that using such an approach with logistic regression may be flawed. I am not sure whether chi-square is affected in the same way.

The actual scenario is given below (it can also be seen at the link mentioned above):

I have a list of marks scored by students in Science (X, between 0 and 100%) and whether they went to college or not (Y).
High marks in science showed a higher concentration of college admits and low marks had the second best hit rate (students went for Arts degree, etc). Scores in the intermediate range had a lower hit rate. Most students have lower scores in Science.

I divided my sample into 5 bins: 0-10, 10-20, 20-80, 80-90, 90-100, and found the chi-square test to be significant. My main question is: is categorizing such marks into bins correct?
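For concreteness, the procedure described above (bin the marks, cross-tabulate against admission, compute Pearson's chi-square for independence) can be sketched roughly as follows. This is only an illustration using the standard library: the function names are hypothetical, the counts in `table` are made up and are not the real data, and treating the bins as left-closed (with 100 falling in the last bin) is an assumption, since the bin edges as stated overlap.

```python
from bisect import bisect_right

# Bin edges from the question. Treating each bin as left-closed [lo, hi),
# with 100 included in the last bin, is an assumption.
EDGES = [0, 10, 20, 80, 90, 100]


def bin_index(mark, edges=EDGES):
    """Return the 0-based bin for a mark in [edges[0], edges[-1]]."""
    if not edges[0] <= mark <= edges[-1]:
        raise ValueError(f"mark {mark} outside {edges[0]}-{edges[-1]}")
    # bisect_right puts a mark equal to an inner edge into the upper bin;
    # min() keeps the top edge (100) inside the last bin.
    return min(bisect_right(edges, mark) - 1, len(edges) - 2)


def contingency_table(marks, went_to_college, edges=EDGES):
    """Count [did not go, went] to college within each bin."""
    table = [[0, 0] for _ in range(len(edges) - 1)]
    for mark, admitted in zip(marks, went_to_college):
        table[bin_index(mark, edges)][int(bool(admitted))] += 1
    return table


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table.

    Assumes every row and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Made-up counts per bin, [did not go, went]; for illustration only.
table = [[30, 20], [40, 10], [60, 10], [5, 10], [2, 13]]
stat, dof = chi_square_independence(table)
print(f"chi2 = {stat:.2f}, dof = {dof}")
```

With 5 bins and a yes/no outcome there are 4 degrees of freedom, so the statistic would be compared with the 5% critical value of about 9.49. Moving the cutoffs changes the table and can change the result, which is part of what I am unsure about.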

Maddy
  • Because the $\chi^2$ distribution and associated tests show up in so many different applications, please tell us *how* you plan to use it. It would also be helpful to know how you selected the cutoffs for binning your continuous variable, because that often affects the test. – whuber Apr 30 '14 at 21:48
  • It is rarely correct to convert continuous data to categorical data. It loses information. – Peter Flom Apr 30 '14 at 21:51
  • @PeterFlom Conventionally speaking you are right. But in my distribution, most of the universe is between say 0-5 but most of the events are between -5 to 0 bin. Chi-Square helps me show that there is an association between these bins and events. Again, I'm not sure if categorizing them was a good decision to begin with. Should I simply drop the chi-square analysis? – Maddy Apr 30 '14 at 22:07
  • Maybe quantile regression is something to look at in your case. – Aksakal Apr 30 '14 at 22:13
  • @Peter True, but there is a distinction between "correct" (the word used in the question) and "useful" (which is what I think is really being asked). When you use a less powerful procedure (*i.e.*, one that "loses information") and it still indicates a significant difference (as suggested, but not stated, in the question), then you have what you need to draw a conclusion. Perhaps of more import at that point is how one would measure the size of the effect: using irregularly spaced bins--especially ones adapted to the data, as here--can create a real challenge in that regard. – whuber Apr 30 '14 at 22:21
  • Is the dependent variable dichotomous (wants to go to college, doesn't want to), ordinal, or what? You might want to look at splines, regardless. – Peter Flom May 01 '14 at 01:58
  • @PeterFlom I have found some examples where continuous data was binned to create categorical data (http://www.stat.yale.edu/Courses/1997-98/101/chigf.htm). My point is I think Chi-Square is used to compare predicted vs observed distribution. Continuous data can be binned to obtain predicted & observed values. – Maddy May 01 '14 at 16:54
  • It certainly *can* be, but should it be? – Peter Flom May 01 '14 at 16:57
  • @PeterFlom As long as `can be` approach is correct, I should be fine. My final decision is not based on Chi-square analysis but instead on Logistic regression. Chi-square is used just to begin the analysis by showing bin levels and events have some sort of relationship. I'm aware that I can skip it totally and jump directly to LR, but I'd prefer to keep this approach. Thanks. – Maddy May 01 '14 at 18:39

0 Answers