First of all, thank you for such a warm welcome to this forum. I am doing a study to see what diagnoses are the most common in a specific group (1) compared to another group (2). I want to show that there is a statistical difference in certain diagnoses between these two groups. I plan on doing this in SPSS but I don't know how I should arrange the data in order to perform the chi-2 test.

To clarify, the picture below is just an example of how the data look. For example, the group 1 column (column 2) in row A shows how many patients in group 1 received diagnosis A; the group 2 column (column 3) in row C shows how many patients in group 2 received diagnosis C.

Does anyone know whether a chi-2 test is appropriate here, and how I should organize the data in order to perform it? I've been trying to follow instructions, but I do not get a result that looks right.

I should add also that I have more than 200 different diagnoses but only 2 groups (as in the example). The two groups are also very different in size.

enter image description here

EDIT: I thought maybe I could do something like they did in this study and this specific table:

enter image description here

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6019239/

kjetil b halvorsen
    You could do a chi-square test, but it is probably not a good analysis of your data. A contingency table as large as yours almost certainly has substructure, such as related diagnoses clustering together. Maybe look into visualization first, see https://stats.stackexchange.com/questions/147721/which-is-the-best-visualization-for-contingency-tables and tell us more about your specific objectives. – kjetil b halvorsen Nov 18 '20 at 20:34

1 Answer

To analyze the data properly, it's crucial to know whether the counts are based on entirely independent observations. Can the same person have multiple diagnosis codes? If so, then unless you have access to the individual-level data, statistical inferences that combine over diagnosis codes are not possible without unrealistic independence assumptions.

If you have individual-level data indicating whether each person does or does not have each diagnosis, then the CTABLES (Custom Tables) procedure in SPSS can handle it as what it calls multiple response data. That allows a chi-square test of association between group and diagnoses, as well as comparisons among groups for each diagnosis, with methods that take the dependence of the diagnoses into account. The data would be set up with a case for each person and a variable for each diagnosis, typically coded 1 if that diagnosis applies and 0 otherwise. The menus under Analyze>Tables will let you set up multiple response sets for this type of data and perform the analyses.
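To make the layout concrete, here is a minimal sketch in plain Python (not SPSS syntax) of one case per person with a 0/1 variable per diagnosis. The data, the variable names (`dx_A`, `dx_B`, `dx_C`) and the helper functions are made up for illustration. It tabulates a single diagnosis by group and computes an ordinary Pearson chi-square statistic for that one 2x2 table, without a continuity correction; it is not a reimplementation of what CTABLES does for multiple response sets.

```python
# Hypothetical toy data: one case per person, one 0/1 variable per diagnosis.
people = [
    {"id": 1, "group": 1, "dx_A": 1, "dx_B": 0, "dx_C": 0},
    {"id": 2, "group": 1, "dx_A": 1, "dx_B": 1, "dx_C": 0},
    {"id": 3, "group": 1, "dx_A": 0, "dx_B": 0, "dx_C": 1},
    {"id": 4, "group": 2, "dx_A": 0, "dx_B": 1, "dx_C": 1},
    {"id": 5, "group": 2, "dx_A": 0, "dx_B": 0, "dx_C": 1},
    {"id": 6, "group": 2, "dx_A": 1, "dx_B": 0, "dx_C": 0},
]


def two_by_two(rows, diagnosis):
    """Return {group: [n_without, n_with]} for a single diagnosis."""
    table = {}
    for row in rows:
        counts = table.setdefault(row["group"], [0, 0])
        counts[row[diagnosis]] += 1  # index 0 = without, 1 = with
    return table


def pearson_chi2(table):
    """Pearson chi-square statistic (no continuity correction) for a table of counts."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    stat = 0.0
    for i, r in enumerate(rows):
        for j, observed in enumerate(r):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty margins instead of dividing by zero
                stat += (observed - expected) ** 2 / expected
    return stat


table_a = two_by_two(people, "dx_A")
print(table_a)
print(pearson_chi2(table_a))
```

Each person contributes exactly once to a per-diagnosis table like this, so the independence issue arises when diagnoses are pooled into a single table, not within one 2x2 table. With more than 200 diagnoses, testing each one separately would also raise a multiple-comparisons question.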

Another option is a more complicated modeling approach using generalized estimating equations (GEE) or generalized linear mixed models (GLMM), both of which would require setting the data up with one case per subject per diagnosis, with a subject identifier variable, a diagnosis variable, and a binary variable indicating whether or not the subject has that diagnosis.
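The one-case-per-subject-per-diagnosis layout can be sketched the same way. The snippet below, again plain Python with made-up data and hypothetical names (`to_long`, `has_dx`), reshapes a wide table into that long form. It only illustrates the data structure; fitting a GEE or GLMM would still be done in SPSS or other statistical software.

```python
# Hypothetical wide data: one case per person, one 0/1 variable per diagnosis.
wide = [
    {"id": 1, "group": 1, "dx_A": 1, "dx_B": 0},
    {"id": 2, "group": 2, "dx_A": 0, "dx_B": 1},
]


def to_long(rows, id_key="id", group_key="group"):
    """Return one record per subject per diagnosis."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name in (id_key, group_key):
                continue  # identifiers are carried along, not stacked
            long_rows.append({
                "id": row[id_key],        # subject identifier
                "group": row[group_key],  # group membership
                "diagnosis": name,        # which diagnosis this record refers to
                "has_dx": value,          # 1 if the subject has it, 0 otherwise
            })
    return long_rows


long_rows = to_long(wide)
for record in long_rows:
    print(record)
```

Here two subjects and two diagnoses give four records. The subject identifier is what lets a GEE or mixed model treat the records from the same person as correlated.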

David Nichols