I have two datasets to compare. Each is a list of billed amounts by diagnosis code. The two differ in that some billed amounts may be assigned to different diagnosis codes. Approximately 32 different diagnosis codes were used. What type of statistical analysis is most appropriate for comparing these two datasets, and why?

I was told to run a two-way ANOVA. Is that correct?

  • I don't understand the statement, "the diagnosis codes may be different for some of the billed amounts". Can you clarify what that means? – gung - Reinstate Monica Sep 02 '14 at 18:48
  • One dataset was generated by summing the billed amounts for each diagnosis. Then the logic for assigning a diagnosis was changed and the next dataset was generated by summing the billed amounts for each diagnosis (using the new logic). So some of the money may have shifted buckets (changed from one diagnosis to another) but the total amount billed over all diagnoses did not change. – Jennifer Sep 02 '14 at 18:55
  • What conclusion (or what type of conclusion) do you want to draw from your statistical analysis? Are you interested in the relative frequency of the diagnostic codes in the two datasets, in the mean amount billed by diagnostic code or overall, or something else? – Joel W. Sep 04 '14 at 14:44

1 Answer

When modeling financial data, one should often use the gamma distribution as the error distribution in a generalized linear model instead of the normal distribution. You could also try a log transformation to normalize the billed amounts and apply ANOVA as usual thereafter. See also "In linear regression, when is it appropriate to use the log of an independent variable instead of the actual values?"
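As a rough illustration of the log-transformation idea, the sketch below compares the skewness of some made-up billed amounts before and after taking logs. The `billed` values and the `skewness` helper are assumptions for illustration only, not real data or a library function, and the transform only works when every amount is strictly positive.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical billed amounts (made up for illustration; all strictly positive).
billed = [120.0, 150.0, 180.0, 210.0, 260.0, 340.0, 520.0, 900.0, 2400.0, 7800.0]

# Natural log of each amount; math.log raises ValueError for zero or negative input.
log_billed = [math.log(v) for v in billed]

raw_skew = skewness(billed)
log_skew = skewness(log_billed)

print(f"skewness of raw amounts: {raw_skew:.2f}")
print(f"skewness of log amounts: {log_skew:.2f}")
```

If the logged amounts are still clearly skewed, that is one hint that a gamma GLM may be worth trying instead of ANOVA on the transformed values.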

If your two datasets use the same diagnosis codes and differ in some other way, you could control for the effect of diagnosis code when estimating the difference between datasets using a two-way ANOVA or generalized linear model. You could also test whether the effect of diagnosis code differs between your two datasets by including an interaction term. Be cautious in interpreting your results if your datasets are unbalanced or violate other assumptions of the analysis.
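To make the two-way layout concrete, here is a minimal hand-rolled sketch of the sums of squares for a balanced design with an interaction term. It assumes several individual (log) billed amounts per dataset-by-code cell; the `cells` values, the labels `old`/`new` and `A`/`B`, and the function name `two_way_anova` are all hypothetical. It only computes F statistics, so p-values would still need an F distribution from a statistics package.

```python
from statistics import fmean

# Hypothetical balanced data: cells[(dataset, code)] holds log billed amounts.
cells = {
    ("old", "A"): [5.1, 5.4, 5.0],
    ("old", "B"): [6.2, 6.0, 6.5],
    ("new", "A"): [5.3, 5.6, 5.2],
    ("new", "B"): [6.9, 7.1, 6.8],
}


def two_way_anova(cells):
    """Sums of squares, degrees of freedom and F statistics (balanced design only)."""
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    na, nb = len(a_levels), len(b_levels)
    n = len(next(iter(cells.values())))  # replicates per cell
    if len(cells) != na * nb or any(len(v) != n for v in cells.values()):
        raise ValueError("every dataset-by-code cell needs the same number of values")
    if n < 2:
        raise ValueError("need at least two values per cell to estimate error")

    grand = fmean([x for v in cells.values() for x in v])
    a_mean = {a: fmean([x for b in b_levels for x in cells[(a, b)]]) for a in a_levels}
    b_mean = {b: fmean([x for a in a_levels for x in cells[(a, b)]]) for b in b_levels}
    cell_mean = {k: fmean(v) for k, v in cells.items()}

    ss = {
        "dataset": nb * n * sum((a_mean[a] - grand) ** 2 for a in a_levels),
        "code": na * n * sum((b_mean[b] - grand) ** 2 for b in b_levels),
        "interaction": n * sum(
            (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
            for a in a_levels
            for b in b_levels
        ),
        "error": sum((x - cell_mean[k]) ** 2 for k, v in cells.items() for x in v),
    }
    df = {
        "dataset": na - 1,
        "code": nb - 1,
        "interaction": (na - 1) * (nb - 1),
        "error": na * nb * (n - 1),
    }
    ms_error = ss["error"] / df["error"]
    f_stat = {k: (ss[k] / df[k]) / ms_error for k in ("dataset", "code", "interaction")}
    ss_total = sum((x - grand) ** 2 for v in cells.values() for x in v)
    return {"ss": ss, "df": df, "f": f_stat, "ss_total": ss_total}


result = two_way_anova(cells)
for term, f_value in result["f"].items():
    print(f"{term}: F = {f_value:.2f} on {result['df'][term]} and {result['df']['error']} df")
```

The sketch refuses unbalanced input on purpose: with unequal cell sizes the sums of squares no longer partition this simply, which is part of why unbalanced data call for caution.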

Nick Stauner
  • The point of the exercise is to compare the billed amounts per diagnosis of each of the datasets, to determine how large the differences between the billed amounts are for each diagnosis code – Jennifer Sep 02 '14 at 19:56
  • Okay...that's not how I understood your question. Now it sounds like you want to compare your two datasets 32 times (once per code). There are many ways to do that, and the choice will depend on whether you're trying to infer anything about broader populations to which your datasets belong, and whether you want to test hypotheses. – Nick Stauner Sep 02 '14 at 20:42
  • The idea is to see how different the distributions of the dollar amounts per diagnosis codes are under the first diagnosis logic vs the second diagnosis logic. There is no inference for the broader population. – Jennifer Sep 02 '14 at 21:01
  • That seems simple enough. You might try just superimposing color-coded histograms / kernel density plots separately for each diagnosis code. Maybe mark the means and quartiles on each distribution too. – Nick Stauner Sep 02 '14 at 21:39
  • There is no silver bullet of a statistical technique that will allow you to make sound comparisons between two things that, as you've demonstrated, are not comparable. – rolando2 Sep 03 '14 at 00:14
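For the purely descriptive comparison discussed in the comments (how much money moved between diagnosis codes when the assignment logic changed), a simple per-code difference may be enough. The sketch below assumes each dataset has already been summed by diagnosis code; the codes `D01` to `D04`, the amounts, and the helper `shifts_by_code` are made up for illustration.

```python
# Hypothetical summed billed amounts per diagnosis code (made-up codes and values).
old_logic = {"D01": 1200.0, "D02": 800.0, "D03": 500.0}
new_logic = {"D01": 1000.0, "D02": 850.0, "D03": 500.0, "D04": 150.0}


def shifts_by_code(old, new):
    """Return new minus old for every code seen in either dataset (missing = 0)."""
    codes = sorted(set(old) | set(new))
    return {code: new.get(code, 0.0) - old.get(code, 0.0) for code in codes}


shifts = shifts_by_code(old_logic, new_logic)

# Because the overall total is unchanged, gains and losses should cancel out.
net_change = sum(shifts.values())

# Total dollars that ended up under a different code (sum of the gains).
total_moved = sum(delta for delta in shifts.values() if delta > 0)

# List codes from the largest absolute shift to the smallest.
for code, delta in sorted(shifts.items(), key=lambda item: -abs(item[1])):
    print(f"{code}: {delta:+.2f}")
print(f"net change: {net_change:.2f}, total moved: {total_moved:.2f}")
```

A net change that is not close to zero would suggest that the two datasets do not cover the same billed amounts after all.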