Comparing categorized data

Question

I am quite rusty on my stats beyond standard deviation and linear regression, so I am not even sure about how to phrase this question.

I am looking at a long history of credit card qualification data, so I know how much money has been processed as rewards cards, AMEX, World cards, etc. I have this data for each month.

Now I need a way to test if one month of data is unusual, statistically speaking.

For example, in January, 21% of the volume was in Rewards, and in February, only 19% was in Rewards, and so on through at least a year. I want to know if the volume in Rewards was unprecedented the next January.

Can anyone provide a name of a test I can look at to get an answer?

Are you just looking to see if some piece of data is a maximum? — Emily, Jul 15 '13 at 21:12
No. I have thought about this a little bit since asking, and have one method but I don't think it's going to get me anywhere useful. I found the average and standard deviation of each proportion. That is, I took the rewards volumes for each month as a percentage of the whole, and got a series of percentages: 21%, 19%, 21%, 15%, 17% ... and so on. I found an average of 19.92% with a standard deviation of 3.21%. With the new set of data I am trying to compare, I have 19.75%, which has a z-score of -0.0529. Summing the Z-scores might help me figure this out, assuming I can find a critical point — , Jul 15 '13 at 21:25
Excel and Python. I am not afraid of coding a solution, as long as I have the right test to code. — , Jul 15 '13 at 21:36
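
The z-score calculation described in the comment above can be sketched in Python. This uses only the summary figures quoted there (mean 19.92%, standard deviation 3.21%, new value 19.75%). The helper names `z_score` and `two_sided_p` are hypothetical, and the p-value assumes the monthly shares are roughly normal, which the thread does not establish.

```python
import math

def z_score(value, mean, std):
    """Standardized distance of a value from the historical mean."""
    return (value - mean) / std

def two_sided_p(z):
    """Two-sided normal tail probability for a z-score (normality assumed)."""
    return math.erfc(abs(z) / math.sqrt(2))

# Figures quoted in the comment: historical mean, standard deviation, new month.
z = z_score(19.75, 19.92, 3.21)
p = two_sided_p(z)
print(f"z = {z:.4f}, two-sided p = {p:.3f}")
```

A |z| this close to zero puts the new month well inside the historical spread.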

score 2 · Answer 1 · answered Aug 12 '14 at 04:58

Ok, some clarification first. It seems that you are only concerned about rewards being disproportionately high rather than the actual volume being high, right?

This is exactly what McNemar's test (http://en.wikipedia.org/wiki/McNemar's_test) can tell you. All you would have to do is to enter your data into a contingency table, calculate the test statistic and find the p-value. It should be fairly straightforward once you go to that Wikipedia page.
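
As a rough sketch of the contingency-table step, the test statistic and p-value can be computed by hand in Python. McNemar's test is defined for paired observations, so this example assumes the same accounts can be classified as Rewards or not in both months. That pairing, the counts, and the helper name `mcnemar` are all assumptions for illustration, not something given in the question.

```python
import math

def mcnemar(b, c):
    """McNemar chi-square statistic (no continuity correction) and p-value.

    b and c are the two discordant counts of a paired 2x2 table.
    """
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical paired table: the same accounts classified in two months.
#                   month 2: Rewards   month 2: other
# month 1: Rewards       a = 180           b = 25
# month 1: other         c = 40            d = 755
stat, p_value = mcnemar(b=25, c=40)
print(f"chi2 = {stat:.3f}, p = {p_value:.3f}")
```

Only the discordant cells b and c enter the statistic; the concordant cells a and d are shown for context.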

If you want to ask that question for each type of card separately, you are actually interested in volumes rather than fractions. Let's consider this problem for only one type of card; the others are analogous.

If you have data from multiple years for that month, it might not be unreasonable to assume that the monthly volumes are normally distributed with the same standard deviation (strictly they are not, because volumes cannot be negative, but it may still be a good approximation) and to test whether their means are the same. You could do that with a likelihood ratio test.
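
A minimal sketch of such a likelihood ratio test, assuming two normal samples with a common variance. The answer does not spell out which two groups to compare, so `earlier` and `later` are placeholder samples with made-up volumes (in thousands), and `lrt_equal_means` is a hypothetical helper. The p-value uses the asymptotic chi-square approximation, which is only rough for samples this small.

```python
import math

def lrt_equal_means(x, y):
    """Likelihood ratio test of equal means for two normal samples
    sharing one variance. Returns (statistic, asymptotic p-value)."""
    n = len(x) + len(y)
    pooled = list(x) + list(y)
    grand_mean = sum(pooled) / n
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Maximum likelihood variance under H0 (one common mean).
    var0 = sum((v - grand_mean) ** 2 for v in pooled) / n
    # Maximum likelihood variance under H1 (separate means).
    var1 = (sum((v - mean_x) ** 2 for v in x)
            + sum((v - mean_y) ** 2 for v in y)) / n
    # -2 log(likelihood ratio); clamp tiny negative rounding errors to zero.
    stat = max(n * math.log(var0 / var1), 0.0)
    # Asymptotically chi-square with 1 degree of freedom under H0.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Made-up monthly volumes (thousands) for two groups of years.
earlier = [50.1, 52.3, 49.8, 51.0, 50.6]
later = [53.9, 55.2, 54.1, 56.0, 54.8]
stat, p_value = lrt_equal_means(earlier, later)
print(f"LR statistic = {stat:.2f}, p = {p_value:.5f}")
```

Under these assumptions the test is equivalent to a two-sample t-test with pooled variance, which may be the more convenient form in Excel.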

If these assumptions are not reasonable given your data, you can use the Kolmogorov-Smirnov test. It is not as straightforward as the previous one, but it does not assume a particular form for the distribution.
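
For illustration, the two-sample Kolmogorov-Smirnov statistic is simple to compute directly: it is the largest gap between the two empirical distribution functions. The samples below are made up, and `ks_statistic` is a hypothetical helper. It returns only the statistic; if SciPy is available, `scipy.stats.ks_2samp` also gives a p-value.

```python
def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical distribution functions."""
    xs, ys = sorted(x), sorted(y)
    points = sorted(set(xs) | set(ys))
    d = 0.0
    for p in points:
        # Empirical CDFs evaluated at p (fraction of values <= p).
        fx = sum(v <= p for v in xs) / len(xs)
        fy = sum(v <= p for v in ys) / len(ys)
        d = max(d, abs(fx - fy))
    return d

# Made-up monthly volumes (thousands) for two groups of years.
earlier = [50.1, 52.3, 49.8, 51.0, 50.6]
later = [51.5, 53.9, 50.9, 52.8, 54.1]
d = ks_statistic(earlier, later)
print(f"KS statistic D = {d:.2f}")
```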

Hopefully that helps! Good luck! :)

score 0 · Answer 2 · answered Jul 16 '13 at 09:10

You could use OLS to answer this. Suppose we have only 2 months $t\in\{1,2\}$ and three kinds of cards $c\in\{1,2,3\}$. In the following specification, you estimate the volume of each kind of card for each month via interactions. Hence, $$Volume_{t,c}=\beta_0+\beta_1 Card2_{t,c}+\beta_2 Card3_{t,c}+\beta_4 Card1_{t,c}\times Month2_{t,c}+\beta_5 Card2_{t,c}\times Month2_{t,c}+\beta_6 Card3_{t,c}\times Month2_{t,c}+e_{t,c}.$$ All regressors are dummy variables. After estimating the coefficients, the volume for Card 1 in month 1 is just $\beta_0$, the volume for Card 2 in month 1 is $\beta_0+\beta_1$, the volume for Card 3 in month 2 is $\beta_0+\beta_2+\beta_6$, etc.
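
To make the coefficient interpretation concrete, here is a small arithmetic sketch. Because the specification has one coefficient per (card, month) cell, the OLS estimates are simply differences between cell volumes, so no regression routine is needed to see the pattern. The Card 1 and Card 2 month-1 figures echo the comment thread below this answer; every other number is made up, and the dictionary keys follow the answer's coefficient numbering.

```python
# Volume by (card, month). Only the (1, 1) and (2, 1) values come from the
# comment thread; the remaining figures are hypothetical.
volume = {
    (1, 1): 52_000, (2, 1): 20_000, (3, 1): 15_000,
    (1, 2): 55_000, (2, 2): 18_500, (3, 2): 16_200,
}

# Saturated dummy regression: each coefficient is a difference of cells.
beta = {
    0: volume[(1, 1)],                   # Card 1, month 1 (baseline)
    1: volume[(2, 1)] - volume[(1, 1)],  # Card 2 vs Card 1, month 1
    2: volume[(3, 1)] - volume[(1, 1)],  # Card 3 vs Card 1, month 1
    4: volume[(1, 2)] - volume[(1, 1)],  # Card 1, month 2 vs month 1
    5: volume[(2, 2)] - volume[(2, 1)],  # Card 2, month 2 vs month 1
    6: volume[(3, 2)] - volume[(3, 1)],  # Card 3, month 2 vs month 1
}

# Rebuild two cells from the coefficients, as described in the text.
card2_month1 = beta[0] + beta[1]
card3_month2 = beta[0] + beta[2] + beta[6]
print(beta[1], card2_month1, card3_month2)  # -32000 20000 16200
```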

If you want to test whether there are differences between the two months, say whether the volume due to Card 2 differs, then you test if $$\beta_0+\beta_1=\beta_0+\beta_1+\beta_5\Leftrightarrow \beta_5=0.$$ This is just a simple t-test for the coefficient. (Testing whether overall volume differs between the two months is also possible by testing a linear combination of coefficients, but a t-test on the means of the two months, for example, would be simpler and faster.) Similarly, you can test whether some coefficient (or a combination of coefficients) differs from some value, e.g., the average volume of that card.
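
A sketch of the t-statistic for $\beta_5$, under an assumption the answer does not state: a t-test needs residual degrees of freedom, so each (card, month) cell must hold several observations (for example, the same calendar month across several years). The data are made up (in thousands), and `cell_difference_t_stat` is a hypothetical helper. In the saturated dummy model the estimate of $\beta_5$ is the difference between two cell means, and its standard error uses the residual variance pooled over all cells.

```python
import math

def cell_difference_t_stat(cells, base, other):
    """t-statistic for mean(other) - mean(base) in a saturated dummy
    regression, using the residual variance pooled over all cells."""
    n_total = sum(len(obs) for obs in cells.values())
    means = {k: sum(obs) / len(obs) for k, obs in cells.items()}
    # Residual sum of squares: deviations from each cell's own mean.
    sse = sum((v - means[k]) ** 2 for k, obs in cells.items() for v in obs)
    dof = n_total - len(cells)  # observations minus estimated coefficients
    s2 = sse / dof
    diff = means[other] - means[base]  # the OLS estimate of the coefficient
    se = math.sqrt(s2 * (1 / len(cells[other]) + 1 / len(cells[base])))
    return diff / se, dof

# Hypothetical replicated volumes (thousands), keyed by (card, month).
cells = {
    (1, 1): [52.0, 51.0, 53.0], (1, 2): [55.0, 54.0, 56.0],
    (2, 1): [20.0, 21.0, 19.0], (2, 2): [18.5, 17.5, 19.5],
    (3, 1): [15.0, 14.0, 16.0], (3, 2): [16.2, 15.2, 17.2],
}
# beta_5: Card 2 in month 2 relative to Card 2 in month 1.
t_stat, dof = cell_difference_t_stat(cells, base=(2, 1), other=(2, 2))
print(f"t = {t_stat:.3f} on {dof} degrees of freedom")
```

The statistic is then compared against a t distribution with `dof` degrees of freedom, for instance with `scipy.stats.t.sf` if SciPy is available.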

Admittedly, this can become tedious if you have many months and cards. If you are only interested in 2 kinds of cards, you can subsume the rest in an "other" dummy and the number of coefficients to be estimated reduces greatly.

Finally, if you want the share of volume instead of volume, just divide the dependent variable by the overall volume that month ($\sum_c Volume_c$), and the coefficients are then expressed as shares of that month's volume rather than as volumes.

I keep looking at this, and frankly I don't see it. (As I said, I'm really rusty.) If $B_0 = 52,000$, being the volume of card one in month one, then if the volume of card two in month one is 20,000 does that make $B_1 = 20,000$ or $B_1 = -32,000$? — Josh English, Jul 16 '13 at 18:13
$\beta_1=-32000$. All coefficients except for the intercept ($=\beta_0$) are relative to card 1, month 1. Hence, $\beta_1$ is the difference between card 2, month 1 and card 1, month 1. This post has a more detailed explanation: http://stats.stackexchange.com/questions/60595/expected-value-from-a-regression-table/63173#63173 — Nameless, Jul 16 '13 at 18:49
