What summary statistics to use with categorical or qualitative variables?

Question

Just to clarify, by summary statistics I mean the mean, median, quartile ranges, variance, and standard deviation.

When summarising a single variable that is categorical or qualitative, considering both the nominal and ordinal cases, does it make sense to find its mean, median, quartile ranges, variance, and standard deviation?

If so, is it different from summarising a continuous variable, and how?

I barely see any difference between categorical and qualitative variables, except in terminology. Anyway, it would be very difficult to compute anything like a mean or SD on a nominal variable (e.g., hair color). Maybe you are thinking of categorical variables with ordered levels? — chl, Jul 23 '12 at 07:57
Nope, if the categorical data has an order or ranked levels they are said to be Ordinal according to this website: [http://www.stats.gla.ac.uk/steps/glossary/presenting_data.html#orddat], and it says "You can count and order, but not measure, ordinal data" — chutsu, Jul 23 '12 at 09:02

score 8 · Answer 1 · answered Jul 23 '12 at 11:14

In general, the answer is no. However, one could argue that you can take the median of ordinal data, but you will, of course, have a category as the median, not a number. The median divides the data equally: Half above, half below. Ordinal data depends only on order.

Further, in some cases, ordinal data can be turned into rough interval-level data. This is true when the ordinal categories are groupings of an underlying numeric variable (e.g. questions about income are often asked this way). In this case, you can find a precise median, and you may be able to approximate the other values, especially if the lower and upper bounds are specified: you can assume some distribution (e.g. uniform) within each category. Another case of ordinal data that can be made interval is when the levels are given numeric equivalents, for example: never (0%), sometimes (10-30%), about half the time (50%), and so on.
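To make the grouped case concrete, here is a minimal sketch in Python of an interpolated median. The income brackets and counts are hypothetical (they are not from this thread), and the function name `grouped_median` is made up for illustration; the only statistical assumption is the one mentioned above, a uniform distribution within each bracket.

```python
def grouped_median(brackets):
    """Interpolated median for grouped data.

    `brackets` is a list of (lower, upper, count) tuples sorted by lower bound.
    Assumes observations are spread uniformly inside each bracket.
    """
    total = sum(count for _, _, count in brackets)
    if total == 0:
        raise ValueError("no observations")
    half = total / 2
    cumulative = 0
    for lower, upper, count in brackets:
        if count > 0 and cumulative + count >= half:
            # Fraction of this bracket needed to reach the halfway point.
            fraction = (half - cumulative) / count
            return lower + fraction * (upper - lower)
        cumulative += count
    raise ValueError("brackets did not cover half of the observations")


# Hypothetical survey answers: (lower bound, upper bound, respondents).
income = [
    (0, 20_000, 10),
    (20_000, 40_000, 25),
    (40_000, 60_000, 10),
    (60_000, 100_000, 5),
]
median_income = grouped_median(income)
print(median_income)
```

If the top bracket is open-ended ("100,000 or more"), the median can still be found as long as it falls in a closed bracket, but a mean or SD would need an extra assumption about the upper bound.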

To (once again) quote David Cox:

There are no routine statistical questions, only questionable statistical routines

You provide good related information, but I think that in response to chl's question the OP made it clear that he is talking about categorical data that is not ordinal. So your response is really not an answer, but I am not one who would give a downvote. I do think you should change it to a comment. — Michael R. Chernick, Jul 23 '12 at 11:34
No, I won't downvote the answer as I do think it has added some value to my limited understanding. I should have made it clear in my description that I am considering both Ordinal and Nominal Summary statistics, so the fault is mine. — chutsu, Jul 23 '12 at 12:41

score 5 · Answer 2 · answered Sep 10 '15 at 22:00

As has been mentioned, means, SDs and hinge points are not meaningful for categorical data. Hinge points (e.g., median and quartiles) may be meaningful for ordinal data. Your title also asks what summary statistics should be used to describe categorical data. It is standard to characterize categorical data by counts and percentages. (You may also want to include a 95% confidence interval around the percentages.) For example, if your data were:

"Hispanic"         "Hispanic"        "White"             "White"            
"White"            "White"           "African American"  "Hispanic"        
"White"            "White"           "White"             "other" 
"White"            "White"           "White"             "African American"
"Asian"

You could summarize them like so:

White             10 (59%)
African American   2 (12%)
Hispanic           3 (18%)
Asian              1 ( 6%)
other              1 ( 6%)
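As a rough sketch, the table above (plus the optional confidence intervals) could be produced like this in Python. The answer does not say which interval method to use; the Wilson score interval here is an assumption chosen because it behaves reasonably for small counts, and `wilson_interval` is a helper name made up for this example.

```python
from collections import Counter
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# The 17 observations from the example above.
data = [
    "Hispanic", "Hispanic", "White", "White",
    "White", "White", "African American", "Hispanic",
    "White", "White", "White", "other",
    "White", "White", "White", "African American",
    "Asian",
]

counts = Counter(data)
n = len(data)
summary = {}
for category, count in counts.most_common():
    low, high = wilson_interval(count, n)
    summary[category] = (count, count / n, low, high)
    print(f"{category:<17} {count:>2} ({count / n:4.0%})  95% CI [{low:.0%}, {high:.0%}]")
```

With only 17 observations the intervals are wide, which is itself useful information to report alongside the percentages.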

score 3 · Answer 3 · Michael R. Chernick

If you have nominal variables, there is no ordering or distance function. So how could you define any of the summary statistics that you mention? I don't think you can. Quartiles and range at least require ordering, and means and variance require numerical data. I think bar graphs and pie charts are typical examples of the proper ways to summarize qualitative variables that are not ordinal.

answered Jul 23 '12 at 11:09 · edited Jul 25 '12 at 15:42

I agree about charts, but not about your recommendation of pie charts. Cleveland dot plots are much superior. I have a presentation on this coming up at NESUG: [Graphics for univariate data: Pie is delicious but not nutritious](http://www.statisticalanalysisconsulting.com/graphics-for-univariate-data-pie-is-delicious-but-not-nutritious/) – Peter Flom Jul 23 '12 at 11:18
@PeterFlom My point was not to list all the possible graphical procedures for summarizing qualitative data. I really want to emphasize that it is proportions that can be compared, and the way the proportions are distributed across the categories. For visually recognizing differences in proportions I think bar charts are easier to read than pie charts, but they are just two popular ways to summarize categorical data. I don't want to say they are the best, as I am not familiar with all the available methods. – Michael R. Chernick Jul 23 '12 at 11:25
They are certainly popular! But I think it's part of our responsibility, as experts in the field, to make pie charts *less* popular. – Peter Flom Jul 23 '12 at 11:31
@PeterFlom. I don't know. What is your big objection to pie charts? They never did anything bad to me. – Michael R. Chernick Jul 23 '12 at 20:21
Pie charts distort the data; with large numbers of categories they are unreadable. People are bad at perceiving angles. See books by William S. Cleveland (he also has a web site). – Peter Flom Jul 23 '12 at 22:03
@Peter Flom I am familiar with some of Bill Cleveland's work and books but not with all this negativity about pie charts. Can you give me a good argument against them? We all know that there are many presentations of bad graphics and there is an art to presenting good graphics. Tufte has been excellent at showing the bad and demonstrating ways to improve it. Huff's classic "How to Lie with Statistics" shows the many unethical ways subtle or not so subtle manipulation can change the story that a graph conveys. But do we stop showing linear trends because we can lie by manipulating scales? – Michael R. Chernick Jul 24 '12 at 10:03
Cleveland showed, first, that people are worse at perceiving angular measurement than linear distance. Second, that changing the colors in a pie chart changed people's perceptions of the size of the slices. Third, that rotating the pie chart changed people's perceptions of the size of the slices. Fourth that people had trouble ordering the slices from largest to smallest unless they were very different sized. Cleveland dot plots avoid all these. – Peter Flom Jul 24 '12 at 10:21
I love Tukey's stem-and-leaf diagrams. They convey more information than histograms. But I am not going to say eliminate histograms. You argue that people display poor pie charts. Then I think the answer is to teach the proper way to do pie charts. Maybe sometimes very small categories should be aggregated. Other times there may be good reasons to show the very small categories (not to compare the very thin slices but rather to see how many small ones there are and how much smaller they are compared to the big categories). – Michael R. Chernick Jul 24 '12 at 10:23
With 4 equal categories, some people may find it a little easier to see the equality in a pie chart with 4 right angles, or with 45 degree angles when there are 8 equal categories. Bar charts may not convey this as easily. So while I agree that some people are bad at perceiving angles (other than 180, 90 or 45 degree angles) and some people make bad charts, I do not see that as justification to ban pie charts. – Michael R. Chernick Jul 24 '12 at 10:28
We have had some discussions around pie charts on this thread: [Problems with pie charts](http://stats.stackexchange.com/q/8974/930). – chl Jul 24 '12 at 12:12
Well @MichaelChernick you are certainly free to not see it that way, but the world's experts in graphical perception disagree with you. – Peter Flom Jul 24 '12 at 18:36
Gee Peter, instead of just standing behind Bill Cleveland, why don't you just give me a cogent argument for abolishing them? I can appreciate the argument about difficulty perceiving angle differences and about people's tendency to make poor ones, but where is your counter to my argument? Do Cleveland, Tufte and others go so far as to outright state that pie charts should never be used? That, it seems to me, would be more controversial than what I am saying. I wonder what John Tukey would have said about abandoning pie charts just because they can be misused. Maybe all of statistics should be banned. – Michael R. Chernick Jul 24 '12 at 18:51
There probably isn't a single technique or graphical/exploratory method that hasn't been misused by someone! – Michael R. Chernick Jul 24 '12 at 18:52
@Michael "A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them ... pie charts should never be used."--Tufte. "Data that can be shown by pie charts always can be shown by a dot chart. ... in the 1920's a battle raged on the pages of *JASA* about the relative merits of pie charts and divided bar charts ... both camps lose because other graphs perform far better than either divided bar charts or pie charts."--Cleveland. As you know, Cleveland is not prescriptive: this is as strong as he gets about anything. – whuber Jul 24 '12 at 19:43
BTW, @Michael, I do agree with you and the arguments you are making in this thread (which I find convincing and well presented), but as a moderator I have to convey strong objections voiced by community members concerning the "tone of voice" you are adopting. Please follow the site's etiquette: stick to the subject and don't attack others. Don't even write stuff that might sound like an attack, even in jest. Of course the same admonition extends to everybody. – whuber Jul 24 '12 at 19:46
@whuber. I am sorry. I do not mean to be disrespectful of Peter Flom. I know him from both here and the consulting statisticians site on the ASA website. I respect his knowledge of statistics and I wasn't arguing about the content of his comments. But I did challenge his overwhelmingly strong conclusions and his evasiveness with respect to my arguments. That gets frustrating, and I wanted to hear whether he had a valid argument or could concede that, even if pie charts give distortion too often, that is not enough justification to say they should be banned. – Michael R. Chernick Jul 24 '12 at 20:52
I really do think that if you are going to use the logic that misuse of a method means it should never be used when speaking of pie charts, then it is not a big stretch to apply that to all of statistics. Of course I know he wouldn't accept that. – Michael R. Chernick Jul 24 '12 at 20:54
@Michael, I agree with you - I can't personally imagine a situation where I'd want to use a pie chart but I've yet to see a rational argument for saying that no one should ever use one. But I think, for future discussions of this type, it may suffice to get your opinion on the record and let the community decide theirs - that's something whuber and chl have also encouraged me to do, as I also have an argumentative streak in me, which I'm sure you know ;) Congrats on the 10k rep, btw. I hope you'll have a look at 'tools' - you can now see deleted posts and lots of cool lists. I enjoy them. – Macro Jul 24 '12 at 22:04
Thanks so much @Macro. Huber is a saint. I can try to be that way but it is hard. – Michael R. Chernick Jul 24 '12 at 22:32

score 2 · Answer 4 · answered Oct 09 '15 at 13:40

Mode still works! Is that not an important summary statistic? (What's the most common category?) I think the median suggestion has little to no value as a statistic, but the mode does.

Also count distinct would be valuable. (How many categories do you have?)

You might create ratios, like (most common category) / (least common category) or (#1 most common category) / (#2 most common category). Also (most common category) / (all other categories), like the 80/20 rule.
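A short Python sketch of the mode, distinct count and ratios suggested above. It reuses the category counts from the example in the earlier answer in this thread; the variable names are made up for illustration.

```python
from collections import Counter

# Same 17 observations as the example in the earlier answer.
data = (
    ["White"] * 10
    + ["Hispanic"] * 3
    + ["African American"] * 2
    + ["Asian", "other"]
)

counts = Counter(data)
ranked = counts.most_common()       # categories sorted by frequency, descending

# Mode: the most common category (ties are broken by first appearance here).
mode, mode_count = ranked[0]

# Count distinct: how many categories are present.
distinct = len(counts)

# Ratios between category frequencies.
ratio_most_to_least = mode_count / ranked[-1][1]
ratio_first_to_second = mode_count / ranked[1][1]
ratio_mode_to_rest = mode_count / (len(data) - mode_count)

print(mode, distinct, ratio_most_to_least, ratio_first_to_second, ratio_mode_to_rest)
```

When several categories tie for the highest count, the mode is not unique, so it is safer to report all tied categories than to rely on the tie-breaking order.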

You can also assign numbers to your categories and go nuts with all the usual statistics. AA=1, Hisp=2, etc. Now you can compute mean, median, mode, SD, etc.

mapto · Answer 5 · 2018-08-04T11:10:53.527

I do appreciate the other answers, but it seems to me that some topological background would give a much-needed structure to the responses.

Definitions

Let's start with establishing the definitions of the domains:

A categorical variable is one whose domain contains elements with no known relationship between them (so all we have are categories). Examples depend on context, but in the general case it is difficult to compare days of the week: is Monday before Sunday, and if so, what about the following Monday? An easier, if less common, example is items of clothing: without some context that makes sense of an order, it is difficult to say whether trousers come before jumpers or vice versa.
An ordinal variable is one that has a total order defined over its domain, i.e. for any two elements of the domain, we can tell whether they are identical or one is bigger than the other. A Likert scale is a good example of an ordinal variable: "somewhat agree" is definitely closer to "strongly agree" than "disagree" is.
An interval variable is one whose domain defines distances between elements (a metric), thus allowing us to define intervals.

Domain examples

The sets we use most often, the natural and real numbers, come with a standard total order and a standard metric. This is why we need to be careful when we assign numbers to our categories. If we are not careful to disregard order and distance, we effectively convert our categorical data into interval data. When one uses a machine learning algorithm without knowing how it works, one risks making such assumptions unwittingly, thus potentially invalidating one's own results. For example, most popular deep learning algorithms work with real numbers and take advantage of their interval and continuous properties. As another example, think of 5-point Likert scales, and how the analyses we apply to them assume that the distance between "strongly agree" and "agree" is the same as the distance between "disagree" and "neither agree nor disagree". It is hard to make a case for such a relationship.
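A small Python illustration of that risk, using the clothing categories from the definitions above. Integer codes place the categories on a number line and so invent an order and unequal distances. One-hot coding is shown as one common way to avoid this; it is my illustration, not something the answer prescribes, and the helper `pairwise_distances` is a made-up name.

```python
from math import dist

categories = ["trousers", "jumper", "socks"]

# Integer codes: each category becomes a point on the number line.
integer_code = {name: [float(i)] for i, name in enumerate(categories)}

# One-hot codes: each category gets its own axis.
one_hot = {
    name: [1.0 if i == j else 0.0 for j in range(len(categories))]
    for i, name in enumerate(categories)
}


def pairwise_distances(encoding):
    """Euclidean distance between every pair of encoded categories."""
    names = list(encoding)
    return {
        (a, b): dist(encoding[a], encoding[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


integer_distances = pairwise_distances(integer_code)
one_hot_distances = pairwise_distances(one_hot)

print(integer_distances)   # trousers is "twice as far" from socks as from jumper
print(one_hot_distances)   # every pair is equally far apart
```

Any algorithm that uses distances or arithmetic on the integer codes will treat the arbitrary coding order as if it were real structure in the data.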

Another set that we often work with is strings. There are a number of string similarity metrics that come in handy when working with strings. However, these are not always useful. For example, for addresses, John Smith Street and John Smith Road are quite close in terms of string similarity, but obviously represent two different entities that could be miles apart.
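The address example can be tried with Python's standard library. `difflib.SequenceMatcher` is used here only as a convenient stand-in for a string similarity measure (the answer does not name a specific one), and "Market Square" is a made-up comparison address.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


street_vs_road = similarity("John Smith Street", "John Smith Road")
street_vs_square = similarity("John Smith Street", "Market Square")

print(f"Street vs Road:   {street_vs_road:.2f}")
print(f"Street vs Square: {street_vs_square:.2f}")
```

The first pair scores as much more similar than the second, even though nothing in the strings tells us which addresses are physically close: the metric measures spelling, not geography.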

Summary statistics

Ok, now let's see how some summary statistics fit in this. Since statistics works with numbers, its functions are well defined over intervals. But let's see examples on whether/how we could generalise them to categorical or ordinal data:

mode - both when working with categorical and ordinal data, we can tell which element is most frequently used. So we have this. Then we can also derive all the other measures that @Maddenker lists in their answer. @gung's confidence interval could also be useful.
median - as @peter-flom says, as long as you have an order, you can derive your median.
mean, but also standard deviation, percentiles, etc. - you get these only with interval data, due to the need for a distance metric.
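The list above can be sketched as a small Python function that only returns the statistics defined for a given kind of variable. The function name `summarise`, the level labels and the example data are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, stdev


def summarise(values, level, order=None):
    """Return only the statistics that are defined for the given level.

    level: "categorical", "ordinal" or "interval" (labels assumed here).
    order: for ordinal data, the list of levels from lowest to highest.
    """
    # The mode needs nothing but the ability to count categories.
    summary = {"mode": Counter(values).most_common(1)[0][0]}
    if level == "ordinal":
        ranked = sorted(values, key=order.index)
        # Lower median: always an actual category, even for an even count.
        summary["median"] = ranked[(len(ranked) - 1) // 2]
    elif level == "interval":
        ranked = sorted(values)
        summary["median"] = ranked[(len(ranked) - 1) // 2]
        # Mean and SD rely on distances between values.
        summary["mean"] = mean(values)
        summary["sd"] = stdev(values)
    return summary


likert_order = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
likert = ["agree", "disagree", "strongly agree", "agree", "neutral"]
hair = ["brown", "black", "brown", "red"]
heights = [150, 160, 170]

print(summarise(hair, "categorical"))
print(summarise(likert, "ordinal", likert_order))
print(summarise(heights, "interval"))
```

Note that the ordinal median comes back as a category ("agree" for this data), not a number, which matches the point made in the first answer.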

Example of data contextuality

Finally, I want to stress again that the order and metrics you define on your data are very contextual. This should be obvious by now, but let me give you a last example: when working with geographical locations, we have lots of different ways to approach them:

if we are interested in the distance between them, we can work with their geolocation, which basically gives us a two-dimensional numerical space, thus interval.
if we are interested in their part-of relationship, we can define a total order (e.g. a street is part of a city, two cities are equal, a continent contains a country)
if we are interested in whether two strings represent the same address, we could work with some string distance that would tolerate spelling mistakes and swapping positions of words, but make sure to distinguish different terms and names. This is not an easy thing, but just to make the case.
There are plenty of other use cases that all of us encounter daily where none of this makes sense. In some of them there's nothing more to do than treat the addresses as just different categories; in others it comes down to very smart data modelling and preprocessing.
