5

There's a Forbes article about student-college match quality that contains an interesting graph based on a working paper by Eleanor Dillon and Jeff Smith.

The description reads:

The chart below, which represents individuals who attended college in the early 2000s, details how well students in various ability quartiles (measured by a broad-ranging aptitude test) are matched to college quality quartiles. Perfect matching would place 25% of the student population in each circle along the diagonal, with no students in the other circles.

About 36 percent of students are appropriately "matched" to colleges based on ability. Students in both the top ability quartile and the top college quality quartile represent 11% of the overall student population, or 44% of all students in the top ability quartile. Overall, 36% of students attend a college in their corresponding quality quartile, and 77% attend a school within one quartile of their ability group.

Here's the Forbes graph: match quality

I tried improving on this chart since I find circular areas hard to compare and I wanted to see not just the absolute percentage, but the marginals as well. Here's my attempt:

enter image description here

I did not bother to do the match shading (I am not sure there's a good metric), but I still find my graph unsatisfactory. It requires a lot of arithmetic to get insights out.

How would you display this data?

    cq   sa    pct  
     1    1   10.6  
     1    2    7.1  
     1    3    5.2  
     1    4      3  
     2    1    6.8  
     2    2    6.5  
     2    3    6.4  
     2    4    4.3  
     3    1      4  
     3    2    6.9  
     3    3    7.4  
     3    4    7.4  
     4    1    2.1  
     4    2    4.6  
     4    3    6.5  
     4    4   11.4  
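
As a quick check on the quoted summary figures, here is a minimal base-R sketch that rebuilds the table above, adds the marginals, and totals the diagonal and near-diagonal cells. The object names (`tab`, `dist`, `matched`, `within_one`) are illustrative; the percentages are copied from the table.

```r
# Percentages from the table above: rows = college quality quartile (cq),
# columns = student ability quartile (sa).
pct <- c(10.6, 7.1, 5.2,  3.0,
          6.8, 6.5, 6.4,  4.3,
          4.0, 6.9, 7.4,  7.4,
          2.1, 4.6, 6.5, 11.4)
tab <- matrix(pct, nrow = 4, byrow = TRUE,
              dimnames = list(cq = as.character(1:4), sa = as.character(1:4)))

# Joint percentages with row and column totals (the marginals)
with_margins <- addmargins(tab)
print(with_margins)

# Distance of each cell from the diagonal: 0 = matched, 1 = off by one, ...
dist <- abs(row(tab) - col(tab))

matched    <- sum(tab[dist == 0])  # same quartile on both measures
within_one <- sum(tab[dist <= 1])  # same or adjacent quartile
cat("matched:", matched, " within one:", within_one, "\n")
```

The cells sum to 100.2 rather than 100 because of rounding, so the totals are approximate; they come out close to the 36% and 77% quoted from the article.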
kjetil b halvorsen
dimitriy

3 Answers

4

It seems to me worth noting that these are, in essence, agreement data. We should use a plot designed for displaying and assessing such data. The plot I'm most familiar with for this purpose is Bangdiwala's agreement chart. You can find it discussed here:

In R, you can create one with ?agreementplot in the vcd package. (I know it can be done in SAS using the AGREE option in PROC FREQ, and I'm sure there are Stata macros for it as well.)

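    library(vcd)  # provides agreementplot()

    # read the data shown in the question (middle rows elided here)
    d <- read.table(text="cq   sa    pct  
    ...  
    4    4   11.4", header=TRUE)
    # cross-tabulate: rows = college quality, columns = student ability
    tab <- xtabs(pct~cq+sa, d)

    dev.new()  # open a new graphics window on any platform
    agreementplot(tab)

    ## you can also get the Bangdiwala B agreement statistics: 
    print(agreementplot(tab))
    # $Bangdiwala
    #           [,1]
    # [1,] 0.1352742
    # 
    # $Bangdiwala_Weighted
    #           [,1]
    # [1,] 0.5426176
    # 
    # $weights
    # [1] 1.0000000 0.8888889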
enter image description here

Some things to note from this plot are:

  1. The rectangles lie along the red diagonal. This means that neither measure is systematically higher or lower than the other. (That is, neither is a biased measure of the other.)
  2. The heavy black rectangles are a fairly small proportion of the area of the outer rectangles, indicating that the matching of students to schools is far from perfect.
  3. (The gray rectangles represent partial—'off by 1'—agreement.)
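
For readers who want to see where the unweighted statistic comes from, here is a minimal base-R sketch. It assumes the usual definition of Bangdiwala's B: the summed area of the black squares (squared diagonal cells) divided by the summed area of the outer rectangles (products of the matching row and column totals). The variable names are illustrative.

```r
# Same percentages as in the question: rows = cq, columns = sa
pct <- c(10.6, 7.1, 5.2,  3.0,
          6.8, 6.5, 6.4,  4.3,
          4.0, 6.9, 7.4,  7.4,
          2.1, 4.6, 6.5, 11.4)
tab <- matrix(pct, nrow = 4, byrow = TRUE)

# Area of the black squares: each diagonal cell squared
black_area <- sum(diag(tab)^2)

# Area of the outer rectangles: row total times matching column total
rect_area <- sum(rowSums(tab) * colSums(tab))

B <- black_area / rect_area
print(B)
```

This should agree with the `$Bangdiwala` value printed above, about 0.135.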
gung - Reinstate Monica
  • Glad to learn about this, but I'm suspicious that the data was engineered to be on the diagonal by computing the quartiles on the available data (not within a larger population), with deviations from perfect 25%s due to ties. – xan Jun 16 '16 at 19:55
  • @xan, I'm sure you're right. The variables were both cut into quartiles & so really cannot be biased relative to each other. I still mentioned that just for completeness. I nonetheless see these as essentially agreement data & think this is the plot to use (for yourself--I wouldn't put this in the newspaper for casual readers). – gung - Reinstate Monica Jun 16 '16 at 20:56
  • @gung Why is there no off-by-3 rectangle? – dimitriy Jun 16 '16 at 23:02
  • @DimitriyV.Masterov, the outer white rectangles combine the off-by-2s & the off-by-3s. It may be possible to get the function to plot the off-by-2s as well (the off-by-3s are automatically the white border), but I didn't look very far into it. If the function can't do it, it would be possible to code it up yourself. – gung - Reinstate Monica Jun 17 '16 at 02:00
2

I think the biggest weakness of the original is that the color intensity dominates our perception even though it is practically meaningless, in that it duplicates information already represented by the positions. I imagine that is what led to your dissatisfaction and search for alternatives.

Here is a version using color intensity for counts instead of using size for counts.

categorical heat map

It does a decent job of showing the counts falling off from the diagonal and that quartiles 2 and 3 are not that different. Neither color nor area is very easy to perceive accurately, but I switched from area to color for percents because it's more "glance-able" for pattern recognition. I used discrete colors instead of continuous colors to mask what I judged to be meaningless variations.
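
A discretized heat map along these lines can be sketched in base R. The bin edges and the grey palette below are arbitrary, illustrative choices, not the ones used for the figure above.

```r
# Percentages from the question: rows = college quality (cq), columns = ability (sa)
pct <- c(10.6, 7.1, 5.2,  3.0,
          6.8, 6.5, 6.4,  4.3,
          4.0, 6.9, 7.4,  7.4,
          2.1, 4.6, 6.5, 11.4)
tab <- matrix(pct, nrow = 4, byrow = TRUE,
              dimnames = list(cq = as.character(1:4), sa = as.character(1:4)))

# Hypothetical bin edges and shades (light = few students, dark = many)
breaks <- c(0, 4, 6, 8, 12)
shades <- gray(c(0.9, 0.65, 0.4, 0.15))

# Which bin each cell falls into (1 = lightest, 4 = darkest)
bins <- matrix(cut(tab, breaks = breaks, labels = FALSE),
               nrow = 4, dimnames = dimnames(tab))
print(bins)

# Draw only in an interactive session
if (interactive()) {
  # image() maps rows of z to the x axis, so transpose to put ability on x
  image(1:4, 1:4, t(tab), breaks = breaks, col = shades, axes = FALSE,
        xlab = "Student ability quartile", ylab = "College quality quartile")
  axis(1, at = 1:4)
  axis(2, at = 1:4)
  # Print each percentage in its cell, in white on the darker shades
  text(col(tab), row(tab), labels = format(tab, nsmall = 1),
       col = ifelse(bins > 2, "white", "black"))
}
```

With these particular edges, most of the middle cells land in the same bin, which is the kind of smoothing-over of small differences described above.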

Looking at marginals, I'm finding it easier to see patterns in separate bar charts than in overlaid lines -- not sure why.

enter image description here

enter image description here

With some effort, it may work to append the bar charts to two edges of the heat map for a true "marginal" effect.

Lines do make it easier to think in terms of what happens when the X variable changes from one value to the next.

enter image description here

The data seems too coarse to go very far with a visualization.

xan
  • Experiments have demonstrated that color intensities are even harder to compare than areas. – whuber Jun 16 '16 at 17:38
  • Barely though, right? My main point was that with color and area both present, the area is less prominent, but I muddled the point by switching percent from area to color instead of just removing color altogether. I also discretized the colors to emphasize the similarity of the middle values, and I need to point that out in the answer. I do think that color intensity is more glance-able. – xan Jun 16 '16 at 17:48
1

I think the current charts show the data pretty well. The stacked bar chart has such nice progressions it is easier to follow along than most stacked bar charts. The original bubble chart shows that there is a reasonable correlation between the two (I calculated it at 0.36).
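
One way to arrive at a correlation from the table alone is to treat the quartile numbers 1 to 4 as scores and weight each cell by its percentage. This is an assumption about the calculation, not necessarily how the figure above was obtained.

```r
# Long-format data from the question
d <- data.frame(
  cq  = rep(1:4, each = 4),
  sa  = rep(1:4, times = 4),
  pct = c(10.6, 7.1, 5.2,  3.0,
           6.8, 6.5, 6.4,  4.3,
           4.0, 6.9, 7.4,  7.4,
           2.1, 4.6, 6.5, 11.4)
)

# Cell weights, rescaled so they sum to 1
w <- d$pct / sum(d$pct)

# Weighted means of the two quartile scores
mean_cq <- sum(w * d$cq)
mean_sa <- sum(w * d$sa)

# Weighted covariance and standard deviations
cov_xy <- sum(w * (d$cq - mean_cq) * (d$sa - mean_sa))
sd_cq  <- sqrt(sum(w * (d$cq - mean_cq)^2))
sd_sa  <- sqrt(sum(w * (d$sa - mean_sa)^2))

r <- cov_xy / (sd_cq * sd_sa)
print(round(r, 2))
```

Under these assumptions the result rounds to 0.36.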

One alternative is a dot plot/line chart.

enter image description here

One thing I like about this is the ability to de-trend and then plot the same lines, so you can see deviations from expected as opposed to simply the bivariate percentages. I'm not sure what a reasonable model is, though. A default choice is the residuals from the cross-tab table, although in this case that just replicates the original chart.
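
As one possible reading of "de-trending", the sketch below subtracts the percentages expected if ability and college quality were independent (row total times column total, divided by the grand total). The independence baseline is an assumption chosen for illustration; other expected values could be substituted.

```r
# Percentages from the question: rows = college quality (cq), columns = ability (sa)
pct <- c(10.6, 7.1, 5.2,  3.0,
          6.8, 6.5, 6.4,  4.3,
          4.0, 6.9, 7.4,  7.4,
          2.1, 4.6, 6.5, 11.4)
tab <- matrix(pct, nrow = 4, byrow = TRUE,
              dimnames = list(cq = as.character(1:4), sa = as.character(1:4)))

# Expected percentages under independence of the two classifications
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Raw residuals in percentage points: positive = more students than expected
resid_pct <- tab - expected
print(round(resid_pct, 1))

# Draw only in an interactive session: one line per ability quartile
if (interactive()) {
  matplot(1:4, resid_pct, type = "b", pch = 19, lty = 1,
          xlab = "College quality quartile",
          ylab = "Observed minus expected (pct. points)")
  abline(h = 0, lty = 2)
  legend("top", legend = paste("Ability quartile", 1:4),
         col = 1:4, lty = 1, pch = 19, bty = "n")
}
```

The residuals sum to zero by construction, so the plot shows only where students are over- or under-represented relative to that baseline.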

enter image description here

It strikes me (both from the original bubble plot and this dot chart) that there is more binning at the extremes, but I'm not sure of a way off-hand to quantify that.

There are always more fancy things you could do (like a network graph that has the two quartile sets as nodes and shows weighted lines). But I think these examples are basically all you need.

Andy W
  • These solutions appear to miss the point of the original scatterplot, which is intended to display *positive correlation*. – whuber Jun 16 '16 at 17:38
  • Well sure, this does not do the job of showing correlation as well as a scatterplot. There are other comparisons that might be of interest though (the OP was not real specific). For example, for student quartile 2, the marginal proportions in the 4th, 3rd, and 2nd school quartiles are basically equivalent. – Andy W Jun 16 '16 at 17:52
  • I think the question of `Prob(School Quartile|Student Ability Quartile)` is a bit more of an interesting question (given the data at hand; scatterplots of the original variables would be most preferable). (Which suggests I might change the percents to sum to 100 within each student quartile.) Also, the OP's original graphs have the X and Y axes reversed if that is the question you are interested in. – Andy W Jun 16 '16 at 17:58
  • The main point I cared to make was really that if you had an expected value for the probability, you could plot the deviation from that value instead of the proportion itself. (I stated in my first sentence that I think the original graphs do an alright job of showing most of the (banal) details of the data.) I'm not sure what a reasonable expectation is, though, given the description -- all seems as we might expect it to be. – Andy W Jun 16 '16 at 17:59
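
The conditional distribution mentioned in the comments, `Prob(School Quartile|Student Ability Quartile)`, can be sketched by rescaling each ability column to sum to 100. The variable names are illustrative.

```r
# Percentages from the question: rows = college quality (cq), columns = ability (sa)
pct <- c(10.6, 7.1, 5.2,  3.0,
          6.8, 6.5, 6.4,  4.3,
          4.0, 6.9, 7.4,  7.4,
          2.1, 4.6, 6.5, 11.4)
tab <- matrix(pct, nrow = 4, byrow = TRUE,
              dimnames = list(cq = as.character(1:4), sa = as.character(1:4)))

# Percent of each ability quartile (column) attending each college quartile (row)
cond <- 100 * prop.table(tab, margin = 2)
print(round(cond, 1))
```

The bottom-right entry, about 44, corresponds to the 44% of top-ability students in top-quartile colleges quoted in the question.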