Should I compare the median or mean for my data?

Question

I have a data frame in R that contains information about 252 freshman university students.

head(freshmangrad, 10)
   graduation_college gradlength
1                  BI       1247
2                  AE       1247
3                  AE       1247
4                  EN       1247
5                  AE       1247
6                  EN       1735
7                  AE       1247
8                  LS       1004
9                  AE       1247
10                 EN       1247

graduation_college is the college a student graduated from (College of Letters and Science, College of Engineering, etc.) and gradlength is how long it took in days for each student to graduate. Each row is a unique student.

I want to know if it takes students from a particular college longer to graduate than students in other colleges. My first instinct was to create a box and whisker plot using ggplot2: ggplot(freshmangrad, aes(x = graduation_college, y = gradlength)) + geom_boxplot()

Then I wondered what a plot of the means would look like for my data using the package gplots: plotmeans(freshmangrad$gradlength~freshmangrad$graduation_college, connect = FALSE)

These two plots look very different. In the first plot, it doesn't look like it takes students from one particular college longer to graduate than students from another college by looking at the median. In the second plot, you can clearly tell students from LS (College of Letters and Science) graduate faster than students from the other colleges.

Which plot should I be using? Why does one plot show a relationship between the graduation college and the time it takes to graduate, while the other plot does not? Is there an error in my code perhaps?

BruceET · Answer 1 · 2019-07-04T05:48:42.120

Within each college, many students took about 1250 days to graduate, enough to make that the approximate median for each college. The difference is that LS seems to have a roughly symmetrical distribution around 1250, with some students finishing faster and some slower.

The sample median of $(2,2,2,2,2,2,3,3,3,3)$ is 2; as is the median of $(1,1,1,2,2,2,3,10,20,50).$ But the sample means are very different.

a = c(2,2,2,2,2,2,3,3,3,3)
median(a); mean(a)
[1] 2
[1] 2.4

b = c(1,1,1,2,2,2,3,10,20,50)
median(b); mean(b)
[1] 2
[1] 9.2

By contrast, the other 3 colleges have right-skewed distributions with enough graduates taking longer to draw the mean up higher than the median. Clearly, the mean shows differences among colleges that the median does not.
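To see the same effect on something shaped like the graduation data, here is a minimal sketch. The data frame toy is made up for illustration (it is not the asker's freshmangrad, and the values are my own assumptions): both colleges have most students at 1247 days, AE has only a right tail, and LS has tails on both sides.

```r
# Synthetic stand-in for freshmangrad (NOT the real data): 20 students per
# college, 12 of whom graduate at exactly 1247 days.
on_time <- rep(1247, 12)
toy <- data.frame(
  graduation_college = rep(c("AE", "LS"), each = 20),
  gradlength = c(
    on_time, 1369, 1369, 1491, 1491, 1735, 1735, 1735, 1979,  # AE: right tail only
    on_time, 1004, 1004, 1004, 1125, 1369, 1369, 1491, 1491   # LS: tails on both sides
  )
)

# Median and mean of gradlength within each college
medians <- tapply(toy$gradlength, toy$graduation_college, median)
means   <- tapply(toy$gradlength, toy$graduation_college, mean)
print(rbind(median = medians, mean = means))
```

In this toy example the two medians are identical (1247), while the AE mean is pulled well above the LS mean by its right tail.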

However, the crux of the matter might lie in looking at those who took substantially longer than 1250 days to graduate. If having students graduate 'on schedule' is considered an administrative or academic success at this University, it might be worthwhile trying to discover the personal and bureaucratic reasons for delayed graduations. Statistically, it might be interesting to make comparisons--among colleges--of the quarter of the students in each college who take the longest to graduate.
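One way such a comparison might be set up is sketched below. Again the data frame toy is synthetic, and both the 1250-day cutoff and the helper slowest_quarter() are my own choices for illustration, not anything taken from the question.

```r
# Synthetic stand-in for freshmangrad (NOT the real data)
on_time <- rep(1247, 12)
toy <- data.frame(
  graduation_college = rep(c("AE", "LS"), each = 20),
  gradlength = c(
    on_time, 1369, 1369, 1491, 1491, 1735, 1735, 1735, 1979,  # AE
    on_time, 1004, 1004, 1004, 1125, 1369, 1369, 1491, 1491   # LS
  )
)

# Hypothetical helper: the slowest quarter (rounded up) of a group
slowest_quarter <- function(x) {
  head(sort(x, decreasing = TRUE), ceiling(length(x) / 4))
}

# Per college: share taking longer than 1250 days, and the median
# graduation time among the slowest quarter
summary_by_college <- sapply(
  split(toy$gradlength, toy$graduation_college),
  function(x) c(
    share_over_1250        = mean(x > 1250),
    median_slowest_quarter = median(slowest_quarter(x))
  )
)
print(summary_by_college)
```

With the real data, the same two lines of summary would be computed from freshmangrad$gradlength split by freshmangrad$graduation_college.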

If I had to choose between the two graphical displays, I would pick the boxplots, specifically because they emphasize differences in numbers of delayed graduations. The second plot, of means and confidence intervals (which is what plotmeans draws by default), hides interesting differences among the colleges, in particular the skewness of three of the distributions. It makes it look like everything is about the same, except that LS students graduate in fewer days.

Note: My guess is that a one-way ANOVA may show significant differences among the colleges that are fairly convincing (unless someone notes non-normal residuals). Also, a Kruskal-Wallis test might not show significant differences, because some of the essence of the differences gets lost in reducing the data to ranks. (Of course, I can't be sure of this because I don't have access to the data and can't run the tests to see what actually happens.)
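For anyone who wants to run both tests, the calls would look roughly like the sketch below. It uses a small synthetic data frame toy as a stand-in (an assumption on my part), so its p-values say nothing about the real colleges; with the actual data, freshmangrad would replace toy.

```r
# Synthetic stand-in for freshmangrad (NOT the real data)
on_time <- rep(1247, 12)
toy <- data.frame(
  graduation_college = factor(rep(c("AE", "LS"), each = 20)),
  gradlength = c(
    on_time, 1369, 1369, 1491, 1491, 1735, 1735, 1735, 1979,  # AE
    on_time, 1004, 1004, 1004, 1125, 1369, 1369, 1491, 1491   # LS
  )
)

# One-way ANOVA: compares group means
fit <- aov(gradlength ~ graduation_college, data = toy)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]

# Kruskal-Wallis: rank-based comparison of the groups
kw <- kruskal.test(gradlength ~ graduation_college, data = toy)
kw_p <- kw$p.value

print(c(anova = anova_p, kruskal_wallis = kw_p))

# A normal Q-Q plot of residuals(fit) is one way to check the
# normality concern mentioned above.
```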

(+1) Better yet than the bare box plots is any kind of display that shows all the data too. Examples at https://stats.stackexchange.com/questions/205629/histogram-or-box-plot-to-compare-two-distributions-of-means and https://stats.stackexchange.com/questions/114744/how-to-present-box-plot-with-an-extreme-outlie — Nick Cox, Jul 04 '19 at 05:59
What @NickCox says. With the large concentration at 1250 dominating the discussion, it would be worthwhile to invest some time in finding a good visualization that shows this aspect. Maybe four histograms (with axes all to the same scale, so they can be compared). — Stephan Kolassa, Jul 04 '19 at 06:28
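
A minimal base-R sketch of the histograms suggested in the comments, one panel per college with shared bins and shared axes. The data frame toy is synthetic (an assumption, not the asker's data), and the 100-day bin width is an arbitrary choice.

```r
# Synthetic stand-in for freshmangrad (NOT the real data)
on_time <- rep(1247, 12)
toy <- data.frame(
  graduation_college = rep(c("AE", "LS"), each = 20),
  gradlength = c(
    on_time, 1369, 1369, 1491, 1491, 1735, 1735, 1735, 1979,  # AE
    on_time, 1004, 1004, 1004, 1125, 1369, 1369, 1491, 1491   # LS
  )
)

by_college <- split(toy$gradlength, toy$graduation_college)

# Common bins, and a common y-axis limit taken from the tallest bar
breaks <- seq(900, 2100, by = 100)
counts <- lapply(by_college, function(x) hist(x, breaks = breaks, plot = FALSE)$counts)
ymax   <- max(unlist(counts))

# One panel per college, all drawn to the same scale
op <- par(mfrow = c(1, length(by_college)))
for (college in names(by_college)) {
  hist(by_college[[college]], breaks = breaks,
       xlim = range(breaks), ylim = c(0, ymax),
       main = college, xlab = "Days to graduate")
}
par(op)
```

Because every panel uses the same breaks and limits, the spike near 1250 days and the differing tails can be compared directly across colleges.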
