Comparing the age distribution of two groups with different sample sizes

Question

I have a dataset regarding a hotel. I am trying to compare the age distribution of two groups: leisure and business. The sample sizes of these groups are not equal. The business group has 12.4k entries, and the leisure group has 19.6k entries. In reality, the business group is a lot bigger.

For 18 year olds, the leisure has an absolute value of 66, and the business has a value of 8.

If I want to compare these two, would it be statistically correct to convert them to percentages (of their own respective groups, i.e. 8/12400 and 66/19600) and then compare those?

What do you mean by "compare"? Descriptively? Using statistical hypothesis tests etc? — Michael M, Apr 13 '16 at 12:13
I would like to research whether there are any big differences in the groups. For example: Do people in the group aged 18 - 25 usually book leisure or business? — JackRumble, Apr 13 '16 at 12:32
With your second paragraph I find myself at a loss to understand what it means. When you say "leisure has an absolute value of 66, and the business has a value of 8"... what does this *absolute value of 66* refer to? How many age groups are there? — Glen_b, Sep 23 '16 at 08:01
If you want to compare the proportions in leisure and business for each age group it sounds like a test of homogeneity (/independence) for a contingency table, for which a chi-squared test may make sense -- but what worries me is several potential sources of dependence. — Glen_b, Sep 23 '16 at 08:03
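
As a rough sketch of the contingency-table idea in the comment above, here is a comparison that uses only the two counts given in the question (66 of 19,600 leisure bookings and 8 of 12,400 business bookings by 18-year-olds). It ignores the dependence concerns raised in the comment, and a full analysis would use one row per age group rather than a single age. The variable names are chosen here for illustration.

```r
# Counts taken from the question: 18-year-olds per group and group totals
n_18  <- c(leisure = 66, business = 8)
total <- c(leisure = 19600, business = 12400)

# Within-group proportions put both groups on the same scale
props <- n_18 / total
round(100 * props, 3)   # percentage of each group's own total

# Two-sample test of equal proportions (equivalent to a chi-squared
# test on the 2 x 2 table of "18 / not 18" by group)
res <- prop.test(n_18, total)
res$estimate
res$p.value
```

Dividing by each group's own total is what makes unequal sample sizes comparable; the test then accounts for how precisely each proportion is estimated.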

score 3 · Answer 1 · answered Feb 11 '17 at 18:52

There are a few reasonable ways to approach a problem like this. The best way for you will depend on what you want the focus of your analysis to be.

Focusing on understanding age differences by group

To test whether there are age differences between the two groups (e.g. Are people who book leisure travel younger than those who book business travel?), use a t-test. Because of the uneven sample sizes, you may want to opt for the Welch approximation t-test, which does not make the assumption of homogeneity of variances (and which may be the safer choice in general). For example:

> # set random seed to match these results exactly, if desired
> set.seed(24601)
> # generate some toy data
> leisure <- data.frame(type = "leisure", age = rnorm(n=19600, mean = 40, sd = 10))
> business <- data.frame(type = "business", age = rnorm(n=12400, mean = 50, sd = 8))
> df <- rbind(leisure, business)
> head(df)
     type      age
1 leisure 37.43849
2 leisure 45.83482
3 leisure 44.73519
4 leisure 21.21979
5 leisure 37.25679
6 leisure 33.63798

The Welch approximation is the default for t.test(), so no need to do anything special:

> t.test(age ~ type, data = df)

    Welch Two Sample t-test

data:  age by type
t = -99.61, df = 30121, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.28936  -9.89224
sample estimates:
 mean in group leisure mean in group business 
              39.97088               50.06168

This shows us that the leisure group are younger than the business group, by approximately 10 years (95%CI -10.29 to -9.89). Here's a quick visualization of the difference:

> library(ggplot2)
> ggplot(df, aes(x=age, fill = type)) + 
+   geom_histogram(alpha = .5, bins = 30, position = "identity") + 
+   theme_classic()

Focusing on predicting group by age

If you'd rather focus on type of travel as the outcome, you may want to frame your analysis in terms of predicting the probability of one type of travel over the other using age as a predictor (e.g. What's the probability of 20-year-olds booking business rather than leisure travel?).

The simplest way to do this is to treat leisure and business as the only two possible (mutually exclusive) outcomes. In other words, you're only modeling a population of people who are definitely booking travel, and it's for EITHER leisure or business, with no other possible options. That's a simplification of reality, of course, but it seems like it may be reasonable in this case. You can model this with a logistic regression model:

> # test for probability of booking type by age
> logit.model <- glm(type ~ age, data = df, family = "binomial")
> summary(logit.model)

Call:
glm(formula = type ~ age, family = "binomial", data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5880  -0.8589  -0.4639   0.9738   2.5438  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.909358   0.075339  -78.44   <2e-16 ***
age          0.120670   0.001609   75.02   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 42727  on 31999  degrees of freedom
Residual deviance: 34674  on 31998  degrees of freedom
AIC: 34678

Number of Fisher Scoring iterations: 4

Unsurprisingly, again, there is a significant effect (because age and travel type are not independent in the way I generated the data). With higher age, travel type is more likely to be business rather than leisure. You can visualize this by extracting the predicted probabilities of booking business vs. leisure travel at each age and plotting them alongside the actual observations, shown as a kind of rug plot for each group.

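> df$probs <- predict(logit.model, df, type = "response") # use the model object fitted above
> # make a numeric 0/1 version for plotting (0 = leisure, 1 = business)
> df$type_num <- as.numeric(factor(df$type, levels = c("leisure", "business"))) - 1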
> 
> ggplot(df, aes(x = age)) + 
+   geom_point(aes(y=type_num, color = type), alpha = .2) +
+   geom_line(aes(y=probs)) + 
+   labs(y="Probability of Booking Business Travel") + 
+   theme_classic()

You can see from the plot here that the probability of a 20-year-old booking business rather than leisure travel is almost 0% (in my completely made-up data). By about 50 it's pretty much an even chance, and over about 60 it's much more likely that they'll book business rather than leisure travel.
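
Those readings can be checked directly from the coefficients in the summary output above. This is a minimal sketch that hard-codes the reported estimates so it runs without refitting the model; `b0`, `b1`, and `p_business` are names chosen here for illustration.

```r
# Coefficients copied from the summary output above
b0 <- -5.909358   # intercept
b1 <-  0.120670   # slope for age (log-odds per year)

# Predicted probability of business (rather than leisure) at a given age
p_business <- function(age) plogis(b0 + b1 * age)

round(p_business(c(20, 50, 60)), 3)

# Age at which the two outcomes are equally likely (log-odds = 0)
age_even <- -b0 / b1
age_even
```

With these estimates the predicted probability is roughly 3% at age 20, a little over 50% at age 50, and close to 80% at age 60, with the even-odds point just under 49.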

TrynnaDoStat · Answer 2 · 2016-04-13T12:18:12.060

If you want to make no assumption about the distribution of age, consider the two-sample Kolmogorov–Smirnov test. The two-sample Kolmogorov–Smirnov test compares the empirical distribution functions (ECDFs) of two samples, meaning it considers both the location and the shape of the two samples. There are several packages in R for this, one of which is dgof.
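
For reference, a two-sample Kolmogorov–Smirnov test is also available in base R as `ks.test()` in the stats package, so an extra package is not strictly required. A minimal sketch on simulated ages follows; the means, standard deviations, and variable names are made up for illustration and are not taken from the hotel data.

```r
set.seed(1)
# Simulated ages; group sizes match the question, everything else is assumed
age_leisure  <- rnorm(19600, mean = 40, sd = 10)
age_business <- rnorm(12400, mean = 50, sd = 8)

# Two-sample KS test: compares the two empirical CDFs
ks <- ks.test(age_leisure, age_business)
ks$statistic   # D = largest vertical gap between the two ECDFs
ks$p.value
```

Real ages recorded in whole years will contain many ties, which the KS test was not designed for, so the p-value should be treated as approximate in that case.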

If you can assume age is Normally distributed, you can just perform a two sample t-test. If it looks like your data is Normally distributed, you should run a two-sample t-test as this will have the most statistical power. If you want to check how Normal your age data are, you can use the Shapiro-Wilk test (or even just plot the density of age by using plot(density(age)) where age is the vector with the age data in R).
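
A sketch of those normality checks, again on simulated data. One practical caveat: R's `shapiro.test()` only accepts between 3 and 5000 observations, so with groups of this size it has to be run on a random subsample. The subsample size of 5000 and the variable names here are assumptions for illustration.

```r
set.seed(1)
# Simulated stand-in for one group's ages (not real data)
age <- rnorm(19600, mean = 40, sd = 10)

# shapiro.test() errors for n > 5000, so test a random subsample
age_sub <- sample(age, 5000)
sw <- shapiro.test(age_sub)
sw$statistic   # W close to 1 is consistent with normality
sw$p.value

# Visual checks are often more informative with large samples
plot(density(age), main = "Density of age")
qqnorm(age_sub); qqline(age_sub)
```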

If you just want to compare the densities visually, let's say your data is in this format:

Age    Group
18     Leisure
27     Business
33     Business
21     Leisure

You can run,

library(ggplot2)
ggplot(data = dat) + geom_density(aes(x = Age, group = Group))

That's more or less what I did. I calculated the frequencies using `table()` and turned that back into a dataframe. Then I calculated the percentages from the frequencies, added these to the dataframe, and added a new column with the type (either leisure or business). I merged the leisure and business dataframes using `rbind()` and plotted them as: `ggplot(df_business_leisure, aes(age, prct, color = Type)) + geom_line()` My doubts lie in the fact that the sample sizes are uneven, and my internship supervisor told me I can't compare them like this because of that. — JackRumble, Apr 13 '16 at 12:48
The plot is just a visual to understand what's going on and/or supplement a formal test. The other two tests I mentioned are formal hypothesis tests that will take sample sizes into account. — TrynnaDoStat, Apr 13 '16 at 12:55

score 1 · Answer 3 · answered Jan 07 '17 at 16:18

I suspect that statistical tests won't be helpful. With such large sample sizes, of course the null hypothesis of exactly equal age distributions will be rejected.

My suggestion: Draw a graph (histogram) of age distributions in the two groups, and make your conclusions based on looking at the graph, without any statistical testing.
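
One way to keep such a graph fair with unequal group sizes is to plot within-group percentages rather than raw counts, so that each group sums to 100%. A minimal base-R sketch on simulated ages; the 5-year bins, the simulated means, and all names are assumptions made for illustration.

```r
set.seed(1)
# Simulated ages (not real data); group sizes match the question
age   <- c(rnorm(19600, mean = 40, sd = 10), rnorm(12400, mean = 50, sd = 8))
group <- rep(c("leisure", "business"), times = c(19600, 12400))

# Common 5-year bins wide enough to cover every simulated age
lo     <- floor(min(age) / 5) * 5
hi     <- ceiling(max(age) / 5) * 5
breaks <- seq(lo, hi, by = 5)
bins   <- cut(age, breaks, include.lowest = TRUE)

# Row percentages: share of each group falling in each age bin
tab <- table(group, bins)
pct <- 100 * prop.table(tab, margin = 1)

# Side-by-side bars of within-group percentages
barplot(pct, beside = TRUE, las = 2, ylab = "% of group", legend.text = TRUE)
```

Because each row of `pct` sums to 100, the bar heights are directly comparable even though one group has about 7,000 more entries than the other.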
