Why does the line of best fit start near zero given this data?

Question

I am using the Wage data set from here. I created a qq plot in R like so: ggqqplot(Wage$age, ylab = "Wage").

The best-fit line starts below zero, which seems strange to me. I would expect it to start at the lowest age that exists in the data set (in this case, 18).

If I instead generate some random numbers that follow a (truncated) normal distribution with numbers <- rtnorm(3000, 50, 3, 10, 100), the lowest value I get is 39. If I plot this, the best-fit line starts at 39, which is what I would expect.

I feel like I am missing something obvious here, but can't really understand what.

score 1 · Accepted Answer · answered Jan 30 '20 at 16:43

There's nothing "special" about the lowest point in the range. In the second plot, it just so happens that it's possible to draw a good best-fit line that starts very close to the bottom-most point and still lies close to the rest of the points. If you started the best-fit line for the first graph at the bottom-most point, any straight line would miss all the points in the middle by a wide margin. The best-fit line minimizes the overall error over all the points simultaneously; there's no reason to specifically minimize that error for the bottom-most point at the cost of increased error for many other points. In the first graph, a line that goes through the bottom-most point simply cannot fit the dataset as a whole as well as one that does not go through that point (which is the line shown).

The strong linear behavior of the points in the middle of the first graph acts as an "anchor" that yields very low error when drawing a line through those points. Most of the data is summarized well by that line, so it makes sense that the best fit line is close to that. The other perhaps 10% of the data at the left of the graph doesn't fit very well with this linear behavior, but it's better than the alternative of fitting that 10% and then missing the other 90% by a wide margin.
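
To make the argument above concrete, here is a minimal sketch in base R. It assumes a hypothetical stand-in for the age column (uniform values between 18 and 80, not the actual Wage data) and fits an ordinary least-squares line to the points of a normal QQ plot. Whether ggqqplot draws this particular line is not established here; the sketch only shows that a line fitted to all points at once need not pass through the lowest one.

```r
set.seed(1)

# Hypothetical stand-in for Wage$age: bounded below at 18, not normal
x <- runif(3000, min = 18, max = 80)

# Coordinates of the QQ plot: theoretical normal quantiles vs. sorted data
n <- length(x)
z <- qnorm(ppoints(n))
y <- sort(x)

# Least-squares line through all QQ points
fit <- lm(y ~ z)

# Height of that line at the left-most point of the plot
ls_at_left <- unname(coef(fit)[1] + coef(fit)[2] * z[1])

# Compare with the smallest observed value
c(line_at_left = ls_at_left, smallest_value = min(x))
```

With data like this, the fitted line sits well below the smallest observation at the left edge, because the bulk of the points in the middle determines the slope.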

score 1 · Answer 2 · answered Jan 30 '20 at 17:18

I suspect that the lines drawn on these qq plots are not the "best fit" lines in the sense of a linear regression. They more likely represent how the data values would have been distributed if they had been sampled from a normal distribution with the mean and variance indicated by the data. That's quite clear from the second plot; if you start with normally distributed data, the data will fall closely along that line.
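
As a rough illustration of a reference line that is not a regression fit, the sketch below rebuilds the line that base R's qqline() draws by default, which passes through the first and third quartiles of the data and of the normal distribution. The data are a hypothetical stand-in (uniform between 18 and 80), and it is an assumption that ggqqplot uses the same or a similar construction.

```r
set.seed(1)

# Hypothetical stand-in for Wage$age
x <- runif(3000, min = 18, max = 80)

# Quartiles of the sample and of the standard normal distribution
probs <- c(0.25, 0.75)
y_q <- quantile(x, probs, names = FALSE)
z_q <- qnorm(probs)

# Line through the two quartile pairs
slope <- diff(y_q) / diff(z_q)
intercept <- y_q[1] - slope * z_q[1]

# Height of the line at the smallest theoretical quantile on the plot
z_min <- qnorm(ppoints(length(x))[1])
line_at_left <- intercept + slope * z_min

c(line_at_left = line_at_left, smallest_value = min(x))

# To see it graphically:
# qqnorm(x); qqline(x); abline(intercept, slope, lty = 2)
```

The line describes where the points would fall if the data were normal with a matching centre and spread, so it can extend far below any value that actually occurs.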

The reason that the line does not fit the data in the first plot is that the data were not drawn from an underlying normal distribution. Had they been drawn from a normal distribution, the line indicates that wage values in the bottom 0.1% or so of the data (around 3 standard deviations below the mean for a normal distribution, given that the x-axis seems to have units of standard deviations from the mean) would be close to or below 0. The points show that the low-end tail of the actual data never gets that low, which argues against a normal distribution. That makes sense for data that are fundamentally non-negative, like wage data: unlike wages (at least as typically expressed), a normal distribution can in principle take values below zero.
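
The tail argument can be checked numerically. The sketch below again uses a hypothetical stand-in for the data (uniform between 18 and 80) and compares the share of values below zero that a normal distribution with the sample's mean and standard deviation would predict against the share actually observed.

```r
set.seed(1)

# Hypothetical stand-in for the plotted variable; never negative
x <- runif(3000, min = 18, max = 80)

# Normal distribution with the same mean and standard deviation
m <- mean(x)
s <- sd(x)

# Share of values a normal distribution would put below zero
p_below_zero_normal <- pnorm(0, mean = m, sd = s)

# Share of the actual data below zero
p_below_zero_data <- mean(x < 0)

# Share of a normal distribution lying more than 3 SDs below the mean
p_three_sd <- pnorm(-3)

c(normal = p_below_zero_normal, data = p_below_zero_data, three_sd = p_three_sd)
```

A positive predicted share alongside an observed share of zero is the same mismatch that shows up in the QQ plot as points curving away from the line at the left.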

See this page on qq plots for how to interpret such plots in terms of different types of deviations from normality.
