QQ plot in Python

Question

I generated a Q-Q plot using the following code. I know that a Q-Q plot is used to check whether the data are normally distributed. My question is: what do the x- and y-axis labels indicate in a Q-Q plot, and what does the $R^2$ value indicate?

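    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats

    N = 1200   # trials per binomial draw
    p = 0.53   # success probability
    q = 1000   # number of draws
    obs = np.random.binomial(N, p, size=q) / N   # observed proportions

    # standardize to mean 0 and standard deviation 1
    z = (obs - np.mean(obs)) / np.std(obs)

    stats.probplot(z, dist="norm", plot=plt)
    plt.title("Normal Q-Q plot")
    plt.show()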

[Figure: the Q-Q plot produced by the code above]

I know there is already a discussion about Q-Q plots, but I couldn't understand the concept despite going through that discussion.

This is very close to being a duplicate of the linked thread - Python vs R is not an important distinction here - but the $R^2$ aspect is new. It might be a good idea for question and answers to focus a little more on that aspect to avoid duplication. (I wonder whether $R^2$ is prone to being misunderstood, since even for poor fit, the upwards slope that's inevitable in a QQ plot means we expect an $R^2$ somewhat larger than zero. So values that might be quite impressive in a regression analysis may not be quite so impressive here.) — Silverfish, Apr 10 '15 at 07:16
@Silverfish I would not find it helpful or worthwhile to focus on the $R^{2}$. Q-Q plots are typically *seen*, not just reported with a table of myriad $R^{2}$ values. As long as the visualization is there, why reduce it to a single number? If the Q-Q plot looks "bad", but the $R^{2}$ somehow looks "good", would you still claim it's normal? Most good packages do not even provide the $R^{2}$ for precisely this reason. This viz-versus-moment argument even has a cute name: [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet). — Mike Williamson, Jul 29 '17 at 17:35
@MikeWilliamson I agree that the $R^2$ is not likely to prove especially useful (this was part of my point, though I may have expressed it poorly). My main point was that "how to interpret a QQ-plot" has been discussed at length [here](https://stats.stackexchange.com/questions/101274/how-to-interpret-a-qq-plot), as the question already notes - the only reason this thread should not be closed as a duplicate is the query about $R^2$, so that really ought to be discussed in the answers here (even if it is to say that it is not useful!) — Silverfish, Jul 30 '17 at 15:12
Are you sure that you plot a Q-Q plot? `help(probplot)` states: _`probplot` generates a probability plot, which should not be confused with a Q-Q or a P-P plot._ — abukaj, Jun 05 '18 at 14:18

Mike Williamson · Answer 1 · 2015-11-24T18:54:33.797

Macond's answer is accurate; however, given the original post, I thought it might be helpful to simplify the verbiage a bit.

A Q-Q plot stands for a "quantile-quantile plot".

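It is a plot where the axes are purposely transformed in order to make a normal (or Gaussian) distribution appear as a straight line. In other words, perfectly normal data that has been standardized (as `z` is in the question) would exactly follow a line with slope = 1 and intercept = 0. Without standardization the points still fall on a straight line, but its slope and intercept reflect the standard deviation and mean of the data.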

Therefore, if the plot does not appear to be - roughly - a straight line, then the underlying distribution is not normal. If it bends up, then there are more "high flyer" values than expected, for instance. (The link provides more examples.)
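As a rough numerical check of the slope and intercept statement, here is a minimal sketch. The sample (`raw`, drawn with mean 5 and standard deviation 2) and the seed are made up for illustration; only `scipy.stats.probplot` comes from the question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
raw = rng.normal(loc=5.0, scale=2.0, size=1000)  # hypothetical normal sample
z = (raw - raw.mean()) / raw.std()               # standardized copy

# Without a plot argument, probplot only computes the points and the fitted line.
_, (slope_raw, intercept_raw, _) = stats.probplot(raw, dist="norm")
_, (slope_z, intercept_z, _) = stats.probplot(z, dist="norm")

print(f"raw:          slope={slope_raw:.3f}, intercept={intercept_raw:.3f}")
print(f"standardized: slope={slope_z:.3f}, intercept={intercept_z:.3f}")
```

For the raw sample the fitted slope and intercept should land near the standard deviation and mean; for the standardized copy they should land near 1 and 0.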

What do the x & y labels represent?

The theoretical quantiles are placed along the x-axis. That is, the x-axis is not your data, it is simply an expectation of where your data should have been, if it were normal.

The actual data is plotted along the y-axis.

Because the data in the question were standardized, the values on both axes are in units of standard deviations from the mean. So, 0 is the mean of the data, 1 is 1 standard deviation above, etc. This means, for instance, that about 68.27% of all your data should be between -1 & 1, if you have a normal distribution.
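To make the two axes concrete, this sketch pulls the plotted coordinates out of `probplot` and rebuilds them by hand. The x-values use Filliben's estimate of the uniform order statistic medians, which is the formula given in the SciPy documentation for `probplot`; the seed and the variable names are my own choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed for reproducibility
obs = rng.binomial(1200, 0.53, size=1000) / 1200
z = (obs - obs.mean()) / obs.std()

# x = theoretical quantiles, y = the data sorted in increasing order
(theoretical, ordered), _ = stats.probplot(z, dist="norm")

# Rebuild the x-axis: Filliben's plotting positions, pushed through the
# inverse normal CDF.
n = len(z)
i = np.arange(1, n + 1)
probs = (i - 0.3175) / (n + 0.365)
probs[-1] = 0.5 ** (1.0 / n)
probs[0] = 1.0 - probs[-1]
x_by_hand = stats.norm.ppf(probs)

print(np.allclose(ordered, np.sort(z)))       # y-axis is just the sorted data
print(np.allclose(theoretical, x_by_hand))    # x-axis depends only on n
```

Note that the x-values depend only on the sample size and the reference distribution, not on the data themselves.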

What does the $R^2$ value mean?

The $R^2$ value is not particularly useful for this sort of plot. $R^2$ is typically used to measure how much of the variation in one variable is explained by another. Here, though, you are comparing sorted theoretical values with sorted actual values, so both increase together and there will necessarily be a fairly high $R^2$. (E.g., even a random uniform distribution will have a moderately decent $R^2$.)
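A small sketch of that point, comparing the $R^2$ that `probplot` reports for a normal sample with the values for two clearly non-normal ones. The three samples and the seed are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # fixed seed for reproducibility
samples = {
    "normal": rng.normal(size=1000),
    "uniform": rng.uniform(size=1000),
    "exponential": rng.exponential(size=1000),
}

r_squared = {}
for name, data in samples.items():
    # The third fit value is the correlation r between the plotted x and y.
    _, (_, _, r) = stats.probplot(data, dist="norm")
    r_squared[name] = r ** 2
    print(f"{name:12s} R^2 = {r_squared[name]:.3f}")
```

Even the non-normal samples give values that would look respectable in an ordinary regression, which is why the number alone says little about normality.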

Lastly, there is a similar but rarely used plot called the P-P plot. It is more useful if you are interested in where the bulk of the data lies, rather than in the extremes.
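For comparison, the coordinates of a P-P plot can be sketched as follows. This is one possible construction, not a library routine: the plotting positions `(i - 0.5) / n` are an assumed convention, and the sample is made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # fixed seed for reproducibility
data = rng.normal(size=1000)    # hypothetical sample
z_sorted = np.sort((data - data.mean()) / data.std())

n = len(z_sorted)
empirical_p = (np.arange(1, n + 1) - 0.5) / n   # assumed plotting positions
theoretical_p = stats.norm.cdf(z_sorted)        # normal CDF at each sorted value

# Both axes run from 0 to 1; normal data should hug the diagonal.
max_gap = np.max(np.abs(empirical_p - theoretical_p))
print(f"largest vertical gap from the diagonal: {max_gap:.3f}")
```

Because both axes are probabilities, the tails are squeezed into the corners of the unit square, which is why this plot emphasizes the middle of the distribution.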

The word _skewed_ is not the best choice here: I'd say _transformed_. — Nick Cox, Nov 24 '15 at 08:41
Great explanation. Can you please explain how the x-axis (expected values) are generated ? — Wickkiey, Apr 06 '20 at 06:17
@VivekAnanthan The x-values are generated just as most other plots: they are the independent variables and you just choose them. For instance, if you have a normal distribution, then x=0 is the mean, -1 is 1 std dev below the mean, etc. Since these distributions are defined, you can calculate it. Let me pretend, but let's say that with 10 values, the extreme values *should* be at +/- 1.2 std dev, then the leftmost and rightmost points will have x=-1.2 and x=+1.2 and will align with the min and max. Make sense? — Mike Williamson, May 05 '20 at 10:16

score 2 · Answer 2 · answered Mar 09 '15 at 15:28

The y-axis shows values of the observed distribution, and the x-axis shows values of the theoretical distribution.

Each point is a quantile. Say there were 100 points on the plot: the first point (the one at the lower left) marks the upper bound of an interval that contains the smallest 1 percent of the data points of the corresponding distribution, when they are ordered from smallest to largest. Similarly, the 2nd point is the upper bound of an interval that contains the smallest 2 percent of the data points. This is the concept of a quantile. It is not limited to the case of 100 intervals, though; it is a general concept, and you can have as many intervals as you like, with that many quantiles describing the boundaries of the intervals.

What is special about this plot is that each point's position gives the actual value of the given quantile in both distributions, read off the corresponding axis. Thinking again of 100 such points (quantiles), this plot tells us that the smallest 1 percent of data points from the observed distribution lies in $(-\infty, -3.5]$, while the smallest 1 percent of data points from the theoretical distribution lies in $(-\infty, -3.2]$. This way you can see the location of each interval boundary in both distributions.

I referred to data points throughout my answer (ordered data points, etc.). That wording fits discrete distributions, but the concept can be generalized to continuous distributions.
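The quantile idea described above can be sketched numerically by comparing a few quantiles of a sample with the same quantiles of the standard normal. The sample, the seed and the chosen levels are assumptions for illustration; `np.quantile` interpolates between data points, which is one of several possible conventions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)  # fixed seed for reproducibility
data = rng.normal(size=1000)    # hypothetical sample
z = (data - data.mean()) / data.std()

levels = np.array([0.01, 0.02, 0.50, 0.98, 0.99])
observed_q = np.quantile(z, levels)        # boundaries in the observed data
theoretical_q = stats.norm.ppf(levels)     # boundaries in the standard normal

for lvl, o, t in zip(levels, observed_q, theoretical_q):
    print(f"{lvl:4.0%} of values lie below: observed {o:+.2f}, theoretical {t:+.2f}")
```

Each (theoretical, observed) pair corresponds to one point on the Q-Q plot.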

$R^2$ is a measure of how well the points fit the red line. If both axes had the same distribution, all of the points would lie exactly on the line and $R^2$ would equal 1. You can learn more about it in any text explaining linear regression.
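To see where the line and its $R^2$ come from, this sketch refits the plotted points with an ordinary least-squares regression and compares the result with what `probplot` returns. The sample and seed are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)  # fixed seed for reproducibility
z = rng.normal(size=1000)       # hypothetical sample

# Plotted points (osm = x, osr = y) and the fitted line reported by probplot.
(osm, osr), (slope, intercept, r) = stats.probplot(z, dist="norm")

# Ordinary least-squares fit of the same points.
fit = stats.linregress(osm, osr)

same_fit = np.allclose([slope, intercept, r],
                       [fit.slope, fit.intercept, fit.rvalue])
print(same_fit)
print(f"R^2 = {r ** 2:.4f}")
```

So the reported value is simply the squared correlation between the theoretical quantiles and the sorted data.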

The texts on linear regression will not explain, though, how to interpret $R^2$ when the points are as severely constrained as the ones on a QQ plot are! In particular, the points on a QQ plot must be monotonically non-decreasing. This forces $R^2$ to be extraordinarily high no matter what. — whuber, Nov 24 '15 at 14:51
