Can someone provide an intuition on why the higher moments of a probability distribution $p_X$, like the third and fourth moments, correspond to skewness and kurtosis respectively? Specifically, why does the deviation about the mean raised to the third or fourth power end up translating into a measure of skewness and kurtosis? Is there a way to relate this to the third or fourth derivatives of the function?

Consider this definition of skewness and kurtosis:

$$\begin{matrix} \text{Skewness}(X) = \mathbb{E}[(X - \mu_{X})^3] / \sigma^3, \\[6pt] \text{Kurtosis}(X) = \mathbb{E}[(X - \mu_{X})^4] / \sigma^4. \\[6pt] \end{matrix}$$

In these equations we raise the normalised value $(X-\mu)/\sigma$ to a power and take its expected value. It is not clear to me why raising the normalised random variable to the power of four gives "peakedness" or why raising the normalised random variable to the power of three should give "skewness". This seems magical and mysterious!
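For concreteness, here is a small numeric sketch of the two definitions above, treating a made-up sample (with its relative frequencies as the probabilities, and the population standard deviation) as the distribution:

```python
import math

def standardized_moment(xs, k):
    """E[((X - mu)/sigma)^k], treating the sample as the distribution."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)  # population sd
    return sum(((x - mu) / sigma) ** k for x in xs) / n

# A right-skewed sample: many small values, a few large ones.
sample = [1, 1, 1, 2, 2, 3, 4, 8, 12, 20]

print("skewness:", standardized_moment(sample, 3))  # positive: right skew
print("kurtosis:", standardized_moment(sample, 4))
```

The first two standardized moments come out as $0$ and $1$ by construction, which is why $k=3$ and $k=4$ are the first interesting cases.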

user248237
  • My intuition on skew is to note that the third power preserves sign. So if you have more large negative deviations from the mean than positive ones (put very simply), then you end up with a negatively skewed distribution. My intuition for the kurtosis is that the fourth power amplifies large deviations from the mean much more than the second power does, which is why we think of kurtosis as a measure of how fat the tails of a distribution are. Note that very large deviations of $x$ from the mean $\mu$ are raised to the fourth power, which amplifies them but ignores sign. – wolfsatthedoor Nov 09 '14 at 02:09
  • See http://stats.stackexchange.com/questions/84158/how-is-the-kurtosis-of-a-distribution-related-to-the-geometry-of-the-density-fun/84213#84213 – whuber Nov 09 '14 at 02:12
  • Since 4th powers are much more affected by outliers than 1st powers, I expect you'll gain little from looking at the fourth moment about the median -- at least if robustness was the aim. – Glen_b Nov 09 '14 at 08:01
  • First, note that these higher moments are not necessarily good/reliable measures of asymmetry/peakedness. That said, I think beams give a good physical intuition for the first three moments, e.g. mean = [beam balance/scale](https://en.wikipedia.org/wiki/Triple_beam_balance), variance = [cantilever flexure](https://en.wikipedia.org/wiki/Cantilever), skewness = [seesaw](https://en.wikipedia.org/wiki/Seesaw). – GeoMatt22 Dec 10 '16 at 04:12
  • You are right, the interpretation of kurtosis as measuring "peakedness" *is* magical and mysterious. That's because it's not at all true. Kurtosis tells you absolutely nothing about the peak. It measures the tails (outliers) only. It is easy to prove mathematically that the observations near the peak contribute a miniscule amount to the kurtosis measure, regardless of whether the peak is flat, spiked, bimodal, sinusoidal, or bell-shaped. – BigBendRegion Nov 21 '17 at 01:22
  • The answer to the follow-up is that the 4th moment around the median (and you should normalize by something like IQR not $\sigma$) also measures tails, not peak, because the data near the peak contribute virtually nothing to the measure, while the tails contribute virtually everything. It's the same logic as to why the ordinary Pearson kurtosis measures tails, not peak. If you want to measure the peak, use information about the peak. For example, the second derivative of the density at its mode is a measure of the peak. Something like that would make sense. – BigBendRegion Nov 17 '18 at 18:11
  • Kurtosis does not [exactly](https://math.stackexchange.com/questions/2040087/can-kurtosis-measure-peakedness/3781761#378176) measure peakedness, and peakedness is not the only aspect of a distribution that can cause high kurtosis. Kurtosis measures the tendency of values to be spread far out over a large distance relative to the distance of one standard deviation. It is an average of this tendency, and a high value can be due either to many values being a little far away or to a few values being very far away (the latter does not relate to a sharp peak but still gives high kurtosis). – Sextus Empiricus Jan 30 '22 at 08:55
  • One can get some intuition from the idea of *graphical moments*, see [some illustrations here](https://stats.stackexchange.com/a/362745/11887) – kjetil b halvorsen Jan 30 '22 at 14:29

2 Answers


There is a good reason for these definitions, which becomes clearer when you look at the general form for moments of standardised random variables. To answer this question, first consider the general form of the $k$th standardised central moment:$^\dagger$

$$\phi_k = \mathbb{E} \Bigg[ \Bigg( \frac{X - \mathbb{E}[X]}{\mathbb{S}[X]} \Bigg)^k \text{ } \Bigg].$$

The first two standardised central moments are the values $\phi_1=0$ and $\phi_2=1$, which hold for all distributions for which the above quantity is well-defined. Hence, we can consider the non-trivial standardised central moments that occur for values $k \geqslant 3$. To facilitate our analysis we define:

$$\begin{equation} \begin{aligned} \phi_k^+ &= \mathbb{E} \Bigg[ \Bigg| \frac{X - \mathbb{E}[X]}{\mathbb{S}[X]} \Bigg|^k \text{ } \Bigg| X > \mathbb{E}[X] \Bigg] \cdot \mathbb{P}(X > \mathbb{E}[X]), \\[8pt] \phi_k^- &= \mathbb{E} \Bigg[ \Bigg| \frac{X - \mathbb{E}[X]}{\mathbb{S}[X]} \Bigg|^k \text{ } \Bigg| X < \mathbb{E}[X] \Bigg] \cdot \mathbb{P}(X < \mathbb{E}[X]). \end{aligned} \end{equation}$$

These are non-negative quantities that give the $k$th absolute power of the standardised random variable conditional on it being above or below its expected value. We will now decompose the standardised central moment into these parts.


Odd values of $k$ measure the skew in the tails: For any odd value of $k \geqslant 3$ we have an odd power in the moment equation and so we can write the standardised central moment as $\phi_k = \phi_k^+ - \phi_k^-$. From this form we see that the standardised central moment gives us the difference between the $k$th absolute power of the standardised random variable, conditional on it being above or below its mean respectively.

Thus, for any odd power $k \geqslant 3$ we will get a measure that gives positive values if the expected absolute power of the standardised random variable is higher for values above the mean than for values below the mean, and gives negative values if the expected absolute power is lower for values above the mean than for values below the mean. Any of these quantities could reasonably be regarded as a measure of a type of "skewness", with higher powers giving greater relative weight to values that are far from the mean.

Since this phenomenon occurs for every odd power $k \geqslant 3$, the natural choice for an archetypal measure of "skewness" is to define $\phi_3$ as the skewness. This is a lower standardised central moment than the higher odd powers, and it is natural to explore lower-order moments before consideration of higher-order moments. In statistics we have adopted the convention of referring to this standardised central moment as the skewness, since it is the lowest standardised central moment that measures this aspect of the distribution. (The higher odd powers $k=5,7,9,\ldots$ also measure types of skewness, but with greater and greater emphasis on values far from the mean; these are sometimes called measures of "hyperskewness".)
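The odd-$k$ decomposition $\phi_k = \phi_k^+ - \phi_k^-$ is easy to check numerically; here is a sample-based sketch (the sample values are made up, and relative frequencies play the role of probabilities):

```python
import math

def phi_parts(xs, k):
    """Return (phi_k, phi_k_plus, phi_k_minus) for a sample,
    using relative frequencies as the probabilities."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    z = [(x - mu) / sigma for x in xs]
    phi = sum(t ** k for t in z) / n
    plus = sum(abs(t) ** k for t in z if t > 0) / n   # part above the mean
    minus = sum(abs(t) ** k for t in z if t < 0) / n  # part below the mean
    return phi, plus, minus

sample = [1, 1, 1, 2, 2, 3, 4, 8, 12, 20]
phi3, plus3, minus3 = phi_parts(sample, 3)
print(phi3, plus3 - minus3)  # the two agree for odd k
```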


Even values of $k$ measure fatness of tails: For any even value of $k \geqslant 3$ we have an even power in the moment equation and so we can write the standardised central moment as $\phi_k = \phi_k^+ + \phi_k^-$. From this form we see that the standardised central moment gives us the sum of the $k$th absolute power of the standardised random variable, conditional on it being above or below its mean respectively.

Thus, for any even power $k \geqslant 3$ we will get a measure that gives non-negative values, with higher values occurring if the tails of the distribution of the standardised random variable are fatter. Note that this is a result with respect to the standardised random variable, and so a change in scale (changing the variance) has no effect on this measure. Rather, it is effectively a measure of the fatness of the tails, after standardising for the variance of the distribution. Any of these quantities could reasonably be regarded as a measure of a type of "kurtosis", with higher powers giving greater relative weight to values that are far from the mean.

Since this phenomenon occurs for every even power $k \geqslant 3$, the natural choice for an archetypal measure of kurtosis is to define $\phi_4$ as the kurtosis. This is a lower standardised central moment than the higher even powers, and it is natural to explore lower-order moments before consideration of higher-order moments. In statistics we have adopted the convention of referring to this standardised central moment as the "kurtosis", since it is the lowest standardised central moment that measures this aspect of the distribution. (The higher even powers also measure types of kurtosis, but with greater and greater emphasis on values far from the mean; these are sometimes called measures of "hyperkurtosis".)
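The analogous numeric check for the even case, $\phi_k = \phi_k^+ + \phi_k^-$, using the same sample-based sketch:

```python
import math

def phi_parts(xs, k):
    """Return (phi_k, phi_k_plus, phi_k_minus) for a sample,
    using relative frequencies as the probabilities."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    z = [(x - mu) / sigma for x in xs]
    phi = sum(t ** k for t in z) / n
    plus = sum(abs(t) ** k for t in z if t > 0) / n   # part above the mean
    minus = sum(abs(t) ** k for t in z if t < 0) / n  # part below the mean
    return phi, plus, minus

sample = [1, 1, 1, 2, 2, 3, 4, 8, 12, 20]
phi4, plus4, minus4 = phi_parts(sample, 4)
print(phi4, plus4 + minus4)  # the two agree for even k
```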


$^\dagger$ This equation is well defined for any distribution whose first two moments exist, and which has non-zero variance. We will assume that the distribution of interest falls in this class for the rest of the analysis.

Ben

A similar question, *What's so 'moment' about 'moments' of a probability distribution?*, is worth comparing; I gave a physical answer there that addressed moments.

"Angular acceleration is the derivative of angular velocity, which is the derivative of angle with respect to time, i.e., $ \dfrac{d\omega}{dt}=\alpha,\,\dfrac{d\theta}{dt}=\omega$. Consider that the second moment is analogous to torque applied to a circular motion, or if you will an acceleration/deceleration (also second derivative) of that circular (i.e., angular, $\theta$) motion. Similarly, the third moment would be a rate of change of torque, and so on and so forth for yet higher moments to make rates of change of rates of change of rates of change, i.e., sequential derivatives of circular motion...."

See the link, as it is perhaps easier to visualize this with physical examples.

Skewness is easier to understand than kurtosis. Negative skewness indicates a heavier left tail (or more extreme outliers in the negative direction) than right tail, and positive skewness the opposite.

Wikipedia cites Westfall (2014) and implies that high kurtosis arises either for random variables that have far outliers or for density functions with one or two heavy tails while claiming that any central tendency of data or density has relatively little effect on the kurtosis value. Low values of kurtosis would imply the opposite, i.e., a lack of $x$-axis outliers and the relative lightness of both tails.
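The outlier-driven nature of kurtosis is easy to demonstrate numerically. This sketch uses the 20-point example from the comments below, where a single far value dominates the measure while the shape near the centre contributes almost nothing:

```python
import math

def kurtosis(xs):
    """Fourth standardized central moment of a sample (population sd)."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return sum(((x - mu) / sigma) ** 4 for x in xs) / n

base = [0, 3, 4, 1, 2, 3, 0, 2, 1, 3, 2, 0, 2, 2, 3, 2, 5, 2, 3]
with_outlier = base + [999]  # one far outlier

print(kurtosis(base))          # modest: no extreme tail values
print(kurtosis(with_outlier))  # about 18, driven almost entirely by 999
```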

Carl
  • Skewness is the balance point of the pdf of $Z^3$, and kurtosis is the balance point of the pdf of $Z^4$. Both transformations "stretch" the tails, kurtosis more. If the pdf of $Z^3$ falls to the right when a fulcrum is placed at 0, then there is positive skew in the original distribution. If the pdf of $Z^4$ falls to the right when a fulcrum is placed at 3.0, then the original distribution is heavier-tailed than the normal distribution. Here, "heaviness of tails" refers more precisely to leverage than to mass. Moors' interpretation is not quite right wrt both mentions of "concentration." – BigBendRegion Feb 02 '19 at 17:47
  • @PeterWestfall I agree that Moors' interpretation is imperfect. Precise language is not easily achievable without also being confusing. Take "leverage" for example. Leverage means first moment and one would have to invent something like "leveraged leverage" for the second moment, which might confuse more than illuminate. Your approach appears to invent a novel concept, i.e., "stretched leverage," which hints at geometric transforms for which one might also claim some advocates who favor it as self-consistent at the risk of also being controversial, and non-physical for others. – Carl Feb 02 '19 at 23:20
  • "Leverage" refers to the first moment of the variable $U$, where $U = Z^4$. It's not rocket science. – BigBendRegion Feb 03 '19 at 13:40
  • @PeterWestfall Not to be too punny, but you are leveraging leverage. Sure, you can still use the word, and if $Z^4$ were not a fourth dimensional object, as compared to a one dimensional distance, $Z$, it might even be useful. The context here is that of moments, and creating a physical model for moments. There are several ways that can be done, for example, see my answer about that [here](https://stats.stackexchange.com/a/324197/99274). In other words, to put moments into any physical context, we have to do more than hand-waving and invocation of the fourth dimension. – Carl Feb 04 '19 at 04:27
  • @PeterWestfall In the context of circular motion, we would call the second moment *torque*, and not the leverage of $Z^2$, which latter, although not incorrect, does not bring anything physical to mind. – Carl Feb 04 '19 at 04:31
  • Please reread my comment above. The leverage concept clarifies the meaning of kurtosis. Place a fulcrum at 3.0 on the horizontal line of the pdf of $U = Z^4$. Which direction does the pdf fall? To the right or to the left? That explain the tail concept pretty well, n'est-ce pas? No need for 4 dimensions. Just one - $Z^4$ is a one-dimensional random variable. The fulcrum lies on its one dimension. Not rocket science. – BigBendRegion Feb 04 '19 at 22:34
  • Also, the physical understanding of variance as the first moment of $(X-\mu)^2$ indeed provides an excellent visual representation to the understand the effect of outliers on variance. – BigBendRegion Feb 04 '19 at 23:49
  • @PeterWestfall Mean, fulcrum, axis of rotation, whatever, but not leverage. That latter is inaccurate. – Carl Feb 05 '19 at 14:07
  • Why? I thought that if a fulcrum is placed on a line, the forces acting on each side referred to leverage. Archimedes moving the Earth and all that. What am I missing? – BigBendRegion Feb 06 '19 at 13:14
  • Keeping this simple, one does not generally expect a derivative to have the same physical units as a function one is differentiating. For example, velocity as meters per second does not have the same units as distance in meters, and rate of change of torque does not have the same units as torque in newton meters. I would resist saying rate of change of leverage as leverage is measured in newton meters. So variance of "leverage" does not have the same units as "leverage". – Carl Feb 06 '19 at 15:30
  • Ok, so maybe kurtosis is better called "tail force" rather than "tail leverage"? Either way, Moors is off base - higher kurtosis is explained by tail force/leverage, not the "concentration" he mentions, as is obvious from the physical force/leverage 1-D moment representation of kurtosis. Here is a nice counterexample to Moors' two "concentration" interpretations: https://math.stackexchange.com/a/2510884/472987 : As kurtosis increases, mass concentrates neither near the mean nor in the tail. But the distribution of $Z^4$ is a simple three-point pdf showing the force/leverage concept nicely. – BigBendRegion Feb 06 '19 at 19:14
  • @PeterWestfall I think kurtosis for time as the independent variable would be rate of change of rate of change of rate of change of torque, e.g., that would be in Newton-meters per second$^3$. Force (e.g., in Newtons, where F=ma) is already ML$^2$T$^{-2}$ making kurtosis to have classical units (Mass, Length, Time) of ML$^3$T$^{-5}$. I can agree with this much, interpreting kurtosis is not trivial. – Carl Feb 06 '19 at 20:09
  • Hmmm, that's not particularly helpful. Kurtosis applies to all sciences, social, biological, medical, physical and engineering; in all cases it is simply the point of balance of the distribution of $U = Z^4$. That explains everything quite simply. – BigBendRegion Feb 06 '19 at 22:28
  • @PeterWestfall Is an axis of rotation a balance point? Kurtosis is the fourth moment about the mean, and sometimes kurtosis is undefined, e.g., for the very common Cauchy distribution, for the Student's $t$ distribution with df < 4, for the Pareto distribution with shape < 4, and so forth. Thus, the claim that it **is** some sort of balancing act is even more flawed than the concept of mean value being an "expectation." For example, the expected value of a Cauchy distribution is not meaningful. A mathematical form with a universal meaning is not so inconsistent. – Carl Feb 07 '19 at 03:20
  • Can you say "red herring"? Obviously we are referring to cases with finite fourth moment. My logic stands. Refute it if you can. – BigBendRegion Feb 08 '19 at 13:08
  • @PeterWestfall I am not refuting your internal sense of logic, only the words used to misrepresent what might otherwise be used to express something logical. In what physical dimension does this mythical shifted "balance" point exist? I have trouble picturing an offset "balance point" for an axis of rotation. Perhaps you are referring to locus of a higher moment of inertia, and the term "moments" derives comes from an inertial reference. Then, as now, physicality trumps hand-waving, and analogies are either exact or they are metaphysical. Consider the following sentence.... – Carl Feb 08 '19 at 13:41
  • @PeterWestfall con't ..." Notice that the parallel axis theorem is used to shift the moment of inertia from the center of mass to the pivot point of the pendulum" from [Wikipedia](https://en.wikipedia.org/wiki/Moment_of_inertia#Examples_2). Now the second moment is not the fourth, nevertheless any complete description of the fourth moment should either be general enough to fit various physical models, or it is at best insufficient. – Carl Feb 08 '19 at 13:49
  • Let $v = z^4$. The $v$ is in 1 dimension; no myth. Example: Data are 0, 3, 4, 1, 2, 3, 0, 2, 1, 3, 2, 0, 2, 2, 3, 2, 5, 2, 3, 999. $z$ values are −0.239, −0.225, −0.221, −0.234, −0.230, −0.225, −0.239, −0.230, −0.234, −0.225, −0.230, −0.239, −0.230, −0.230, −0.225, −0.230, −0.216, −0.230, −0.225, 4.359. $v = z^4$ values are 0.003, 0.003, 0.002, 0.003, 0.003, 0.003, 0.003, 0.003, 0.003, 0.003, 0.003, 0.003, 0.003, 0.003, 0.003, 0.003, 0.002, 0.003, 0.003, 360.976. Locate these points *one dimensionally* on the $v$ axis with relative frequency = mass. The point of balance = kurtosis, 18.05. – BigBendRegion Mar 03 '19 at 16:51
  • @PeterWestfall Alternatively, note that the fourth power of the $L^4$ norm of an $n$-space vector, **v**, is $\Sigma_{i=1}^n(v_i)^4$, see definition of [norm](https://en.wikipedia.org/wiki/Lp_space#Definition). Now note that statistics is a field where outcome vectors, **x**, are written $X$, which has caused me endless confusion because that not so subtle fact, that random variables are actually vectors, is implied everywhere and stated nowhere. To the point, when [kurtosis is calculated](https://en.wikipedia.org/wiki/Kurtosis#Pearson_moments), the RV is used as a vector. – Carl Mar 03 '19 at 19:04
  • @PeterWestfall This discussion is concerning something that is not falsifiable. I suggest reading the Wikipedia entry on [Not even wrong](https://en.wikipedia.org/wiki/Not_even_wrong) for an entertaining presentation of this. To wit, we are speaking to different contexts, and what you say is not wrong, just out of the context in which I am trying to understand physical moments. – Carl Mar 03 '19 at 20:29
  • The term "not falsifiable" connotes triviality. But no. The balance result is important. To get back to the point, the one-dimensional balance representation of kurtosis proves Moors' interpretation wrong: If the kurtosis of $X_2$ is larger than that of $X_1$ (say $\kappa_1$), then the pdf of $V_2$ falls to the right when a fulcrum is placed at $\kappa_1$. Thus, greater kurtosis implies neither (i) greater mass in the center of the distribution (the curve falls to the right, not to the left), nor (ii) more mass in the tail (place less mass farther out, and the curve still falls to the right). – BigBendRegion Mar 04 '19 at 17:00
  • @PeterWestfall I am not trivializing what you are saying, but there is a communication problem. First, we are talking about density functions. Density functions do not require anything to do with probability; they can, for example, represent the instantaneous amount of water in a lake. Second, density functions are more often than not everywhere continuous, and there is no requirement that they be RV's, and when they are discrete we call them mass functions. Third, if we are talking about RV's they are the same as $n$ dimensional vectors because that is how they are being used.... – Carl Mar 05 '19 at 01:29
  • @PeterWestfall ....to calculate kurtosis. I avoid calling density functions random variables even when probability is implied by the problem type because it leads to confusion and sloppy thinking, even though one can stretch this definition by using parametric equations. Fourth, the Wikipedia [kurtosis](https://en.wikipedia.org/wiki/Kurtosis#Interpretation) entry, language aside, states something similar to what you are saying directly above Moors' interpretation. Is that what you are paraphrasing? – Carl Mar 05 '19 at 01:39
  • @PeterWestfall Despite the language problems, I think I get your point, and have changed the post accordingly. If you are keeping score, you can take that as a win. – Carl Mar 05 '19 at 04:44