How to efficiently calculate Skewness and Kurtosis of data having value with repetitions?

Question

I am doing some research on stock data and am somewhat new to advanced statistics. The data looks like this, for example:

Price --> Volume

100 ---> 1234

101 ---> 123456

102 ---> 6678

103 ---> 3456

104 ---> 1115

Just by looking at the data we can say it is right-skewed with positive kurtosis, but how do you calculate this efficiently? I am trying to do it with Simple Statistics, which accepts an array of values. So to calculate it with that library I have to pass '100' 1234 times, '101' 123456 times, and so on. This is quite inefficient for large datasets. Is there a general method to calculate it? Also, any open-source software, preferably in JavaScript, would be the cherry on top.

Why do you have to repeat a value $123456$ times? // R has a package called “moments” that calculates skewness and kurtosis, and Python’s “scipy” package does the same. If you’re going to be doing work in statistics, you’ll do yourself a favor by learning at least one of those languages instead of relying on JavaScript. — Dave, Jan 12 '21 at 10:51
You can think of it like this: in the dataset, the value 100 is repeated 1234 times, the value 101 is repeated 123456 times, and so on. They are stored in a database. So rather than providing an array of 100s, 101s and so on, is there an easier way, since we already have the number of times each value is repeated? The problem would be the same whether in R, Python, or any other language. — rockfight, Jan 12 '21 at 13:39
In R: “moments::skewness(c(rep(100, 1234), rep(101, 123456), rep(102, 6678), rep(103, 3456), rep(104, 1115)))” — Dave, Jan 12 '21 at 14:01

score 2 · Accepted Answer · answered Jan 12 '21 at 14:38

We can take many approaches to arrive at the solution. A conceptually and mathematically elegant one begins by characterizing the skewness and kurtosis as properties of the empirical distribution. This is the distribution of the random variable $X$ defined by putting $n_1$ tickets labeled with the number $x_1,$ $n_2$ tickets labeled with $x_2,$ and so on, into a box and withdrawing one of those tickets at random. This latter phrase means each ticket has the same chance of being withdrawn. Consequently, the chance $p_i$ of observing $x_i$ (for any $i$) must be the number of tickets with $x_i$ on them (equal to $n_i,$ provided all the $x_i$ are distinct) divided by the total number of tickets. Thus, when there are $d$ distinct values on the tickets,

$$p_i = \frac{n_i}{n_1+n_2+\cdots+n_d}.$$

The rest is a matter of applying the definitions.

The mean (aka "expectation") is the sum of the values times their probabilities, $$\mu_1 = E[X] = \sum_{i=1}^d p_i x_i.$$
For any $k=1,2,3,\ldots,$ the $k^\text{th}$ central moment is the expectation of $(X-\mu_1)^k:$ $$\mu_k = E[(X-\mu_1)^k] = \sum_{i=1}^d p_i (x_i - \mu_1)^k.$$
For any such $k,$ the standardized central moment $\beta_k$ is the expectation of $Z^k$ where $Z = (X-\mu_1)/\sqrt{\mu_2}.$ With a little algebra this simplifies to $$\beta_k = \frac{\mu_k}{\mu_2^{k/2}}.$$
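
The "little algebra" can be written out in one line. It uses only the linearity of expectation and the fact that $\mu_2^{k/2}$ is a constant, and it assumes $\mu_2 \gt 0$ (that is, at least two distinct values occur):

$$\beta_k = E[Z^k] = E\left[\left(\frac{X-\mu_1}{\sqrt{\mu_2}}\right)^k\right] = \frac{E[(X-\mu_1)^k]}{\left(\sqrt{\mu_2}\right)^k} = \frac{\mu_k}{\mu_2^{k/2}}.$$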

The skewness is $\beta_3$ and the kurtosis is $\beta_4.$ (Sometimes "kurtosis" refers to the "excess kurtosis," which is $\beta_4 - 3.$)
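
Since the question asked for JavaScript, here is a minimal sketch of these definitions applied to the price/volume pairs from the question. The function name `standardizedMoment` and its argument layout are illustrative assumptions; it is plain JavaScript and not part of Simple Statistics or any other library.

```javascript
// Standardized central moment beta_k of a frequency table.
// `values` holds the distinct x_i and `counts` the matching n_i.
// (Hypothetical helper written for illustration only.)
function standardizedMoment(values, counts, k) {
  const total = counts.reduce((sum, n) => sum + n, 0);
  // p_i = n_i / (n_1 + ... + n_d)
  const p = counts.map((n) => n / total);
  // mu_1 = sum of p_i * x_i
  const mean = values.reduce((sum, x, i) => sum + p[i] * x, 0);
  // Residuals x_i - mu_1, computed once and reused for every moment.
  const residuals = values.map((x) => x - mean);
  // mu_j = sum of p_i * (x_i - mu_1)^j
  const central = (j) =>
    residuals.reduce((sum, r, i) => sum + p[i] * r ** j, 0);
  // beta_k = mu_k / mu_2^(k/2)
  return central(k) / central(2) ** (k / 2);
}

// Price/volume pairs from the question: no need to repeat any value.
const prices = [100, 101, 102, 103, 104];
const volumes = [1234, 123456, 6678, 3456, 1115];

const skewness = standardizedMoment(prices, volumes, 3);
const kurtosis = standardizedMoment(prices, volumes, 4);
console.log({ skewness, kurtosis, excessKurtosis: kurtosis - 3 });
```

The work is proportional to the number of distinct values $d$ rather than to the total count, which is the point of the approach.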

Example

Here is a simplified version of the data in the question, where the counts have been reduced so the arithmetic details are less distracting.

$$\begin{array}{ccc} i & x_i & n_i \\ \hline 1 & 100 & 1\\ 2 & 101 & 9\\ 3 & 102 & 6\\ 4 & 103 & 3\\ 5 & 104 & 1 \end{array}$$

These are the calculations:

$n_1 + \cdots + n_d = 1+9+6+3+1=20.$
$p_1=1/20, p_2=9/20, p_3=6/20, p_4=3/20, p_5=1/20.$
$\mu_1 = \frac{1}{20}(100) + \frac{9}{20}(101) + \frac{6}{20}(102) + \frac{3}{20}(103) + \frac{1}{20}(104) = 101.7.$
$\mu_2 = \frac{1}{20}(100-101.7)^2 + \cdots + \frac{1}{20}(104-101.7)^2 = 0.91.$
$\mu_3 = \frac{1}{20}(100-101.7)^3 + \cdots + \frac{1}{20}(104-101.7)^3 = 0.546.$
$\mu_4 = \frac{1}{20}(100-101.7)^4 + \cdots + \frac{1}{20}(104-101.7)^4 = 2.3557.$
$\beta_3 = 0.546 / (0.91^{3/2}) \approx 0.628971.$
$\beta_4 = 2.3557 / (0.91^{4/2}) \approx 2.8447.$
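
As a check on the arithmetic, here is the second central moment with every term written out. The residuals $x_i - \mu_1$ are $-1.7, -0.7, 0.3, 1.3, 2.3$:

$$\mu_2 = \frac{1}{20}(-1.7)^2 + \frac{9}{20}(-0.7)^2 + \frac{6}{20}(0.3)^2 + \frac{3}{20}(1.3)^2 + \frac{1}{20}(2.3)^2 = \frac{2.89 + 4.41 + 0.54 + 5.07 + 5.29}{20} = \frac{18.2}{20} = 0.91.$$

The third and fourth moments follow the same pattern with the exponent changed to $3$ and $4.$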

In many computing platforms it is convenient to compute the residuals $x_i-\mu_1$ once and for all because these terms recur in all the formulas for the $\mu_k.$ This leads to easy spreadsheet formulas and to straightforward code, such as this R example of the preceding calculations. Their connections to the foregoing formulas should be obvious.

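x <- 100:104                  # distinct values x_i
n <- c(1,9,6,3,1)             # counts n_i
p <- n / sum(n)               # probabilities p_i
mu.1 <- sum(p*x)              # mean
r <- x - mu.1                 # residuals, computed once
mu.2 <- sum(p*r^2)            # central moments of order 2, 3, 4
mu.3 <- sum(p*r^3)
mu.4 <- sum(p*r^4)
beta.3 <- mu.3 / mu.2^(3/2)   # skewness
beta.4 <- mu.4 / mu.2^(4/2)   # kurtosis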
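
To see that nothing is lost by working from counts, the following JavaScript sketch compares the two routes on the simplified example: expanding the table into 20 individual observations (the "repeat each value" approach from the question) versus using the counts directly as in the R code. Both function names are hypothetical and written only for this comparison.

```javascript
// Moments of a plain array in which every observation has equal weight.
// This is the "repeat each value n_i times" approach.
function momentsFromArray(data) {
  const size = data.length;
  const mean = data.reduce((s, v) => s + v, 0) / size;
  const central = (k) =>
    data.reduce((s, v) => s + (v - mean) ** k, 0) / size;
  const mu2 = central(2);
  return { skewness: central(3) / mu2 ** 1.5, kurtosis: central(4) / mu2 ** 2 };
}

// The same moments from (value, count) pairs, mirroring the R code.
function momentsFromCounts(values, counts) {
  const total = counts.reduce((s, c) => s + c, 0);
  const p = counts.map((c) => c / total);
  const mu1 = values.reduce((s, v, i) => s + p[i] * v, 0);
  const r = values.map((v) => v - mu1);
  const mu = (k) => r.reduce((s, ri, i) => s + p[i] * ri ** k, 0);
  const mu2 = mu(2);
  return { skewness: mu(3) / mu2 ** 1.5, kurtosis: mu(4) / mu2 ** 2 };
}

const x = [100, 101, 102, 103, 104];
const n = [1, 9, 6, 3, 1];

// Expand the table into individual observations: [100, 101, 101, ..., 104].
const expanded = x.flatMap((xi, i) => Array(n[i]).fill(xi));

console.log(momentsFromArray(expanded));
console.log(momentsFromCounts(x, n));
```

Up to floating-point rounding, both calls should agree with each other and with the values $\beta_3 \approx 0.628971$ and $\beta_4 \approx 2.8447$ worked out above.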
