Fitting Pareto distribution to data example in SciPy

Question

In docs.scipy.org there's code to sample data from a Pareto distribution and then fit a curve on top of the sampled data. I could understand most of the code snippet except the term max(count)*fit/max(fit) in the call to plt.plot.

Here's the code snippet:

import matplotlib.pyplot as plt
a, m = 3., 2.  # shape and mode
s = (np.random.pareto(a, 1000) + 1) * m
count, bins, _ = plt.hist(s, 100, normed=True)
fit = a*m**a / bins**(a+1)
plt.plot(bins, max(count)*fit/max(fit), linewidth=2, color='r')
plt.show()

I thoroughly searched the web for the formula: max(count)*fit/max(fit) Even replaced the term 'fit' with pdf. But could not get any leads. Kindly explain the concept of what the formula is conveying.

I assumed the term 'fit' is used instead of PDF as they are using the formula of PDF for Pareto distribution for fit.

I am voting to leave this open. Even though it's written with code, the question is really about a formula that could be written in any language or just as a formula. — Peter Flom, Mar 10 '20 at 11:10

score 0 · Answer 1 · answered Feb 09 '22 at 03:27

You're correct that fit is defined as the PDF of the Pareto distribution. And as expected, plotting fit by itself (rather than max(count)*fit/max(fit)) produces a line that closely approximates the histogram of random samples.

Dividing fit by max(fit) normalizes fit so that its largest value is 1. Multiplying by max(count) moves this largest value of 1 up to the height of the tallest bar in the histogram. This ensures that the PDF and histogram start at the same y-value, but I'm not sure why this is better than just plotting the unscaled PDF, already computed as fit.

Link to latest documentation (with the same plotting code): https://numpy.org/doc/stable/reference/random/generated/numpy.random.pareto.html

My slight modification, using a=30 for a clearer plot:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)

a, m = 30., 2.  # shape and mode
s = (np.random.pareto(a, 1000) + 1) * m
fit = a*m**a / bins**(a+1)

count, bins, _ = plt.hist(s, 100, density=True, label='1000 Pareto samples, a=30')
plt.plot(bins, max(count)*fit/max(fit), linewidth=2, color='r', label='max(count)*fit/max(fit)')
plt.plot(bins, fit, linewidth=2, color='g', label='fit')

plt.legend()
plt.show()

Fitting Pareto distribution to data example in SciPy

1 Answers1