3

In docs.scipy.org there's code to sample data from a Pareto distribution and then fit a curve on top of the sampled data. I could understand most of the code snippet except the term max(count)*fit/max(fit) in the call to plt.plot.

Here's the code snippet:

import matplotlib.pyplot as plt
a, m = 3., 2.  # shape and mode
s = (np.random.pareto(a, 1000) + 1) * m
count, bins, _ = plt.hist(s, 100, normed=True)
fit = a*m**a / bins**(a+1)
plt.plot(bins, max(count)*fit/max(fit), linewidth=2, color='r')
plt.show()

I thoroughly searched the web for the formula: max(count)*fit/max(fit) Even replaced the term 'fit' with pdf. But could not get any leads. Kindly explain the concept of what the formula is conveying.

I assumed the term 'fit' is used instead of PDF as they are using the formula of PDF for Pareto distribution for fit.

jwalton
  • 153
  • 1
  • 4
Bipin
  • 131
  • 3
  • 3
    I am voting to leave this open. Even though it's written with code, the question is really about a formula that could be written in any language or just as a formula. – Peter Flom Mar 10 '20 at 11:10

1 Answers1

0

You're correct that fit is defined as the PDF of the Pareto distribution. And as expected, plotting fit by itself (rather than max(count)*fit/max(fit)) produces a line that closely approximates the histogram of random samples.

Dividing fit by max(fit) normalizes fit so that its largest value is 1. Multiplying by max(count) moves this largest value of 1 up to the height of the tallest bar in the histogram. This ensures that the PDF and histogram start at the same y-value, but I'm not sure why this is better than just plotting the unscaled PDF, already computed as fit.

Link to latest documentation (with the same plotting code): https://numpy.org/doc/stable/reference/random/generated/numpy.random.pareto.html

My slight modification, using a=30 for a clearer plot:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)

a, m = 30., 2.  # shape and mode
s = (np.random.pareto(a, 1000) + 1) * m
fit = a*m**a / bins**(a+1)

count, bins, _ = plt.hist(s, 100, density=True, label='1000 Pareto samples, a=30')
plt.plot(bins, max(count)*fit/max(fit), linewidth=2, color='r', label='max(count)*fit/max(fit)')
plt.plot(bins, fit, linewidth=2, color='g', label='fit')

plt.legend()
plt.show()

Comparison of histogram, non-normalized fit, and normalized fit