What is the purpose of multiplying by the difference between the midpoints of two bins in this recipe?

Question

While browsing for information on how I might plot a fitted normal curve over a histogram, I found the following:

http://www.statmethods.net/graphs/density.html

There is a line I don't fully understand, though I recognize that it really does work:

yfit <- yfit*diff(h$mids[1:2])*length(x)

Here, yfit is initially a list of values drawn from the pdf of an inferred normal distribution at regular intervals along the x-axis, length(x) is the number of observations in a list x from which a histogram was prepared, and diff(h$mids[1:2]) is the difference between the midpoints of the second and first bars of said histogram on the x-axis. After this statement is run, yfit becomes itself multiplied by those other two terms.

I understand that multiplying by length(x) makes sense, as this turns values of a probability density function into numbers of observations around each respective value (taking into account that a continuous pdf is being used here, so the number of observations at any single point is zero).

I don't understand why it is necessary to multiply by diff(h$mids[1:2]) to get the right outcome in the graph, although I can confirm that it does get the right outcome.

Does anyone have an explanation?

Perhaps you will find this question answered at http://stats.stackexchange.com/questions/4220 or even http://stats.stackexchange.com/questions/133369. If not, then please edit it to explain what the terms in this code mean: it's important that your question be understandable on its own without requiring readers to visit another site. — whuber, May 26 '16 at 20:30
I believe it is *precisely* about not understanding that `yfit` is a density and that the "histogram" you mention is not a histogram at all, but rather is a bar chart (showing frequencies rather than frequency densities). I see nothing `R`-specific about this procedure, which is a standard one. — whuber, May 26 '16 at 20:54
Think of it this way: `yfit` gives the heights of rectangles. `diff(h$mids[1:2])` gives their bases. The product gives their areas. The so-called "histogram" is plotting *areas* (that is, frequencies) by means of bars whose *heights* represent the areas. So it all comes down to the formula for the area of any rectangle, area = base * height. This is explained in the links I first provided. — whuber, May 26 '16 at 20:59
`diff(h$mids[1:2])*length(x)` is the same as doing `h$counts/h$density`. It is a multiplier which takes yfit (which is a density) and scales it to the frequencies exhibited in your data. — Mina, May 26 '16 at 21:04
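
The identity in the last comment can be checked numerically. The following is a minimal sketch, not part of the original recipe: the simulated sample, the seed, and the helper names `binwidth`, `multiplier`, `nonempty` and `ratio` are all assumptions made for illustration, and it assumes equal-width bins that cover every observation.

```r
set.seed(1)                           # assumed seed, only for reproducibility
x <- rnorm(200, mean = 10, sd = 2)    # hypothetical sample, not the original data
h <- hist(x, breaks = 10, plot = FALSE)

# With equal-width bins, the gap between adjacent midpoints is the bin width.
binwidth   <- diff(h$mids[1:2])
multiplier <- binwidth * length(x)

# counts / density is 0/0 for empty bins, so compare non-empty bins only.
nonempty <- h$counts > 0
ratio    <- h$counts[nonempty] / h$density[nonempty]

# TRUE when every non-empty bin has counts / density equal to the multiplier.
print(all.equal(ratio, rep(multiplier, sum(nonempty))))
```

The comparison uses `all.equal` rather than `==` because the midpoints and densities are floating-point values.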

score 0 · Answer 1 · edited Sep 08 '17 at 11:50

The histogram here is a bar chart of counts. For each bar, area = height times base, where the height is `yfit` from `dnorm` and the base is the bin width `diff(h$mids[1:2])`. That area is (approximately) the probability of an observation falling in the bin. Multiplying this probability by the number of observations gives the final yfit, which is the frequency of occurrence. The classical formula is

$\text{prob} = \frac{\text{freq. of occurrence}}{\text{total possible occurrences}}$

Here is the mapping to classical formula

yfit             =     yfit * diff(h$mids[1:2])  *   length(x)
freq occurrence  =     probability area          *   total occurrences
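
As a rough illustration of this mapping, here is a sketch of the whole recipe on simulated data. The sample, the seed, and the names `binwidth`, `yfit_density`, `yfit_counts`, `expected` and `coverage` are assumptions introduced for this example and do not come from the linked page.

```r
set.seed(42)                          # assumed seed, only for reproducibility
x <- rnorm(200, mean = 10, sd = 2)    # hypothetical data

h <- hist(x, breaks = 10, plot = FALSE)
binwidth <- diff(h$mids[1:2])         # base of each bar (equal-width bins assumed)

xfit <- seq(min(x), max(x), length.out = 40)
yfit_density <- dnorm(xfit, mean = mean(x), sd = sd(x))  # height: probability per unit of x
yfit_counts  <- yfit_density * binwidth * length(x)      # height * base * n: expected count

# Expected count per bin under the fitted normal (midpoint approximation).
expected <- dnorm(h$mids, mean = mean(x), sd = sd(x)) * binwidth * length(x)
coverage <- sum(expected) / length(x) # close to 1 if the bins span most of the curve

if (interactive()) {
  plot(h, main = "Counts with fitted normal curve")  # bar heights are counts
  lines(xfit, yfit_counts, col = "blue", lwd = 2)
}
```

An alternative that avoids the rescaling is to draw the bars on the density scale with `hist(x, freq = FALSE)` and overlay the unscaled `yfit_density`.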
