While browsing for information on how I might plot a fitted normal curve over a histogram, I found the following:

http://www.statmethods.net/graphs/density.html

There is a line I don't fully understand, though I recognize that it really does work:

yfit <- yfit*diff(h$mids[1:2])*length(x)

Here, yfit is initially a list of values drawn from the pdf of an inferred normal distribution at regular intervals along the x-axis, length(x) is the number of observations in a list x from which a histogram was prepared, and diff(h$mids[1:2]) is the difference between the midpoints of the second and first bars of said histogram on the x-axis. After this statement is run, yfit becomes itself multiplied by those other two terms.
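For concreteness, the setup described above can be sketched as follows. This is only an illustration under stated assumptions: the simulated data `x`, the seed, and the choice of 40 grid points are hypothetical and not taken from the linked page; `dens` is an extra name introduced to keep the unscaled density for comparison.

```r
# Hypothetical data standing in for the original observations
set.seed(123)
x <- rnorm(100, mean = 20, sd = 6)

# hist() draws bars whose heights are counts and returns the bin details
h <- hist(x, breaks = 10, main = "Histogram with normal curve")

# Evaluate the fitted normal pdf at regular intervals along the x-axis
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
dens <- yfit  # unscaled density values, kept only for comparison

# The line in question: rescale the density to the count scale
yfit <- yfit * diff(h$mids[1:2]) * length(x)

# Overlay the rescaled curve on the histogram
lines(xfit, yfit, col = "blue", lwd = 2)
```

Plotting `dens` instead of the rescaled `yfit` would give a curve that hugs the x-axis, because density values are much smaller than counts.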

I understand that multiplying by length(x) makes sense, since it turns values of a probability density function into numbers of observations around each respective value (bearing in mind that the pdf is continuous, so the number of observations expected at any single point is zero).

I don't understand why it is necessary to multiply by diff(h$mids[1:2]) to get the right outcome in the graph, although I can confirm that it does get the right outcome.

Does anyone have an explanation?

readyready15728
  • Perhaps you will find this question answered at http://stats.stackexchange.com/questions/4220 or even http://stats.stackexchange.com/questions/133369. If not, then please edit it to explain what the terms in this code mean: it's important that your question be understandable on its own without requiring readers to visit another site. – whuber May 26 '16 at 20:30
  • I believe it is *precisely* about not understanding that `yfit` is a density and that the "histogram" you mention is not a histogram at all, but rather is a bar chart (showing frequencies rather than frequency densities). I see nothing `R`-specific about this procedure, which is a standard one. – whuber May 26 '16 at 20:54
  • So do you know why diff(h$mids[1:2]) is needed? – readyready15728 May 26 '16 at 20:55
  • Think of it this way: `yfit` gives the heights of rectangles. `diff(h$mids[1:2])` gives their bases. The product gives their areas. The so-called "histogram" is plotting *areas* (that is, frequencies) by means of bars whose *heights* represent the areas. So it all comes down to the formula for the area of any rectangle, area = base * height. This is explained in the links I first provided. – whuber May 26 '16 at 20:59
  • `diff(h$mids[1:2])*length(x)` is the same as `h$counts/h$density`. It is a multiplier that takes yfit (which is a density) and scales it to the frequencies exhibited in your data. – Mina May 26 '16 at 21:04
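
The identity mentioned in the last comment can be checked numerically. The following is a minimal sketch, assuming simulated data and equal-width bins (which is what `hist()` produces when `breaks` is a single number); the data and variable names are hypothetical.

```r
# Hypothetical data; any numeric vector would do
set.seed(1)
x <- rnorm(200, mean = 20, sd = 5)

# Compute the bins without drawing anything
h <- hist(x, breaks = 10, plot = FALSE)

# The multiplier from the question: bin width times number of observations
scale_factor <- diff(h$mids[1:2]) * length(x)

# counts / density is 0/0 (NaN) for empty bins, so compare non-empty bins only
nonempty <- h$counts > 0
ratio <- h$counts[nonempty] / h$density[nonempty]

# TRUE when every non-empty bin gives the same multiplier
print(isTRUE(all.equal(ratio, rep(scale_factor, sum(nonempty)))))
```

The check relies on `h$density` being defined as the count divided by (number of observations times bin width), so the ratio is constant only when all bins share one width.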

1 Answer

The histogram here is a bar chart of frequencies. Multiplying the density height (yfit from dnorm) by the base diff(h$mids[1:2]) gives an area, which is approximately the probability that an observation falls in that bin. Multiplying that probability by the number of observations gives the final yfit, which is a frequency of occurrence. The classical formula is

$\text{probability} = \frac{\text{frequency of occurrence}}{\text{total number of occurrences}}$

Here is the mapping to classical formula

yfit             =  yfit * diff(h$mids[1:2])  *  length(x)
freq occurrence  =  probability area          *  total occurrences
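
To illustrate the mapping, here is a rough sketch that compares the "height times base" probability with the exact bin probability under the same fitted normal, then scales both to frequencies. The simulated data and all variable names are assumptions made for the example.

```r
# Hypothetical data
set.seed(42)
x <- rnorm(500, mean = 20, sd = 5)
h <- hist(x, breaks = 10, plot = FALSE)

width <- diff(h$mids[1:2])   # base of each bar (common bin width)
n     <- length(x)           # total occurrences
m     <- mean(x)
s     <- sd(x)

# Probability area: density height at the bin midpoint times the base
prob_area <- dnorm(h$mids, mean = m, sd = s) * width

# Exact probability of each bin under the fitted normal, for comparison
prob_exact <- diff(pnorm(h$breaks, mean = m, sd = s))

# Frequency of occurrence = probability * total occurrences
freq_area  <- prob_area * n
freq_exact <- prob_exact * n

print(round(cbind(mid = h$mids, observed = h$counts,
                  from_area = freq_area, exact = freq_exact), 1))
```

The `from_area` and `exact` columns should be close to each other, with the gap shrinking as the bins get narrower relative to the standard deviation; the `observed` column differs from both by sampling noise.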
Ferdi