Using either truehist() from MASS or just the normal hist() function in R with the prob=TRUE option, I'm getting very strange values for the y-axis. I was under the impression that these values should all be below 1.00, as the relative frequency of any value should be below 1.00 and the area under the curve adds to that.

Instead, I'm getting axes with ranges toward 1500, and step sizes in the hundreds. Does anyone know what's going on? The values aren't even consistent, so it doesn't seem like they've got any relative scaling to them. For reference, I'm using the following code:

hist(g1$Betweenness, main="", xlab="Betweenness", sub="Generation 1", prob=TRUE)

The data for one such plot: 0.009619951 0.009619951 0.006750843 0.006750843 0.006750843 0.006750843 0.014497435 0.006750843 0.006750843 0.006750843 0.006750843 0.006750843 0.006750843 0.006750843 0.006750843 0.006750843 0.006750843 0.006750843 0.006750843 0.006750843 0.006750843 0.006750843 0.006750843 0.006750843 0.008663582 0.008663582 0.006750843 0.012058693 0.012489059 0.024587132 0.084941213 0.01248905 0.012489059

Annoyingly, JMP handles this just fine, but I've come to prefer R's plotting style.

Fomite
  • Can you post a plot of what you're getting? I get a very sensible histogram based on the data you posted. Step size is 0.01 and densities from 0 to 80. – cardinal Oct 19 '11 at 09:09
  • Also, note you can do, for example, `hst – cardinal Oct 19 '11 at 09:11
  • @cardinal I get the same plot, but from the R documentation, saying freq = FALSE or prob = TRUE should result in a histogram area of 1. What does a density of 80 *mean*? The breaks do indeed sum to 1. – Fomite Oct 19 '11 at 09:12
  • Well, since in this case the step size is 0.01, it means that $80 \cdot 0.01 = 0.8 = 80\%$ of the data lie within that bin. Same interpretation as the area under the curve of a pdf taken over some interval. :) – cardinal Oct 19 '11 at 09:14
  • @cardinal And now it begins to make more sense. I was missing the multiplied by step size bit, so was expecting to see something in the range of 0.8 and instead saw 80. – Fomite Oct 19 '11 at 09:17
  • To try to keep my head straight, I tend to think about histogram density plots in terms of Riemann sums. But, I don't know if that works for others. – cardinal Oct 19 '11 at 09:30
  • This is really the question at http://stats.stackexchange.com/questions/4220 in a different guise. – whuber Jul 25 '13 at 13:35
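
The point made in the comments (density times bin width gives the share of the data in a bin) can be checked directly. The following is a minimal sketch, assuming the values posted in the question are the whole data set; the names `x`, `h`, `bin.width` and `bin.prop` are illustrative and not from the original post.

```r
# Data copied from the question
x <- c(0.009619951, 0.009619951, 0.006750843, 0.006750843, 0.006750843,
       0.006750843, 0.014497435, 0.006750843, 0.006750843, 0.006750843,
       0.006750843, 0.006750843, 0.006750843, 0.006750843, 0.006750843,
       0.006750843, 0.006750843, 0.006750843, 0.006750843, 0.006750843,
       0.006750843, 0.006750843, 0.006750843, 0.006750843, 0.008663582,
       0.008663582, 0.006750843, 0.012058693, 0.012489059, 0.024587132,
       0.084941213, 0.01248905, 0.012489059)

# Compute the histogram without drawing it
h <- hist(x, plot = FALSE)

# Width of each bin and the share of the data falling in it
bin.width <- diff(h$breaks)
bin.prop  <- h$density * bin.width

# Density can be far above 1, while the proportions stay in [0, 1]
print(data.frame(lower = head(h$breaks, -1),
                 density = h$density,
                 proportion = bin.prop))

# The Riemann sum of the density over all bins is the total area
print(sum(bin.prop))
```

The `proportion` column is what most people expect to see on the y-axis; the `density` column is what `prob=TRUE` actually plots.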

3 Answers

One explanation is that the standard deviation of your data is much less than one, and the histogram is giving something like the probability density.

For example, see how the density on the histogram changes when I divide a uniform random variable with range (0, 1) by 1000:

set.seed(4444)
x <- runif(100)
y <- x / 1000

par(mfrow=c(2,1))
hist(x, prob=TRUE)
hist(y, prob=TRUE)

If you want more intuitive looking density values, you could possibly change the units of the variable.
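
As a rough illustration of the unit change, the sketch below multiplies the small-scale variable by 1000 and reuses the same bins on the new scale. The names `h.orig`, `y.scaled` and `h.scaled` are hypothetical, and the factor of 1000 is only chosen to mirror the example above.

```r
set.seed(4444)
y <- runif(100) / 1000

# Histogram on the original scale (not drawn)
h.orig <- hist(y, plot = FALSE)

# Same data expressed in units that are 1000 times smaller,
# binned with the same breaks converted to the new units
y.scaled <- y * 1000
h.scaled <- hist(y.scaled, breaks = h.orig$breaks * 1000, plot = FALSE)

# Densities shrink by the same factor the units grew by
print(range(h.orig$density))
print(range(h.scaled$density))

# The total area is 1 on either scale
print(sum(h.orig$density * diff(h.orig$breaks)))
print(sum(h.scaled$density * diff(h.scaled$breaks)))
```

Only the axis labels change: the shape of the histogram and the bin counts are the same in both cases.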

Jeromy Anglim
  • That looks to indeed be the answer. I'm hesitant to rescale it - may just have to do some adjustments to what I intended to graph. Thanks. – Fomite Oct 19 '11 at 09:06

As others have noted, freq=FALSE only makes the integral over the histogram equal to 1, not the sum of the bar heights. (The argument probability=TRUE is only there for S compatibility, by the way, and is arguably a misnomer; "probability density" would be more accurate.)

Here is some code to relabel the y axis to plot probabilities as tick marks.

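my.data <- rnorm(2000)

# Draw the count histogram without a y axis; a custom one is added below
my.hist <- hist(my.data, breaks=100, yaxt='n', ylab="Probability")

l <- length(my.data)
# Largest per-bin probability (bin count divided by sample size)
max.prob <- max(my.hist$counts)/l
# Round probability labels; the last one is dropped
tick.labels <- head(pretty(c(0, max.prob)), -1)
# Positions of those labels on the count scale
ticks <- tick.labels * l
print(tick.labels)
# The per-bin probabilities sum to 1
print(sum(my.hist$counts/l))

axis(2, at=ticks, labels=tick.labels)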

See this image for an example output:

Histogram with probability y axis
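
A possible alternative to relabelling the axis, sketched here under the assumption that all bins have equal width, is to overwrite the counts stored in the histogram object with relative frequencies and plot the object. This is not the approach used in the code above; the name `h` is illustrative.

```r
set.seed(42)
my.data <- rnorm(2000)

# Compute the histogram without drawing it
h <- hist(my.data, breaks = 100, plot = FALSE)

# Replace the counts by relative frequencies (each between 0 and 1)
h$counts <- h$counts / sum(h$counts)

# freq = TRUE makes plot() use the (now rescaled) counts for the bar heights
plot(h, freq = TRUE, main = "", ylab = "Relative frequency")
```

The bar heights now sum to 1, which is a different quantity from the density, whose area sums to 1.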

quazgar

If you pass probability=TRUE (or freq=FALSE) you should indeed see densities on the plot.

Note that densities can still be above 1 if your number of breaks is relatively low and your bin widths are small (way below 1), because each density is the bin's share of the data divided by the bin width. Looking at the code of hist.default you can see that the densities are calculated as dens <- counts/(n * diff(breaks)).
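
The formula can be reproduced by hand. This is a small sketch on simulated data with a standard deviation of about 0.001; the data and the name `dens.by.hand` are assumptions made for illustration.

```r
set.seed(1)
x <- rnorm(500) / 1000   # hypothetical data with a very small spread

# Compute the histogram without drawing it
h <- hist(x, plot = FALSE)

# Density per bin: count / (sample size * bin width)
n <- length(x)
dens.by.hand <- h$counts / (n * diff(h$breaks))

# Compare with what hist() stores in the density component
print(all.equal(dens.by.hand, h$density))

# With bin widths this small, the densities end up far above 1
print(max(h$density))
```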

What goes wrong in your case is hard to say without a look at the data itself (certainly the bin widths are broad enough to warrant small density values, per your explanation). However, I seem to recall that there were issues with hist in some relatively recent version of R. So maybe you can update to the latest version and try again?

Nick Sabbe
  • Updated my version of R, which doesn't seem to have fixed things. I've also posted the full data for one of the plots I'm trying to make. – Fomite Oct 19 '11 at 08:57
  • As noted elsewhere: the problem was that your step sizes were _not_ in the hundreds... – Nick Sabbe Oct 19 '11 at 09:30