
The function dnorm(x) in R gives you the value of the probability density function at the points x of a certain normal distribution (mean = 0 and SD = 1 by default), returning a vector of the same length as x.

However, I want to do the opposite: given a vector that approximates a probability density function (like the result of dnorm), I want to recover the mean and standard deviation of the normal distribution represented by that probability density. A code example of what I would like to do:

pdf = dnorm(seq(-3, 3, 0.1), mean = 0, sd = 1)
## get_normal is supposed to return a list/vector containing the mean and SD,
## which in this particular case should be close to 0 and 1 respectively.
var_parameters = get_normal(pdf)
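To make the goal concrete, here is a minimal sketch of what a hypothetical get_normal could look like if the evaluation points are passed in alongside the density values (the function name and signature are illustrative, not an existing R function), using the Riemann-sum idea discussed in the comments below:

```r
# Hypothetical sketch, NOT a real R function: recover mean and SD from
# density values, assuming the evaluation points x are known and equally
# spaced (so the grid step cancels out of the weighted averages).
get_normal = function(x, pdf) {
  mu     = sum(x * pdf) / sum(pdf)             # approximates E(X)
  sigma2 = sum((x - mu)^2 * pdf) / sum(pdf)    # approximates Var(X)
  c(mean = mu, sd = sqrt(sigma2))
}

x = seq(-3, 3, 0.1)
get_normal(x, dnorm(x))  # close to mean = 0, sd = 1 (sd slightly low
                         # because the grid truncates the tails at +/- 3)
```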
Nick Cox
BN-stats
  • Are the evaluation points known? If yes, then it's fairly trivial. If not, there is not much you can do. – Jarle Tufto Jul 18 '21 at 19:47
  • And could you elaborate on what you mean by "a vector that approximate a probability density function"? – Jarle Tufto Jul 18 '21 at 19:51
  • Check out "Riemann sums". – BigBendRegion Jul 18 '21 at 19:54
  • It sounds like you're asking about a [tag:maximum-likelihood] estimator. – Sycorax Jul 18 '21 at 19:59
  • @jarle-tufto Yes, the evaluation points are known. If you take a look at the code, pdf is a vector of evaluation points (in this specific case, evaluations of the density function f(x) of a Gaussian with mean = 0 and sd = 1, with x taking values from -3 to 3 in steps of 0.1). I wrote "a vector that approximates a probability density function" because the R vector pdf is discrete, and thus it is an approximation of the actual density function (whose parameters I don't know and want to find). If anything is still unclear, just tell me. Thanks! – BN-stats Jul 18 '21 at 20:05
  • @bigbendregion as far as I understand, Riemann sums are an integration method. However, I don't see how integrating can help me solve this problem. Sorry if I keep asking too much, I am fairly new in statistics and I am a bit confused at the moment. – BN-stats Jul 18 '21 at 20:07
  • The mean and variance are integrals that are easily estimated using your sequence and Riemann sums. – BigBendRegion Jul 18 '21 at 20:12
  • @bigbendregion Sorry for asking again. But how could I formulate such integrals? I still don't see how integration can help me solve the problem – BN-stats Jul 18 '21 at 20:21
  • Why do you say that the "vector _approximate_ a probability density function"? – Jarle Tufto Jul 18 '21 at 20:27
  • $E(X) = \int xp(x) dx$ – BigBendRegion Jul 18 '21 at 20:56
  • @jarleTufto As mentioned, I said that the vector approximates the density function because the vector is limited and discrete, whereas the function is continuous and infinite. The points of the vector are not approximations but actual evaluations of the function, but the vector (as a set) is an approximation given its finite and discrete nature. – BN-stats Jul 18 '21 at 21:44
  • @BigBendRegion So simple and yet I couldn't see it! – BN-stats Jul 18 '21 at 21:44
  • @BigBendRegion Thank you for the answer (my previous message got cut off). A bit offtopic, but ```integrate()``` in R seems to mess up the integration between -Inf and Inf when the mean is big and the sd small (try to integrate dnorm with mean = 60, sd = 1). I decided to integrate between ```mean-5sd``` and ```mean+5sd``` and it works just fine (although with a small error to take into account). – BN-stats Jul 18 '21 at 21:52
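The integrate() behaviour mentioned in the last comment can be reproduced and worked around as sketched below (the 5-SD window is the heuristic from the comment, not a general guarantee):

```r
# integrate() over (-Inf, Inf) can miss a narrow peak located far from 0,
# because the adaptive quadrature may never sample points near the peak.
f = function(x) x * dnorm(x, mean = 60, sd = 1)

# Workaround from the comment above: restrict the interval to mean +/- 5*sd,
# which captures essentially all of the mass of this density.
integrate(f, lower = 60 - 5 * 1, upper = 60 + 5 * 1)$value  # ~ 60
```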

2 Answers


For a normal density function $f,$ if you have a grid of points $x$ and corresponding density values $y = f(x),$ then you can use numerical integration to find $\mu$ and $\sigma.$ [See Note (2) at the end.]

If you have many realizations $X_i$ from the distribution, you can estimate the population mean $\mu$ by the sample mean $\bar X$ and the population SD $\sigma$ by the sample SD $S.$

Another possibility is to use a kernel density estimator (KDE) of $f$ based on a sufficiently large sample. In R the procedure density gives points $(x, y)$ that can be used to plot a density estimator.

set.seed(718)
x = rnorm(100, 50, 7)
mean(x);  sd(x)
[1] 50.62287
[1] 6.443036

hist(x, prob=T, col="skyblue2")
 rug(x);  lines(density(x), col="red")

[Figure: histogram of x with rug plot and the kernel density estimate overlaid in red]

In R, the KDE consists of 512 points with values summarized as below:

density(x)

Call:
        density.default(x = x)

Data: x (100 obs.);     Bandwidth 'bw' = 2.309

       x               y            
 Min.   :31.36   Min.   :1.974e-05  
 1st Qu.:41.69   1st Qu.:3.239e-03  
 Median :52.03   Median :2.371e-02  
 Mean   :52.03   Mean   :2.417e-02  
 3rd Qu.:62.36   3rd Qu.:4.378e-02  
 Max.   :72.70   Max.   :5.566e-02  

You can estimate $\mu$ and $\sigma$ corresponding to the KDE as follows:

xx = density(x)$x
yy = density(x)$y              # (xx, yy) are the KDE plot points
sum(xx*yy)/sum(yy)
[1] 50.62329                   # aprx pop mean = 50
sum((xx-50.62)^2 * yy)/sum(yy)
[1] 46.42294                   # aprx pop variance = 49
sqrt(sum((xx-50.62)^2 * yy)/sum(yy))
[1] 6.813438                   # aprx pop SD = 7

Because $\bar X$ and $S$ are sufficient statistics for $\mu$ and $\sigma,$ it is hard to imagine that $\hat \mu$ and $\hat \sigma$ recovered from a KDE (itself based on data) would be systematically better than the sample mean $\bar X = 50.62$ and SD $S = 6.44.$ I mention the KDE method because it seems possibly related to your question.

Notes: (1) Of course there are also methods for estimating the mean and SD from a histogram, but they can be very inaccurate for small samples.
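As an aside, the histogram method mentioned in Note (1) could be sketched like this (a grouped-data approximation using bin midpoints weighted by counts; my own sketch, not code from the answer):

```r
# Sketch: approximate mean and SD from a histogram, weighting each bin
# midpoint by its count. Grouping into bins loses information, so these
# estimates can be noticeably off for small samples.
set.seed(718)
x = rnorm(100, 50, 7)
h = hist(x, plot = FALSE)
mu.hist = sum(h$mids * h$counts) / sum(h$counts)
sd.hist = sqrt(sum((h$mids - mu.hist)^2 * h$counts) / sum(h$counts))
c(mu.hist, sd.hist)   # roughly 50 and 7, per the grouped-data approximation
```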

(2) Here is a numerical evaluation of $\mu \approx \int_0^{100} x\varphi(x,50,7)\, dx,$ using the sum of areas of 1000 rectangles.

m = 1000
w = (100-0)/m
x = seq(0+w/2, 100-w/2, len=m) 
f = x*dnorm(x, 50, 7)
sum(w*f)
[1] 50   # mu
f2 = (x-50)^2*dnorm(x,50,7)
sum(w*f2)
[1] 49   # sigma^2
BruceET
    Hi! I found your answer the most complete by far. I finally opted for just finding the mean and sd using integration in R or numerical integration (your last piece of code). Both work well, but the function ```integrate()``` in R seems to have problems integrating between -Inf and Inf when the mean is big and the sd small; you have to manually tune the integration interval (I set mean-5*sd and mean+5*sd). KDE seems more suitable when handling a sample instead of a grid of evaluation points of the function. Still, much appreciated! – BN-stats Jul 18 '21 at 21:56
  • Numerical integration for $\sigma^2$ added to Note (2). – BruceET Jul 18 '21 at 22:16

A very simple, general-purpose solution: first, write a function that takes parameters as input and returns the difference between the predicted PDF for those parameters and the actual PDF (I've used the sum of squared differences here). Then use optim() to find the parameters that minimise this function.

x = seq(-3,3,0.1)
pdf = dnorm(x, mean = -.5, sd = .2)
f = function(pars){
  pred_pdf = dnorm(x, mean = pars[1], sd = pars[2])
  sum((pdf - pred_pdf)^2)  # sum of squared differences
}
result = optim(c(0, 1), f) # c(0, 1) are initial values
round(result$par, 3)
# [1] -0.5  0.2
Eoin
  • This answer is particularly interesting to me, since I'm doing this in relation to an optimization and machine learning project! Specifically, I do this to find the parameters of a geometric mean of two normal density functions. I think I will start using the solution proposed by @BruceET, but yours seems perfectly valid for my case as well – BN-stats Jul 18 '21 at 22:00