2

I have the following plot:

A histogram with apparently overlapping circular normal distributions

Overlapping circular normal distributions

I would like to estimate the means and standard deviations of the apparent overlapping normal distributions. This is slightly complicated by the fact that since the data is based on hour of the day, it also is circular -- the right end of the tail(s) leak into the left end.

How do I handle this?

2 Answers2

2

I'd like to thank Adrian Keister and Robert Dodier for the start and the GitHub project provided by Emily Grace Ripka: Peak fitting Jupyter notebook

I was able to approximate the two different overlapped distributions with von Mises distributions and then optimized the predictions to minimize the error by selecting the mean and kappa (equivalent to the standard deviation of a von Mises distribution).

I was able to accomplish this with the SciPy Python module classes: scipy.stats.vonmises and scipy.optimize.curve_fit

I created the following two helper functions:

def two_von_mises(x, amp1, cen1, kappa1, amp2, cen2, kappa2):
    return (amp1 * vonmises.pdf(x-cen1, kappa1)) + \
           (amp2 * vonmises.pdf(x-cen2, kappa2))

def one_von_mises(x, amp, cen, kappa):
    return amp * vonmises.pdf(x-cen, kappa)

I needed to convert the time of day to an interval range from -pi <= {time of day} < pi, like so:

hourly_df['Angle'] = ((two_pi * hourly_df['HourOfDay']) / 24) - np.pi

I was then able to use the curve_fit function of the scipy.optimize module like so:

popt, pcov = curve_fit(two_von_mises, hourly_df['Angle'], hourly_df['Count'], p0 = [1, 11, 1, 1, 18, 1])

From this I got all the estimates of the parameters for the two distributions (from the popt variable above):

array([1.66877995e+04, 2.03310292e+01, 2.03941267e+00, 3.61717300e+04,
       2.46426705e+01, 1.32666704e+00])

Plotting this we see: Data with superimposed von Mises pdf graphed The next steps will be to see if we can determine what distribution a query belongs to based on categorical data collected for each query, but that is another story...

Thanks!

1

Partial solution

Here's what I would try first:

  1. Copy the data and paste on either side so that you have at least one full cycle with unbroken peaks.
  2. Find the max point, and call that the mean of one of the distributions.
  3. From the max point, decide which direction the other distribution is relative to it. In the picture above, the second distribution is to the right of the first one, because the falling curve is monotonic to the left, but not to the right. Another way to do this is to take the data about the max point, and "fold it" on itself: whichever direction is further away from the max, on average, is the side containing the second distribution.
  4. In the direction opposite the second distribution, find the inflection point.
  5. Use the horizontal distance from the mean to the inflection point as an approximation of the standard deviation. I realize this is a circular normal, not a regular normal, but this estimate should still be fairly good.
  6. Use the equation $$\sigma^2=1-\frac{I_1(\kappa)}{I_0(\kappa)}$$ to solve for $\kappa$ numerically.
  7. Now that the first distribution is fully known, subtract it from the full dataset. What remains should be mostly due to the second distribution. Repeat steps 4-6 (mostly) on the second distribution.

This is a partial solution, mainly because I am unsure how good an approximation Step 5 is for the standard deviation. You could try plotting a few known circular normals and their inflection points to compare, and maybe that would offer a correction that would enable you to get a better approximation.

Hopefully, this procedure will at least nudge you into a profitable direction.

Adrian Keister
  • 3,664
  • 5
  • 18
  • 35