
I am working on a spectroscopy project in which we scan the wavelength of a laser and record counts on a detector from laser-atom interactions. The data have the form $(\lambda, dt, dN)$, where $\lambda$ is the laser wavelength used in a given time interval, $dt$ is the length of that interval, and $dN$ is the number of events recorded in it.

I need to plot the event rate ($\frac{dN}{dt}$) against wavelength and fit it with a Voigt profile. The wavelength is scanned over a long range, but each individual wavelength is held only briefly, i.e. $dt$ is small, and the step between two consecutive wavelengths is small too. For example, one entry could be $(10000\ \mathrm{cm}^{-1}, 0.01\ \mathrm{s}, 2)$ and the next $(10000.1\ \mathrm{cm}^{-1}, 0.01\ \mathrm{s}, 3)$.

I need some help with how to do the fit properly and get a meaningful number for the peak of the Voigt profile. Given these numbers, it seems I need to re-bin the data along the frequency axis (I may use frequency, wavelength and wavenumber interchangeably; what I mean is the x axis, which in my case has units of $\mathrm{cm}^{-1}$).
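To make the re-binning concrete, here is a minimal sketch of what I am doing, with mock data in place of the real set (which I cannot share); within each bin I take the rate as total counts over total dwell time:

```python
import numpy as np

# Mock data: (wavenumber in cm^-1, dwell time in s, counts) per scan step.
# These arrays stand in for the real measurements, which are not public.
rng = np.random.default_rng(0)
wavenumber = np.arange(10000.0, 10010.0, 0.1)        # cm^-1
dt = np.full_like(wavenumber, 0.01)                  # s
dN = rng.poisson(5.0, size=wavenumber.size)          # counts per step

# Re-bin along the wavenumber axis: within each bin the rate estimate is
# (total counts) / (total dwell time), not the mean of per-step rates.
bin_width = 0.5                                      # cm^-1, one of several trial widths
edges = np.arange(wavenumber.min(), wavenumber.max() + bin_width, bin_width)
idx = np.digitize(wavenumber, edges) - 1

counts_per_bin = np.bincount(idx, weights=dN, minlength=edges.size - 1)
time_per_bin = np.bincount(idx, weights=dt, minlength=edges.size - 1)
centers = 0.5 * (edges[:-1] + edges[1:])

mask = time_per_bin > 0
rate = counts_per_bin[mask] / time_per_bin[mask]     # events per second
# Poisson error on each bin's rate: sqrt(N) / T.
rate_err = np.sqrt(counts_per_bin[mask]) / time_per_bin[mask]
```

Varying `bin_width` is what produces the family of slightly different fits I describe next.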

Is this a good thing to do? And how should I do the re-binning, given that I get slightly different results for each choice of bins? Right now I have the value of the peak for 15 different binnings, which are quite close yet not identical, for example $11001.5 \pm 0.2$ and $11001.4 \pm 0.3$, where the error is reported by the fitting program (I believe it is the standard error of the best-fit parameter, but I can check in more detail if needed; I use lmfit in Python).
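For context, the fit itself is essentially the following (a SciPy sketch with synthetic numbers; in practice I use lmfit's `VoigtModel`, which wraps the same profile):

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import voigt_profile

def voigt(x, amplitude, center, sigma, gamma, offset):
    """Voigt profile: Gaussian width sigma, Lorentzian HWHM gamma."""
    return amplitude * voigt_profile(x - center, sigma, gamma) + offset

# Synthetic binned rate data standing in for the real spectrum.
rng = np.random.default_rng(1)
x = np.linspace(11000.0, 11003.0, 60)                # cm^-1
y_true = voigt(x, 50.0, 11001.45, 0.15, 0.10, 2.0)
y = y_true + rng.normal(0.0, 0.5, x.size)
y_err = np.full_like(x, 0.5)

p0 = [40.0, 11001.5, 0.1, 0.1, 1.0]                  # rough initial guesses
popt, pcov = curve_fit(voigt, x, y, p0=p0, sigma=y_err, absolute_sigma=True)
center, center_err = popt[1], np.sqrt(pcov[1, 1])
print(f"peak = {center:.2f} +/- {center_err:.3f} cm^-1")
```

Repeating this for each binning gives one `(center, center_err)` pair per bin width.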

I was thinking of reporting the mean of these values, but I am not sure what to use for the error. The numbers are clearly not independent (i.e., the peak value obtained after doubling the bin size is not independent of the previous one), so I can't just use $\sigma/\sqrt{N}$ as the error on the mean.

Also, how should I take into account the error on each individual estimate (the $0.2$ and $0.3$ in the examples above)? Or should I try a totally different approach? Any suggestion would be greatly appreciated. Thank you!
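To make the question concrete, the naive recipe I could imagine looks like this (illustrative numbers; I do not know whether treating the binning-to-binning spread as a systematic term added in quadrature is actually justified, which is really what I am asking):

```python
import numpy as np

# Peak positions and fit errors from several trial binnings (illustrative values).
peaks = np.array([11001.5, 11001.4, 11001.45, 11001.5, 11001.4])
errs = np.array([0.2, 0.3, 0.25, 0.2, 0.3])

# The estimates are correlated (they come from the same events), so
# sigma/sqrt(N) on the mean would be too optimistic.  One possible
# convention: take the mean as the central value, use the typical
# single-fit error as the statistical part, and the spread across
# binnings as a separate "binning" systematic.
central = peaks.mean()
stat = errs.mean()                         # typical fit (statistical) error
syst = peaks.std(ddof=1)                   # binning-to-binning spread
total = np.hypot(stat, syst)               # quadrature sum
print(f"{central:.2f} +/- {stat:.2f} (stat) +/- {syst:.2f} (binning)")
```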

user260669
  • You are assuming knowledge many of us do not have. What is a Voigt profile, for example? – mkt Sep 24 '19 at 09:52
  • 1
    @mkt Voigt is a commonly-used peak equation used in spectroscopy, see https://en.wikipedia.org/wiki/Voigt_profile – James Phillips Sep 24 '19 at 10:33
  • Just a question: Why use a full fit to the curve when you are mainly interested in the peak location and height? Have you considered using a specific peak-finding algorithm, e.g. like this one: https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data – mzunhammer Sep 24 '19 at 12:02
  • 1
    Is your example of different peak location estimates typical? If so, you have no problem, because there is no material difference between $11001.5\pm0.2$ and $11001.4\pm0.3:$ they are indistinguishable within the errors given. This process of varying the bins approximates the process I recently described of varying density estimator bandwidths to explore modes (that is, peaks) of empirical densities: see https://stats.stackexchange.com/a/428083/919. Your study seems to be conducted in a similar spirit, so similar considerations might apply. – whuber Sep 24 '19 at 12:15
  • 1
    @mzunhammer I actually need more than just the position of the peak from the Voigt profile. I need all the parameters of the curve, so just the location wouldn't be enough. – user260669 Sep 24 '19 at 14:02
  • Would you please post a link to a single data set for analysis? – James Phillips Sep 24 '19 at 14:19
  • @whuber I am not sure I understand. Yes, they are consistent, but I still need to quote a mean value of the peak and an error associated with it, which comes from both the spread of the values and the error associated with each value. I am not sure how to do this properly. – user260669 Sep 24 '19 at 15:02
  • @JamesPhillips I am not really allowed to make the data available. I could make some mock data, but it is as simple as it sounds: I need to fit a curve to different binnings of the data and report a mean value and an error. – user260669 Sep 24 '19 at 15:04
  • If you make the bins either very wide or very narrow, that should give incorrect results - but does it make sense from the re-binning to use the one that reports the lowest error, or possibly the one with a peak value nearest to the mean of all the peak values? – James Phillips Sep 24 '19 at 16:49
  • @JamesPhillips I tried to keep the number of bins in a reasonable range, such that the results are consistent within the given errors. I was thinking, too, of using just one binning (the one with the lowest error might be a good idea, thank you!), but I am not sure if that is the right way to do it. Binning means that you lose some information, which is reflected in the fact that you get slightly different values for different binnings. Would just one binning give a trustworthy result? Should I somehow add the bin width to the error on the value I get if I use just one binning? – user260669 Sep 24 '19 at 22:38
  • If you think that bin width information is useful to an understanding of the work, then it is good to include it. Including the methodology for bin width determination itself might not be a relevant item in conveying an understanding of the work. If you are going to present a plot of the binned data, the bin width would seem relevant. – James Phillips Sep 25 '19 at 01:34

0 Answers