
I have a series of data in the range [0, 1], and I'm trying to fit these data with a beta distribution. Given an extra point in the same range, I would like to obtain its significance with respect to the fitted distribution. With a normal distribution I would apply a two-tailed test, but I'm not sure whether the same approach can be applied to this kind of distribution. This is what I tried in Python:

import scipy.stats as st
import numpy as np

#Value to test
point = 0.5
x = [0.9049999872195,0.751451986627403,0.885290928476777,0.6713085208778,1.0,0.728503282739818,0.807836457809477,0.948326551429743,0.761895930888466,0.961916512107549,0.760188471478065,0.999999947970658,0.891089327026482,0.820396818315191,1.0,0.969559516790842,0.734091364087646,0.875999064916492,0.630937129516064,0.721923452996229,0.907952613526714,0.924912013919574,1.0,0.965971286756229,0.80080862326015,1.0,0.94387279708007,0.766298274737271,0.929777024978723,0.836687729677749,0.820888383321854,0.858704607859888,0.819976719678168,0.91902420209557,0.869881745589804,0.877648363210562,0.969398371648378,0.724543951559856,0.817131742715298,1.0,0.799507714098586,0.998217337576329,0.981524189282259,0.879450908089092,0.786080944971998,0.693436342294618,0.684130194431997,1.0,0.774809207875134,1.0,1.0,0.981714858151415,1.0,0.961000528021392,1.0,0.94205960795591,0.981522644025388,0.986824135526516,0.985527772245646,1.0,0.976412252360526,0.96921614404309,0.92182734457027,1.0,0.803233791242115,0.672799844574505,1.0,0.689429090244783,0.925156258646197,0.912523732052394,1.0]

#Obtain the alpha and beta for modeling the beta distribution
mean = np.mean(x)
var = np.var(x, ddof=1)
alpha = mean ** 2 * (1 - mean) / var - mean
beta = alpha * (1 - mean) / mean

#Get the minimum p_value between the two tails and multiply by 2
p_value1 = st.beta.cdf(float(point), alpha, beta)
p_value2 = 1-float(st.beta.cdf(float(point), alpha, beta))
p_value = min(p_value1,p_value2)*2

I took the idea for calculating the p-value from this post: P-value in a two-tail test with asymmetric null distribution.

Maybe I misunderstood it, so any help would be greatly appreciated. Thanks to the community.

EDIT: To clarify what I need, here is another example. Let's generate some random beta-distributed data:

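import numpy
import matplotlib.pyplot as plt
from scipy.special import gamma as gammaf
from scipy.stats import beta

# Simulate 200 draws from Beta(10, 4)
data=beta.rvs(10,4,loc=0,scale=1,size=200)
# Method-of-moments estimates of the two shape parameters
mean=numpy.mean(data)
var=numpy.var(data,ddof=1)
alpha1=mean**2*(1-mean)/var-mean
beta1=alpha1*(1-mean)/mean
# Histogram normalized to unit area, with the fitted density overlaid in red
plt.hist(data,bins=30,density=True)
pdf=lambda x,a,b:gammaf(a+b)/gammaf(a)/gammaf(b)*x**(a-1)*(1-x)**(b-1)
aux=numpy.linspace(0,max(data),len(data))
plt.plot(aux,pdf(aux,alpha1,beta1),'r')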

We obtain the following plot:

[Figure: histogram of the simulated data with the fitted beta density overlaid in red]

I would like to know whether my point of interest is significantly greater or less than my beta-distributed values, that is, whether the tested point falls into either of the critical regions of the curve.
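
To make the "critical regions" idea concrete, here is a minimal sketch of what I have in mind. It assumes the same method-of-moments fit as above and uses `scipy.stats.beta.ppf` to get the 2.5% and 97.5% quantiles of the fitted distribution. The helper names `beta_mom` and `in_critical_region` are just illustrative, and the fitted parameters are treated as if they were known exactly.

```python
import numpy as np
from scipy import stats


def beta_mom(sample):
    """Method-of-moments estimates of the Beta shape parameters."""
    m = np.mean(sample)
    v = np.var(sample, ddof=1)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def in_critical_region(point, a, b, level=0.05):
    """Return (lower, upper, flag); flag is True when point lies
    outside the central 1 - level mass of Beta(a, b)."""
    lower, upper = stats.beta.ppf([level / 2, 1 - level / 2], a, b)
    return lower, upper, bool(point < lower or point > upper)


# Simulated data, as in the example above (seed chosen arbitrarily)
rng = np.random.default_rng(0)
data = rng.beta(10, 4, size=200)

a_hat, b_hat = beta_mom(data)
lower, upper, flagged = in_critical_region(0.5, a_hat, b_hat)
print(f"central 95% region: [{lower:.3f}, {upper:.3f}], flagged: {flagged}")
```

Flagging a point this way is equivalent to checking whether `2 * min(cdf, 1 - cdf)` is below the chosen level, so it is the same calculation as in my first snippet, expressed through quantiles. It does not account for the uncertainty in the estimated alpha and beta.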

Thanks again for your time.

JLT
  • What do you mean by "significance in respect of distribution"? – Tim Jul 18 '17 at 10:55
  • I mean, getting a p-value for the given point, using the cumulative distribution function – JLT Jul 18 '17 at 11:00
  • What is the hypothesis that you're testing? – Tim Jul 18 '17 at 11:06
  • I just edited my question. I hope it clarifies what I would like to test – JLT Jul 18 '17 at 12:37
  • What is the hypothesis you are trying to test? I can't see any hypothesis in here... This has nothing to do with p-values, you seem to be asking about [quantiles](https://stats.stackexchange.com/questions/212813/help-me-understand-the-quantile-inverse-cdf-function/212828#212828) of beta distribution or something similar, but not p-values. – Tim Jul 18 '17 at 12:42
  • Maybe I'm mixing up concepts. I thought that, in a CDF, the value at the 95th percentile would correspond to a p-value of 0.05 in a one-sided, right-tailed test. Is that not the case? – JLT Jul 18 '17 at 13:08
  • This sounds like you are asking how to compute a *prediction interval* for a single future value based on a random sample from a Beta distribution. It will be wider than the middle $1-\alpha$ portion of the fitted Beta distribution because the parameters of the fitted distribution are uncertain. Rigorously, given iid Beta$(a,b)$ random variables $X_0,X_1,\ldots,X_n$ ($a$ and $b$ unknown) and a number $\alpha$, you seem to be asking for formulas $u$ and $l$ for which $$\Pr(l(X_1,\ldots,X_n)\le X_0 \le u(X_1,\ldots,X_n))=1-\alpha.$$ Would this interpretation be correct? – whuber Jul 18 '17 at 13:18
  • @whuber Yes, thanks for the answer. I think it makes sense. How would you do this? – JLT Jul 18 '17 at 14:02
  • It's probably going to be intractable to work out a theoretical formula, so I would be inclined to bootstrap it or use a Bayesian prediction interval as a substitute. – whuber Jul 18 '17 at 15:05
  • @whuber thanks again for your responses. As you suggested, I was thinking of using Bayesian inference on my data to find the distribution of the alpha and beta values that fit. Would it make sense to use the 2.5% and 97.5% quantiles of these alpha and beta values to obtain lower and upper distributions, and then check whether my point of interest falls outside them? – JLT Aug 07 '17 at 17:31
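
The bootstrap idea raised in the comments could be sketched roughly as follows. This is only one possible reading of it, not a method confirmed by anyone in the thread: resample the data with replacement, refit the beta distribution by the method of moments, draw one future value from each refitted distribution, and take quantiles of those draws. The function names `beta_mom` and `bootstrap_prediction_interval` are hypothetical, and the resulting coverage is only approximate.

```python
import numpy as np


def beta_mom(sample):
    """Method-of-moments estimates of the Beta shape parameters."""
    m = np.mean(sample)
    v = np.var(sample, ddof=1)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def bootstrap_prediction_interval(sample, level=0.05, n_boot=2000, seed=0):
    """Approximate prediction interval for one future observation.

    Assumes every resample has nonzero variance and yields positive
    shape estimates, which holds for typical continuous data in (0, 1).
    """
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    draws = np.empty(n_boot)
    for i in range(n_boot):
        # Refit on a resample so parameter uncertainty is propagated
        resample = rng.choice(sample, size=sample.size, replace=True)
        a, b = beta_mom(resample)
        # One simulated future value from the refitted distribution
        draws[i] = rng.beta(a, b)
    return np.quantile(draws, [level / 2, 1 - level / 2])


# Simulated data standing in for the real sample (seed chosen arbitrarily)
rng = np.random.default_rng(1)
data = rng.beta(10, 4, size=200)
lo, hi = bootstrap_prediction_interval(data)
print(f"approximate 95% prediction interval: [{lo:.3f}, {hi:.3f}]")
```

A new point would then be flagged if it falls outside `[lo, hi]`. Because each draw comes from a refitted distribution, the interval reflects the uncertainty in alpha and beta as well as the spread of the beta distribution itself.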
