
I have a problem related to probability distributions and parameter estimation, which comes from a real case. I would be very grateful if you could help me.

Let us suppose that we have a continuous amount $M$ of a given product in which the proportion of a certain target component is $p$, where $p$ is supposed to be fixed and unknown. For example, $M=300$ kilograms of product, and $p$ is the proportion of water in it.

We can assume that the target component is sort of randomly distributed within the product.

We are interested in estimating $p$. With that purpose in mind, we randomly take a sample of a fixed and known amount $m$ of product. The value of $m$ is quite small compared with $M$, but I do not really know whether or not it is small enough to actually assume that $M$ is somehow equivalent to $+\infty$.

In our real case, we randomly take $m=15$ kilograms of product within an amount of $M=300$ kg of product.

We measure the amount of the target component in these $m$ kilograms of product, and we calculate an estimate of $p$ as $\hat{p}= \text{Quantity of target component in the sample}/m$.

Ideally, $\hat{p}$ would always be equal to $p$ (for instance, if we were dealing with a perfectly homogeneous liquid or low-density product; this statement is probably not perfectly expressed, as I am not a chemist). However, in practice, when the product is solid and composed of different solid elements with different weights, sizes and so on, but can still be considered continuous (I mean, it is not measured in discrete units), $\hat{p}$ is not necessarily equal to $p$. I hope that is clear enough.

My question is: In this context, what is the probability distribution of $\hat{p}$?

Or, equivalently, how can I calculate the probability $\Pr(\hat{p}\le c \;|\; p)$?

If this were a discrete case, I know that $p$ would be the probability parameter of a binomial distribution (assuming $M$ is large enough). But I do not know how to deal with this continuous proportion case. Could you please give me some help?

EDIT [April 27th, 2015]:

As I've already said in the comments below, the point is: what (from a physical/technical point of view) prevents the observed proportion $\hat{p}$ in the sample of $m$ kilograms from always being equal to the real one, $p$, and how does it do so? I do not have a clear answer for that.

The concrete context in which this problem arises is the following: we have a huge amount of paper or paperboard (in relatively small pieces) that has been selected from urban waste. That amount of paper contains foreign elements that cannot actually be treated as paper or paperboard.

We select 15 kg from that big package of paper and measure the amount of foreign elements in the sample. We use this to estimate the real proportion of foreign elements in the big package. The selection of those 15 kg is made as randomly as possible.

Of course, there are several sources of variability in this process, both in the waste separation process (the process that recovers paper from the urban waste) and in the sample selection. That is why I am not addressing here any data modelling problem, but just looking for a reasonable way to theoretically determine the sampling distribution of $\hat{p}$.

Even if the selection of the 15 kg were perfectly random, the fact is that the foreign elements are distributed in such a way that not every sampled amount of 15 kg contains exactly the same quantity of them. Why...? I'll try to think about this.


EDIT:

According to the description of the 'proportion' tag on this site, my question could be related to the beta distribution. However, I am still not sure whether the situation I have described meets the assumptions of a beta model, if any.

EDIT:

Should I actually post this question on https://math.stackexchange.com/ instead?


EDIT [April 24th, 2015]:

Based on a comment by @sesqu in this other thread, I deduced that

$\hat{p} \sim \mathrm{Beta}(m p + 1, m(1-p)+1)$,

where (just to summarize)

  • $p$ is the real (fixed and unknown) proportion of a certain target component in a given, effectively infinite, amount of product,

  • $m$ is the sample quantity of product that we randomly extract from the total amount of product in order to estimate $p$,

  • $\hat{p}$ is an estimate of the real proportion $p$, calculated as $\hat{p}= \text{Quantity of target component in the sample}/m$.

Does it make sense?

EDIT [April 26th, 2015]:

As @Scortchi pointed out in a comment to this post, the previous formula seems not to make sense, as it depends on the units in which $m$ is measured.
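To see the objection concretely, here is a minimal Python sketch (assuming, purely for illustration, a true proportion $p=0.10$, and the same physical 15 kg sample with $m$ expressed either in kilograms or in grams): under the proposed formula, the spread of $\hat{p}$ changes drastically with the choice of units, which the sampling distribution of a dimensionless proportion should not do.

```python
import numpy as np
from scipy import stats

p = 0.10  # illustrative "true" proportion (an assumption, not data)

# Proposed distribution: p_hat ~ Beta(m*p + 1, m*(1-p) + 1)
# The same 15 kg sample, with m expressed in kilograms and then in grams
for m, unit in [(15, "kg"), (15_000, "g")]:
    dist = stats.beta(m * p + 1, m * (1 - p) + 1)
    print(f"m = {m:>6} {unit}: sd(p_hat) = {dist.std():.4f}, "
          f"Pr(p_hat <= 0.12) = {dist.cdf(0.12):.3f}")
```

The standard deviation shrinks by roughly a factor of $\sqrt{1000}$ just by switching from kilograms to grams, so, as noted above, the formula cannot be right as it stands.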


EDIT [April 26th, 2015]:

Although I obviously have real data, I would like to point out that they very likely come from a mixture of populations. The study of that mixture still has to be done, maybe with ANOVA. But, regardless of the results that may arise from the ANOVA, the fact is that the real proportion $p$ is quite unstable. Therefore, it is very difficult and unreliable to fit a distribution based on our data.

That is why I want to try a different approach, related just with calculating sort of control limits based on probability theory.

I thought that it was possible to deduce the sampling distribution of $\hat{p}$, assuming $p$ is constant, and I thought it could be somehow related to the beta distribution, as I see my case as a kind of generalization of a discrete proportion (binomial). That is the reason for my question.

I've read all the comments in this thread so far, but I'm still waiting for more contributions.


EDIT [April 27th, 2015]:

As far as I have understood from different sources, the beta distribution is mostly used to model the behaviour of the probability of a certain event using prior experience or knowledge (for instance, I liked the explanation in "What is the intuition behind beta distribution?" very much). It is also somehow related to the Bayesian approach, in the sense that the probability being studied and modelled is the underlying $p$, the one in the population, we can say.

I think my question is a little bit different, in the sense that I am considering the population target probability $p$ as a constant (classical statistics approach) and looking to model the sampling distribution of the statistic $\hat{p}$.

I do not mean that I now think the beta distribution is not the solution; I just mean that maybe my problem is not related to the typical use of the beta distribution.

Also, as @Wolfgang pointed out, I think my problem has to do with compositional data. But, how does it help?

Going back to the beta distribution, the only new idea I have about how to model the probability distribution of $\hat{p}$ (in case we suppose it is beta-distributed) is to assume that its mode (or maybe its mean) equals $p$...

As you can see, I am still looking for a way to theoretically deduce the sampling distribution of $\hat{p}$, just as it is done with other statistics such as $\bar{x}$ and so on in classical statistics.


EDIT [April 27th, 2015]:

I've been thinking again about the possibility of treating this as a binomial distribution, in the sense that I am somehow counting how many units are correct within a (randomly and independently selected) set of $n$ units, where each unit has the same probability $p$ of being correct, but with an uncountably infinite number of sampled units.

I have also discovered that the CDF of a binomial distribution can be expressed as the regularized incomplete beta function. More precisely, if I am not wrong,

$F_X(x \;|\; n,p)= I_{(1-p)}(n-x,1+x)$,

where $F_X$ stands for the CDF of a binomial variable with parameters $n$ and $p$, and $I_z(a,b)$ represents the regularized incomplete beta function.

Would it make sense to try to calculate or estimate $\lim_{n \rightarrow +\infty}{F_X(x \;|\; n,p)}$, that is, $\lim_{n \rightarrow +\infty}{I_{(1-p)}(n-x,1+x)}$?
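For what it's worth, the identity between the binomial CDF and the regularized incomplete beta function is easy to check numerically; here is a minimal SciPy sketch (the values of $n$ and $p$ are arbitrary illustrative choices):

```python
import numpy as np
from scipy import stats, special

n, p = 20, 0.3            # arbitrary illustrative parameters
x = np.arange(n)          # x = 0, ..., n-1 (the identity needs n - x > 0)

binom_cdf = stats.binom.cdf(x, n, p)               # F_X(x | n, p)
beta_form = special.betainc(n - x, 1 + x, 1 - p)   # I_{1-p}(n - x, 1 + x)

print(np.max(np.abs(binom_cdf - beta_form)))       # agreement to machine precision
```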

EDIT [April 27th, 2015]:

I have plotted $I_{(1-p)}(n(1-c),\,1+nc)$ for large values of $n$ and it tends to what was easy to guess: a CDF in which 100% of the probability is concentrated at $c=p$. In other words, as the sample size $n$ tends to infinity, the observed proportion $\hat{p}$ tends to equal $p$. Sorry for suggesting this.
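For the record, a minimal sketch of that computation (the values of $p$ and $n$ below are arbitrary; for integer $nc$, $I_{(1-p)}(n(1-c),\,1+nc)$ is exactly $\Pr(\hat{p}\le c)$ for a binomial sample of size $n$, and for non-integer $nc$ it interpolates between those values):

```python
import numpy as np
from scipy import special

p = 0.10                               # illustrative "true" proportion
c = np.linspace(0.001, 0.999, 999)     # grid of thresholds for p_hat

for n in (50, 500, 5000, 50_000):
    cdf = special.betainc(n * (1 - c), 1 + n * c, 1 - p)   # ~ Pr(p_hat <= c)
    # how wide is the region over which the CDF climbs from 5% to 95%?
    width = c[cdf >= 0.95][0] - c[cdf >= 0.05][0]
    print(f"n = {n:>6}: CDF rises from 0.05 to 0.95 over a c-interval of {width:.4f}")
```

The interval shrinks roughly like $1/\sqrt{n}$, i.e. the distribution degenerates at $c=p$ as $n\to\infty$, which is just the behaviour described above.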

Vicent
  • I suppose you measure the proportion in some number (say $n$) of small samples, giving variables $X_1, \dots, X_n$? Can you give us a histogram (or dot plots) of those values? – kjetil b halvorsen Apr 23 '15 at 17:30
  • @kjetilbhalvorsen, I am more interested in dealing with this problem from a theoretical point of view. The situation is that the process the data come from is quite unstable (using quality control terms), which means that the real proportion $p$ can change, so when plotting data I cannot actually assume that they come from the same population. I just want to make the assumption that "if $p$ was constant, then..." and try to deduce some theoretical properties of $\hat{p}$. – Vicent Apr 23 '15 at 17:42
  • OK, but how many small samples do you take? Sampling in a way that helps ensure independence? The same volume of stuff in all small samples, or do they vary? – kjetil b halvorsen Apr 23 '15 at 17:45
  • @kjetilbhalvorsen, the volume of stuff $m$ is intended to be constant in every sample we take. – Vicent Apr 23 '15 at 17:52
  • Then you can estimate the proportion by the arithmetic mean of the sample proportions, and you could construct a confidence interval by bootstrapping, for example. I don't think this has anything to do with the beta distribution. – kjetil b halvorsen Apr 23 '15 at 17:54
  • @kjetilbhalvorsen, I am interested in a theoretical approach, but thank you for your suggestion. – Vicent Apr 23 '15 at 18:14
  • There's no general answer to how $\hat{p}$ has to be distributed - why shouldn't you be able to estimate the proportion as accurately in samples of 1 mg as of 1 kg? The analogy with the binomial distribution is perhaps misleading - the b.d. arises from particular assumptions about the data-generating process, viz a fixed number of independent Bernoulli trials. – Scortchi - Reinstate Monica Apr 23 '15 at 18:23
  • Your latest edit: No, this question belongs here, not on math SE. You want a theoretical discussion, nice, but I don't think there is much here to go on. You just have a real parameter; that it can be interpreted as a proportion is not very relevant, and the analogy with the binomial distribution is misleading. – kjetil b halvorsen Apr 23 '15 at 18:26
  • Well, I have seen in some references that binomial and beta distributions are somehow related. Why can't the discrete version of my case be seen as a binomial case? – Vicent Apr 23 '15 at 18:35
  • There isn't a "discrete version of your case". Your case is not discrete. That analogy is false. – kjetil b halvorsen Apr 23 '15 at 22:24
  • @kjetilbhalvorsen, actually, in my humble opinion, I think there *is*. Imagine that I randomly pick $m$ units from a population of a nearly infinite number of units, the proportion of *correct* units in the population being $p$. Then the number of correct units $X$ in the $m$-sized sample can be modelled as a binomial random variable (if some assumptions hold). This is what I mean when I refer to the *discrete* version of my case. – Vicent Apr 23 '15 at 22:33
  • Well, but that case is where for each of the samples, either it is all water or it is no water. I don't think that is close to your case? I doubt that analogy is useful. And we have no basis to choose some family of distributions for your case, as you didn't tell us anything about the measurement process. It would be more useful for you to do that. – kjetil b halvorsen Apr 23 '15 at 22:56
  • Just a quick interjection: what the OP is asking about falls under the general topic of compositional data analysis. See, for example: http://en.m.wikipedia.org/wiki/Compositional_data – Wolfgang Apr 24 '15 at 12:04
  • Your proposed distribution for $\hat{p}$ seems absurd *prima facie* because it depends on the arbitrary choice of units in which to measure mass. How did you deduce it, & what assumptions did you make about the data-generating process? IMO you need to either clarify the theoretical situation you want to consider or provide details of the "real case" you mention - the beta distribution certainly can be useful for empirically modelling continuous proportions, because of its bounds & its flexibility, though for a simple mean estimate the approach suggested by @kjetilbhalvorsen might be preferable – Scortchi - Reinstate Monica Apr 24 '15 at 15:15
  • @Scortchi, I added more information directly in my question as an answer to your comment. – Vicent Apr 26 '15 at 09:39
  • As comments to date all state or imply, there is no model for your case without formal assumptions about the data generation process that let someone write down some algebra for the probability distribution; I don't think you have grounds for saying anything much except that you have a proportion bounded by 0 and 1. Saying that you have a mixture doesn't push you forward. – Nick Cox Apr 26 '15 at 09:44
  • @NickCox, thank you for your comment. What *'formal assumptions about the data generation process'* would be needed? I could provide more information or try to state more assumptions. My intuition was actually that there is some information missing. But, at the same time, the fact is that the underlying *real* proportion $p$ is all I have, and as it is enough for the (binomial) *discrete* case, I thought it would also be enough here. – Vicent Apr 26 '15 at 09:49
  • You'd need to have some algebraic structure that matches the physical processes involved. I have no ideas on what that might be. Wanting a formal method here doesn't produce it. As already pointed out, the parallels with the binomial situation do not exist in a way that helps you. – Nick Cox Apr 26 '15 at 10:21
  • @NickCox, thank you again. So, should I provide more theoretical information/assumptions about how each sampled quantity of product is supposed to be collected, for instance? – Vicent Apr 26 '15 at 11:08
  • You can add more information, but more qualitative information seems unlikely to make your question more precise. – Nick Cox Apr 27 '15 at 08:35
  • @NickCox, thank you. I think I understand what you mean. The main point is: what (from a physical/technical point of view) makes the observed proportion $\hat{p}$ in the sample of $m$ kilograms differ from the *real* one $p$, and *how* does it do so? I don't know if I have an answer to that. – Vicent Apr 27 '15 at 09:02
  • That's exactly the point. And if you don't make assumptions as to why the observed proportions differ from sample to sample, you can't say anything about their distribution. It might be that the samples have identical compositions but the measurement process - e.g. dissolving the sample, adding a precipitant, then drying & weighing the precipitant - introduces variability into the results, in which case it's entirely beside the point to be thinking about binomial sampling involving large numbers of particles counted with perfect accuracy. – Scortchi - Reinstate Monica Apr 27 '15 at 12:33
  • If you put all the material in a giant blender with good mixing of the tiny dust-sized particles that result, then your 15 kg sample will be highly accurate. However, if the material hasn't been blended well, so some days it's all paper but other days there's a lot of metal, then your 15 kg sample is probably a horrible estimate. Your problem description says "We can assume that the target component is sort of randomly distributed within the product" but that is pretty vague -- you need to know how good your blender is. If you have a lot of 15 kg samples, then their variance can tell you this. – Matt Sep 06 '16 at 16:14

1 Answer


If the water content is homogeneous in the 300 kg product, then there is no variance and the measured water content applies to the whole 300 kg product. If the water content is not homogeneous, a single 15 kg sample taken from one place tells you nothing about the variance over the entire product.

If the distribution of water is random across the product, you could take multiple samples, say fifteen 1 kg samples ($n=15$) from different parts of the product, which we now regard as $N=300$ portions of 1 kg each. Measure the percent water in each sample, compute their mean and standard deviation $s$, and compute the standard deviation of the sampling distribution of the mean as $(s/\sqrt{n})\,\mathrm{FPF}$, where $\mathrm{FPF}$, the finite population factor, is $\sqrt{(N-n)/(N-1)}$ and $N$, the finite population size, is 300 portions of 1 kg.
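A minimal numeric sketch of that calculation (the fifteen percent-water values below are made-up illustrative numbers, not real measurements):

```python
import numpy as np

# Hypothetical percent-water measurements from n = 15 randomly chosen 1 kg portions
water_pct = np.array([9.8, 10.4, 11.1, 9.5, 10.0, 10.7, 9.9, 10.2,
                      10.9, 9.6, 10.3, 10.1, 9.7, 10.5, 10.0])

n = len(water_pct)                   # number of 1 kg samples drawn
N = 300                              # population: 300 portions of 1 kg

mean = water_pct.mean()              # estimate of the overall percent water
s = water_pct.std(ddof=1)            # sample standard deviation
fpf = np.sqrt((N - n) / (N - 1))     # finite population factor
se = s / np.sqrt(n) * fpf            # sd of the sampling distribution of the mean

print(f"mean = {mean:.2f}%,  s = {s:.3f},  se (with FPF) = {se:.3f}")
```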

If the water content is not homogeneous but patterned, as in fat in a hog carcass, then the mean water content can be estimated from the water content of a single sample taken from a specific location and the known pattern.

Nick Cox