
My geographical zone $A$ is subdivided into $k$ different types of areas: $A_1 + A_2 + \dots{} + A_k = A$. These have been measured on a map with negligible uncertainty, i.e. for any point on the map, it is unambiguous whether the point lies on type $1, 2, \dots{}$ or $k$.

In order to "sample from $A$", a third party has defined a clear-cut sub-area $B \subset A$. So I can also measure all intersections $(B_1=B\cap A_1) + (B_2 = B \cap A_2) + \dots{} + (B_k=B\cap A_k) = B$ with very little uncertainty.

In the end, my data looks like

A = 150m²
B/A = 60%

type  | A_i/A | B_i/B
1     |  1.2% |  1.0%
2     |  0.5% |  0.7%
...
k     |  7.5% |  8.9%

So I have a vector $D_A = (\frac{A_1}{A}, \dots{}, \frac{A_k}{A})$ that represents the overall distribution of the various types of areas in my geographical zone, while another vector $D_B = (\frac{B_1}{B}, \dots{}, \frac{B_k}{B})$ represents the "sampled" distribution of these types in the sub-area $B$.
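For concreteness, here is a minimal sketch of how the two vectors could be computed from measured areas. The three area values per zone are hypothetical (chosen only so that $A = 150$ m² and $B/A = 60\%$), and the helper names `type_distribution` and `total_variation` are illustrative. The total variation distance is just one possible descriptive summary of how far $D_B$ is from $D_A$; it is not a significance test and does not by itself answer the question below.

```python
def type_distribution(areas):
    """Return each type's share of the total area (shares sum to 1)."""
    if any(a < 0 for a in areas):
        raise ValueError("areas must be non-negative")
    total = sum(areas)
    if total <= 0:
        raise ValueError("total area must be positive")
    return [a / total for a in areas]


def total_variation(p, q):
    """Half the sum of absolute differences between two share vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical areas in m^2 for k = 3 types (not the real data).
A_areas = [30.0, 45.0, 75.0]   # A_1, A_2, A_3, summing to A = 150
B_areas = [15.0, 30.0, 45.0]   # B_1, B_2, B_3, summing to B = 90 (60% of A)

# Since B is a subset of A, each B_i cannot exceed the matching A_i.
assert all(b <= a for a, b in zip(A_areas, B_areas))

D_A = type_distribution(A_areas)
D_B = type_distribution(B_areas)
tvd = total_variation(D_A, D_B)

print("D_A =", [round(x, 4) for x in D_A])
print("D_B =", [round(x, 4) for x in D_B])
print("total variation distance =", round(tvd, 4))
```

A distance of 0 would mean the sub-area reproduces the type shares exactly; what threshold counts as "close enough" is precisely what remains undefined here.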

Now, I am in charge of deciding whether or not $B$ is a good sample, i.e. whether it is representative of the various area types in $A$.

Therefore, I suspect that my statistical mission is to compare $D_A$ and $D_B$ so as to answer the question: Is there a significant difference between $D_A$ and $D_B$?

My problem is that I am not sure what to compare $(D_A - D_B)$ against, because they do not differ due to some "experimental randomness". In fact, they differ because of complicated constraints that the third party worked around as best they could while defining $B$, and I am not even aware of what those constraints are.

So, do I have enough data to answer the question?
If yes, what is the right comparison method in this case?
If not, what can I use as a significance criterion?

iago-lito
  • Could you explain what "types distributions" are? Indeed, there are no numerical quantities in evidence apart from the percentages (which presumably are fractions of the total area) and *by construction* $B$ differs from $A$ (its values will be consistently smaller). – whuber May 18 '20 at 13:40
  • @whuber Sure. I call "types distributions" the percentages that are indeed the only numerical quantities in evidence, *i.e.* $(A_1/A, A_2/A, \dots{}, A_k/A)$ and $(B_1/B, B_2/B, \dots{}, B_k/B)$. I wish to know whether those two vectors look statistically alike, but I am unsure what to compare their difference with. I'll edit the post to make it clearer. – iago-lito May 18 '20 at 13:51
  • Could you explain what you mean by "statistically alike"? This is an important consideration for developing any answer, especially because (so far) there is no indication of anything that could be modeled as random or uncertain. – whuber May 18 '20 at 13:56
  • @whuber I understand. I have reformulated the question to focus on this elusive source of uncertainty, which lies at the heart of my questioning. – iago-lito May 18 '20 at 14:15
  • Thank you for the ongoing improvements--the nature of the question is becoming apparent. It looks, though, like information about how the areas are sampled and what measurement(s) are made in each area will be important for developing an answer, because we still don't have information about how and why the distribution of $B$ could differ from the distribution of $A.$ – whuber May 18 '20 at 14:19
  • @whuber Yes. The thing is that I don't know why they differ either, which may be the very source of my problem. I have reformulated again to make it explicit. – iago-lito May 18 '20 at 14:42
