How to compare a new measurement against an existing null distribution?

Question

I have a dataset that describes the distances between two identical genes in two replicate experiments (gene a in rep 1 vs. gene a in rep 2). Let us assume that, due to biological variation, the distance between the same gene in two replicates should be small but not necessarily zero. Thus, from what I understand, this dataset would represent a null distribution.

I want to use these samples of observations to get an idea of what constitutes a 'non-significant' distance between two genes.

Ultimately, I want to be able to take two different genes, calculate their distance, and estimate how 'significant' the resulting distance metric is (e.g. is the distance similar to those in my replicate dataset, or is it very different?).

What is a reasonable approach to do this type of analysis / hypothesis testing?

EDIT:

I have carried out a series of biological experiments where the output of each experiment is an N x N matrix of counts. I then created a custom distance metric that takes two rows of counts and calculates the 'difference' between them (I will call this difference metric D). I calculated D for all pairwise comparisons and now have an array of D values called D_array.

My assumption, based on the biology, is that most values of D in D_array reflect no significant difference between the two rows of counts, and that only values at or above the 95th percentile of D_array represent real differences between two rows of counts. Let us assume that this is true, even if it doesn't make sense.

So this means that if D_array = [0, 1, 2, 3, 4 ... 99] (100 values), then only D scores of 95-99 actually represent a real difference between two rows of counts.
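
As a rough illustration of the cutoff described above, here is a minimal Python sketch. It assumes NumPy and uses only the toy D_array from this post, not real data; the names `threshold` and `flagged` are made up for the example.

```python
import numpy as np

# Toy array from the question: 100 distance values 0..99.
D_array = np.arange(100)

# 95th percentile of the reference distances
# (NumPy interpolates linearly between order statistics by default).
threshold = np.percentile(D_array, 95)

# Values strictly above the cutoff are the ones treated as real differences.
flagged = D_array[D_array > threshold]

print(threshold)
print(flagged)
```

With this toy array the cutoff falls between 94 and 95, so the flagged values are 95 through 99, matching the statement above.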

Note: this toy D_array is not representative of my data. My actual data has a distribution of values like this (black line represents the mean): https://imgur.com/usvvIgB

Given D_array, I want to be able to determine whether a newly calculated distance value D' is "significant" relative to my previous data, i.e. the distribution of D_array. Ideally, I would like to report some measure of 'significance' such as a p-value. By significance I mean the probability of having gotten a result at least as extreme as D'.
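
One way to make "as extreme as D'" concrete is an empirical (one-sided) p-value: the fraction of values in D_array that are at least as large as D'. The sketch below is only an illustration under the assumption that D_array really is a sample from the null distribution. The function name `empirical_p_value` is hypothetical, and the +1 in numerator and denominator is a common convention that keeps the estimate from being exactly zero.

```python
import numpy as np

def empirical_p_value(d_new, d_array):
    """One-sided empirical p-value: share of reference distances >= d_new.

    Assumes d_array is a sample from the null distribution (an assumption
    taken from the question, not something this code can verify).
    """
    d_array = np.asarray(d_array, dtype=float)
    n_extreme = np.count_nonzero(d_array >= d_new)
    # Add-one correction so the p-value is never exactly 0.
    return (n_extreme + 1) / (d_array.size + 1)

# Toy example with the array from the question.
D_array = np.arange(100)
p = empirical_p_value(97, D_array)
print(p)
```

The smallest p-value this can return is 1 / (len(D_array) + 1), so the resolution depends on how many reference distances are available.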

score 0 · Answer 1 · answered Jul 30 '21 at 07:51


I feel it is difficult to come up with a relevant approach here, since we would need to understand your data better. But at a high level, it seems that you are trying to compare different samples with a control group.

My understanding is that you would like to compare each sample with the control group individually (meaning not all of them at the same time). In that case, depending on the type of your distribution (and whether it meets the requirements of the analysis), something like a t-test for a difference in means could be relevant.

Note: we are talking about the null hypothesis and not a null distribution.


Pitouille


I have added more information regarding my data. I think that there are a number of assumptions for the t-test that I cannot be sure my data satisfies. My original thought was to use bootstrapping to estimate confidence intervals or a GMM to calculate the probability that a new metric fits my prior data. – cag104 Jul 30 '21 at 07:56
Bootstrapping is often a good tool to get a better understanding of your data and most likely it will lead you to the next steps in your analysis – Pitouille Jul 30 '21 at 08:09

So, if I got it right, it seems that you would like to demonstrate that your samples are "significantly similar"... which is a different demonstration. Spätzle explains an approach (equivalence testing) here that you might find interesting: https://stats.stackexchange.com/questions/535380/how-to-test-if-two-means-are-significantly-similar/535383#535383 – Pitouille Jul 30 '21 at 09:39
The test of equivalence suggested in that post does seem like exactly what I want to do. Very helpful! Thank you! – cag104 Jul 30 '21 at 16:53
I saw you created a new post, so I am replying here for the time being, because I do not know whether I am slow to understand or whether more explanation is required, and I do not want to pollute your main post. When you calculate your distance metric (the difference between 2 rows), do I understand correctly that the result is a 1 x n vector? What is the dimension of your array? N/2 x N? Actually, I am not sure I understand the process correctly. – Pitouille Jul 31 '21 at 14:58
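
The bootstrapping idea raised in the comments above could be sketched as follows. This is only an illustration, not a recommendation from either poster: it resamples D_array with replacement to see how stable the 95th-percentile cutoff is. It assumes NumPy and treats the values in D_array as independent, which pairwise distances generally are not, so the interval is likely too narrow for real data. The function name `bootstrap_percentile_ci` and its parameters are hypothetical.

```python
import numpy as np

def bootstrap_percentile_ci(d_array, q=95, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for the q-th percentile of d_array.

    Assumption: the entries of d_array are treated as independent draws.
    """
    rng = np.random.default_rng(seed)
    d_array = np.asarray(d_array, dtype=float)
    # Each row is one bootstrap resample (drawn with replacement).
    samples = rng.choice(d_array, size=(n_boot, d_array.size), replace=True)
    # q-th percentile of every resample.
    boot_q = np.percentile(samples, q, axis=1)
    # Percentile-method interval over the bootstrap estimates.
    lower, upper = np.percentile(boot_q, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# Toy example with the array from the question.
D_array = np.arange(100)
lower, upper = bootstrap_percentile_ci(D_array)
print(lower, upper)
```

A wide interval would suggest that the cutoff itself is uncertain, which matters when a new D' lands close to it.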
