Twenty samples should be the same, but how should I test that?

Question

I have a table derived from a group of polypeptides or proteins. Because they are proteins, they are made up of 20 amino acids. If a group of proteins contains a random sample of amino acids, then the probability of each amino acid should be $\Pr(X = x) = 0.05$.

For example, I find: {0.45, 0.48, 0.5, 0.55, 0.52, ..., 0.5} ($n=20$ values)

Q #1: How can I determine if these 20 values are all within a 95% range of a random sample or not, where $n=100$ proteins.

Q #2: I have many groups (files) of proteins, which can contain anywhere from 100 to 10,000 polypeptides. I assume that the number of polypeptides will affect the standard error, and therefore the 95% range around the expected amino-acid frequency of 5%.

Is this simply a calculation using a normal distribution and using Bonferroni adjustment of (5%/20 samples) for my hypothesis testing?

Each of the 20 amino acids does not have the same representation within a protein. Some amino acids are much higher than 5% of the total and others are much lower. See [this paper](https://doi.org/10.1371/journal.pone.0077319) for an interesting comparison of amino-acid compositions among organisms in different environments. Please say more about a specific hypothesis you wish to test, as the equal-representation hypothesis was ruled out decades ago. — EdM, Mar 01 '19 at 16:24
Yes, I realize that the actual probability of each amino acid is not the same, but I am considering using this line of thinking for feature selection or reduction. For example, if an amino acid falls outside the 95% confidence interval, then it might be a candidate for feature selection. In fact, I have found via PCA that some amino acids have much larger variances than others. But my question remains: what levels are significant or not from a statistical viewpoint? — mccurcio, Mar 01 '19 at 17:22
Could you please explain the meanings of these numbers that you "find"? What do they represent? — whuber, Mar 01 '19 at 17:33
You will need to specify your hypothesis in more detail. There are ways to analyze composition data, for example based on the [multinomial distribution](https://en.wikipedia.org/wiki/Multinomial_distribution), but you can't say what is "significant or not from a statistical viewpoint" unless you are more precise in stating the question you are trying to answer. For example, are you concerned with the mean composition values (as suggested by your question) or the variance of compositions among proteins (as suggested by your comment)? — EdM, Mar 01 '19 at 17:50
@whuber I have a text file containing 100 lines (proteins), each of length 101. **For simplicity's sake**, each line is made up of 20 amino acids (AA), e.g. letters (A to T); line1: "ABCD...". I **FIND** / calculate the % AA composition of the file: % AA composition of (i) = number of AA(i) / total count of AA, where (i) = 1 to 20 amino acids; letters (A to T). 1 of 2 — mccurcio, Mar 02 '19 at 02:16
@whuber 1) Assume a normal distribution to describe the random placement of 20 letters (A to T) in a line. 2) Assume that the 20 amino acids are randomly scattered throughout all proteins, so E[% AA composition of i] = 1/20. 3) I can determine statistically whether one amino acid is inside or outside a two-tailed test, using 95% as my critical point: mean +/- 1.96*stderror. 4) Shouldn't I use a Bonferroni correction (5%/20 as my critical value) to test whether ALL 20 amino acids are within mean +/- 1.96*stderror? 5) [This is exactly my issue in a comical way](https://www.xkcd.com/882/) 2 of 2 — mccurcio, Mar 02 '19 at 02:18

score 1 · Answer 1 · answered Mar 03 '19 at 22:36

A normal distribution strictly applies only to continuous data, while you have categorical data: counts of each of 20 amino acids* in sets of individual proteins, with each protein being a sequence of dozens to hundreds or more of those amino acids. With a fixed number of categories (20 amino acids), you are dealing with multinomial distributions to describe the number of counts of each amino acid within proteins.
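
As a rough sketch of why the total count matters (the symbols $N$, $p_i$, and $X_i$ are notation introduced here, and independent draws of each residue are an assumption): if $N$ residues are drawn independently with probability $p_i$ for amino acid $i$, the count $X_i$ of that amino acid satisfies

$$E[X_i] = N p_i, \qquad \operatorname{Var}(X_i) = N p_i (1 - p_i), \qquad \operatorname{SE}\!\left(\frac{X_i}{N}\right) = \sqrt{\frac{p_i (1 - p_i)}{N}}.$$

Under the equal-frequency assumption $p_i = 0.05$, a file of 100 sequences of length 101 ($N = 10{,}100$) would give a standard error of about $0.0022$ for each observed proportion, and that standard error shrinks in proportion to $1/\sqrt{N}$ for larger files.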

A classic way to compare count data (in your case, the numbers of each amino acid in a complete protein sequence, evidently predicted from the codons in the expressed mRNA sequence**) against a theoretically expected frequency (in your case, that the frequency of each amino acid is 0.05, a hypothesis well known not to be true) is a chi-square test. It gives an overall test of the null hypothesis that all observed frequencies equal their expected values. This test takes into account the total number of amino-acid counts, which is important here because different proteins have different amino-acid sequence lengths and random sampling would make observed frequencies more variable in smaller proteins.
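
A minimal sketch of that overall test, using only the Python standard library. The function name `chi_square_uniform`, the simulated sequence, and the random seed are hypothetical illustrations, not your data. The constant 30.144 is the 0.95 quantile of a chi-square distribution with 19 degrees of freedom. If SciPy is available, `scipy.stats.chisquare` should return the same statistic along with a p-value.

```python
import random

def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected frequencies."""
    total = sum(counts)
    expected = total / len(counts)  # same expected count in every category
    return sum((obs - expected) ** 2 / expected for obs in counts)

# 0.95 quantile of chi-square with 20 - 1 = 19 degrees of freedom.
CRITICAL_VALUE_19_DF = 30.144

# Hypothetical data: 100 sequences of length 101, pooled into one string,
# with each residue drawn uniformly from the 20 one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
random.seed(0)
pooled = "".join(random.choice(AMINO_ACIDS) for _ in range(100 * 101))

counts = [pooled.count(aa) for aa in AMINO_ACIDS]
statistic = chi_square_uniform(counts)

print(f"chi-square statistic: {statistic:.2f}")
print("reject equal frequencies at the 5% level:",
      statistic > CRITICAL_VALUE_19_DF)
```

Because the statistic is built from raw counts rather than percentages, the same 5% threshold applies whether the file holds 100 or 10,000 polypeptides.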

You could use a chi-square test to test the observed amino-acid counts against a more realistic theoretical distribution. See this paper for an interesting recent examination of differences of amino-acid compositions of proteins among microorganisms in different habitats, and for links to the literature. You could also use a chi-square test to examine whether your proteins differ among each other in composition without regard to pre-specified hypotheses about amino-acid frequencies.

If an overall chi-square test does not rule out the null hypothesis, then you shouldn't proceed to further individual comparisons. If the null hypothesis is rejected and you want to proceed to more detailed tests among amino acids or among proteins, then you are correct that some form of multiple-testing correction is required. The Bonferroni correction you propose can be too stringent; see this Wikipedia page for links to other possibilities. Those detailed tests should be appropriate for categorical data; depending on the amino-acid counts you might consider multinomial tests, Fisher's exact test, or G-tests.
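
To illustrate how a correction changes the per-amino-acid decisions, here is a hedged sketch comparing Bonferroni with the less stringent Holm step-down procedure. For brevity it uses a normal approximation to the binomial for each amino acid, which is a simplification swapped in for the categorical tests named above and is only reasonable when counts are large. The function names and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(count, total, p0):
    """Two-sided p-value for one proportion (normal approximation)."""
    se = sqrt(p0 * (1 - p0) / total)
    z = (count / total - p0) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m, for m tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha / (m - rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject

# Hypothetical counts for 20 amino acids, totalling 10,100 residues.
counts = [605, 405] + [505] * 18
total = sum(counts)
p_values = [two_sided_p(c, total, 0.05) for c in counts]

print("Bonferroni rejections:", bonferroni(p_values))
print("Holm rejections:      ", holm(p_values))
```

Holm rejects everything Bonferroni rejects and sometimes more, while still controlling the family-wise error rate, which is why it is often preferred when Bonferroni seems too strict.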

You might also consider adapting the suggestions in this answer to model your data with a generalized linear model that can provide "comparisons of meaningful quantities, rather than just chi-squares and values." That might be a better approach if you are proceeding to examine hundreds of proteins. This page provides some hints on generalized linear modeling of multinomial data.

*You presumably are ignoring selenocysteine in your count of proteinogenic amino acids.

** With 20 amino acids you are counting asparagine and glutamine separately from aspartate and glutamate, which is most easily done based on coding sequences. Classic chemical analyses of amino-acid composition typically converted asparagine and glutamine to their acid forms, so be careful if you compare against that type of literature.

Thank you for a more comprehensive explanation, I do appreciate it. — mccurcio, Mar 04 '19 at 20:17
