I've been trying to figure out the correct way to calculate the p-value for my data. I originally created a simulation that randomly selected numbers that were greater than or less than a certain number in a specific range given by a dataset. Let me give you an example of what my datasets looked like for clarity:

Expected dataset:

exon_number    number_of_exons
4              20
5              16
2              31
4              15
15             20

Observed dataset:

exon_number    number_of_exons
21             30
15             18
16             20

For each line in my datasets, I randomly selected, say, 100 numbers between 1 and 20 (using a line from the expected dataset as an example) and determined whether each randomly selected number was greater than or less than the exon_number. If it was greater, I added it to the greater-than bin. I did this for all the lines in my datasets and created total greater-than and less-than bins for each entire dataset. However, since my datasets were of different sizes, many more greater-thans and less-thans were compiled for the "expected dataset". Is this problematic? Here are my real results:

               Expected    Observed
Less than      698402      11105
Greater than   918898      13573

I understand that Fisher's exact test is only for small numbers and should not be used here; am I correct? I'm trying to test whether the observed data cluster more toward the beginning or the end of a transcript compared to the expected results.

In that case, I've been using the chi-squared method as below:

import numpy
import scipy.stats
scipy.stats.chisquare([11105, 13573], f_exp=[698402, 918898])

However, my output gives me a p-value of 0. Am I doing something wrong? Am I running the test incorrectly? Is my data problematic? I'm new to programming and statistical testing. Any help and explanations would be greatly appreciated.

cosmictypist
  • Why are you doing this in the 1st place? What are you actually trying to figure out? – gung - Reinstate Monica May 19 '15 at 15:04
  • I'm trying to determine if there is a consensus for the exons in the observed data to be in the beginning or end of a transcript – cosmictypist May 19 '15 at 15:07
  • @gung I also edited my post to include what I'm trying to test – cosmictypist May 19 '15 at 15:10
  • If we tried to translate that into statistical terms, are you wondering if the exons are uniformly distributed? Or are you wondering if the mean exon number is higher or lower than a fixed value? – gung - Reinstate Monica May 19 '15 at 15:11
  • @gung I believe I am trying to determine if the mean exon number is higher or lower than a fixed value – cosmictypist May 19 '15 at 15:16
  • And you know what that value is in advance, is that right? – gung - Reinstate Monica May 19 '15 at 15:19
  • Well, yes. However, I think I'm interpreting your question incorrectly. Since every line in my dataset is different, the fixed value will always be different (in this case, the fixed value would be the first columns in my datasets). What I've done is find if my randomly selected numbers were before or after the fixed value. If the random value was greater than the fixed value, I would add 1 to my greater than bin. I don't think I understand your question correctly – cosmictypist May 19 '15 at 15:30
  • @gung In these terms, I would think that I'd be able to test to see if the frequency of my expected data is different than the frequency of my observed data. That's why I thought I could use the chi squared test – cosmictypist May 19 '15 at 15:34
  • TBH, I'm having trouble following your study, your hypothesis, & your analysis. Your procedure based on randomly sampling something is not going to be appropriate. You say that your "datasets were of different sizes", are your `expected` counts coming from observed data (perhaps from an earlier study)? – gung - Reinstate Monica May 19 '15 at 15:46
  • When you say they come from "the first columns in my datasets" ... is the first column somehow special? What does it represent? Is it a sample of observed data, too? – Glen_b May 20 '15 at 03:45

1 Answer

You are right that you don't want to use Fisher's exact test here. There isn't anything wrong with using it with large numbers, but it ends up being approximated then, so you lose the 'exact test' advantage that people sometimes want. In addition, Fisher's exact test assumes the marginals are fixed in advance, which isn't true here (and is in fact rarely true).

The reason your chi-squared test is not working properly is that the expected and observed counts are on different scales: the expected counts sum to 1617300, whereas the observed counts sum to 24678. A goodness-of-fit test assumes both sets of counts sum to the same total, so what you really want to compare your observed counts to is the expected proportions. Using your data (and R), here is an example:

chisq.test(x=c(11105, 13573), p=c(698402, 918898)/1617300)
#         Chi-squared test for given probabilities
# 
# data:  c(11105, 13573)
# X-squared = 33.185, df = 1, p-value = 8.381e-09
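
For readers working in Python, as in the question, a minimal sketch of the same comparison with scipy.stats.chisquare might look like the following. It assumes the only change needed is to rescale the expected counts so that they sum to the observed total; the variable names are illustrative and do not come from the original post.

```python
import numpy as np
from scipy import stats

# Counts taken from the question: [less than, greater than]
observed = np.array([11105, 13573])
expected_raw = np.array([698402, 918898])  # simulated counts, much larger total

# Turn the expected counts into proportions, then rescale them
# so that they sum to the same total as the observed counts.
expected_props = expected_raw / expected_raw.sum()
expected_counts = expected_props * observed.sum()

# One-sample goodness-of-fit test against the rescaled expected counts
statistic, p_value = stats.chisquare(observed, f_exp=expected_counts)
print(statistic, p_value)
```

The statistic and p-value should agree closely with the R output above. This only reproduces the goodness-of-fit calculation; it does not address whether that test suits the underlying question.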

I do not believe this is the right analysis for your question, though. I suspect you need something like a Mann-Whitney U-test or Wilcoxon signed rank test.

gung - Reinstate Monica
  • Are you allowed to use proportions to compare to your observed results with the chi-squared? – cosmictypist May 19 '15 at 16:24
  • @christylynn002, a one-sample goodness of fit chi-squared test compares observed counts with counts based on the expected proportions, so yes. I still suspect the chi-squared test isn't actually the right test here, though. Your implementation of the chi-squared test is based on this strange random sampling thing you are doing. So I don't think any resulting chi-squared test will be valid. – gung - Reinstate Monica May 19 '15 at 16:28
  • I think I figured out what I need to do. Thank you – cosmictypist May 19 '15 at 16:56
  • christylynn002 -- I'm glad you figured out what you need, but perhaps you could explain what you did to resolve your issue -- there are several unresolved problems or potential problems with your explanation of the question that a proper solution might clarify. – Glen_b May 20 '15 at 03:47