I have a population in which some members have an event A and others don't. Event A is my target class. I also have a set of variables/features for the population which I can use in a modeling (supervised learning) setting. Let's say one of the features is age. I'd like to quantify the impact of age on event A in an intuitive way. Assume the population size is 2000, of which 100 have event A and the rest don't. I chose a cut point for age, e.g. less than 40 years old versus greater than 40 years old. Here is the distribution of the population:

                   Have event A   Don't have event A
Less than 40             20              100
Greater than 40          80             1800

To show the impact of age on the event, I compute the following: p(have event A | age less than 40) / p(have event A | age greater than 40) = (20/120) / (80/1880)

However, I'd like to find something like a p-value for this calculation. How can I do that?

HHH

1 Answer

If you are okay with the asymptotic variance obtained from the (natural) log transformation and the delta method, then $$\text{SE}(\log(RR)) = \sqrt{\frac1{20}+\frac1{80}-\frac1{120}-\frac1{1880}}\approx 0.2316$$

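With the observed $\log(RR)=\log\frac{20/120}{80/1880}\approx 1.3652$ (natural log, the scale on which this standard error is derived) and assuming you want to test whether $RR=1$, which is equivalent to $\log(RR)=0$, the $z$ statistic is about $1.3652/0.2316\approx 5.895$. So the two-tailed p-value is roughly $4\times 10^{-9}$.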
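The test can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function name `log_rr_wald_test` and its argument names are illustrative, not from any package. It assumes the natural logarithm, since that is the scale on which the delta-method standard error holds, and a normal approximation for the test statistic.

```python
import math

def log_rr_wald_test(a, n1, c, n2):
    """Wald test of H0: RR = 1 on the log scale.

    a, c   : number of events in group 1 and group 2
    n1, n2 : group sizes (events + non-events)
    Returns (RR, log RR, SE of log RR, z, two-sided p-value).
    """
    rr = (a / n1) / (c / n2)
    log_rr = math.log(rr)  # natural log
    # Delta-method standard error of log(RR)
    se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    z = log_rr / se
    # Two-sided tail of the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    return rr, log_rr, se, z, p

# Counts from the question: 20 of 120 under 40, 80 of 1880 over 40.
rr, log_rr, se, z, p_value = log_rr_wald_test(20, 120, 80, 1880)
print(f"RR = {rr:.4f}, log(RR) = {log_rr:.4f}, SE = {se:.4f}")
print(f"z = {z:.3f}, two-sided p = {p_value:.2e}")
```

Using `math.erfc` rather than `1 - cdf` avoids losing precision when the p-value is very small.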

In case you are interested, here is a link that shows the derivation of the variance: Why doesn't standard error for ratios have log in it?

An exact confidence interval is also available, but it is not as easily obtained: How to calculate the "exact confidence interval" for relative risk?
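For comparison, the same standard error also gives an approximate (Wald) interval, which is not the exact interval mentioned above. The sketch below builds the interval on the natural-log scale and exponentiates the endpoints; the function name `rr_wald_ci` and the default 95% critical value are assumptions made for illustration.

```python
import math

def rr_wald_ci(a, n1, c, n2, z_crit=1.959964):
    """Approximate (Wald) confidence interval for the relative risk.

    The interval is log(RR) +/- z_crit * SE on the log scale,
    then mapped back to the RR scale with exp().
    z_crit = 1.959964 corresponds to a 95% interval.
    """
    log_rr = math.log((a / n1) / (c / n2))
    se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    return math.exp(log_rr - z_crit * se), math.exp(log_rr + z_crit * se)

# Counts from the question: 20 of 120 under 40, 80 of 1880 over 40.
lower, upper = rr_wald_ci(20, 120, 80, 1880)
print(f"95% Wald CI for RR: ({lower:.2f}, {upper:.2f})")
```

If the interval excludes 1, the two-sided Wald test of $RR=1$ rejects at the matching level.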

  • is the SE(log(RR)) a confidence factor like p-value? – HHH Aug 17 '17 at 00:15
  • In a sense, yes. The standard error basically measures the spread/certainty of your estimate. A higher SE means you're less confident in your estimate (wider confidence interval), so the interval is more likely to contain the value of your null hypothesis, leading to a higher p-value. – Jirapat Samranvedhya Aug 17 '17 at 15:55
  • thanks, and may I know what values of SE are typically considered high confidence (like p-value <0.05)? – HHH Aug 17 '17 at 16:02
  • Suppose the observed statistic is $z$, the null hypothesis value is $\mu$, and the standard error is $se$. Assuming a normal distribution, the p-value for the two-tailed test is $$P\left(|Z|>\frac{|z-\mu|}{se}\right)$$ where $Z\sim N(0,1)$. That is, you also need to take into account the observed statistic and the null hypothesis. – Jirapat Samranvedhya Aug 17 '17 at 16:05
  • for the calculation you have above with SE=0.23, is it considered a high confidence or low? – HHH Aug 17 '17 at 16:13
  • As I mentioned, SE has to be considered in relation to the estimate (observed statistic). For example, if SE=0.23 and my estimate is 100, then I'm fairly confident that the interval (99.54, 100.46) would contain the true value. On the other hand, if my estimate is something like 0.01, then my confidence interval would be (-0.45, 0.47). – Jirapat Samranvedhya Aug 17 '17 at 16:38
  • so in the calculation I have in my original post, the value is 3.91, so 0.23 as the interval is good enough. Thanks. – HHH Aug 17 '17 at 17:49
  • If I have (5/5) / (222/358960) then the SE = SQRT(1/5+1/222-1/5-1/358960) = 0.067, which is considerably small compared to my observed statistic of 1616, right? – HHH Aug 17 '17 at 19:51
  • Everything has to be on the same log scale. Your observed statistic is actually $\log \frac{5/5}{222/358960} \approx 7.3883$ (natural log), so an SE of 0.067 is indeed quite small. – Jirapat Samranvedhya Aug 17 '17 at 19:58
  • and what would be the p-value in this calculation? – HHH Aug 17 '17 at 20:04
  • $7.3883/0.0671\approx 110.1$, so it is $P(|Z|>110.1)$, which is essentially zero. – Jirapat Samranvedhya Aug 17 '17 at 20:17
  • sorry for so many questions, where did you get the final p-value from? – HHH Aug 17 '17 at 20:40
  • It's the probability of a standard normal being more extreme than 110.1 in absolute value. – Jirapat Samranvedhya Aug 17 '17 at 20:42
  • and considering my original question, is it fair to consider the z-score? – HHH Aug 17 '17 at 20:44
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/64050/discussion-between-h-z-and-jirapat-samranvedhya). – HHH Aug 18 '17 at 13:35
  • Going back to my original post, I wanted to find the impact of age on event A. Using these transformations, is the impact 1616 or 7.3883? – HHH Aug 18 '17 at 13:39
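The point made in the comments, that a standard error only means something relative to the estimate and the null value, can be sketched as follows. The helper names `two_sided_p` and `rough_interval` are hypothetical, the null value of 0 is an assumption made for illustration, and the multiplier of 2 simply mirrors the rough intervals quoted in the comments.

```python
import math

def two_sided_p(estimate, null_value, se):
    """Two-sided normal p-value: P(|Z| > |estimate - null_value| / se)."""
    z = (estimate - null_value) / se
    return math.erfc(abs(z) / math.sqrt(2))

def rough_interval(estimate, se, multiplier=2.0):
    """Rough interval estimate +/- multiplier * se (about 95% for multiplier 2)."""
    return estimate - multiplier * se, estimate + multiplier * se

# Same SE, very different conclusions depending on the estimate.
for estimate in (100, 0.01):
    low, high = rough_interval(estimate, 0.23)
    p = two_sided_p(estimate, 0, 0.23)  # null value 0 is assumed here
    print(f"estimate = {estimate}: interval = ({low:.2f}, {high:.2f}), p = {p:.3g}")
```

With an estimate of 100 the interval is far from 0 and the p-value is negligible; with an estimate of 0.01 the interval straddles 0 and the p-value is close to 1, even though the SE is identical.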