I plot Score (the y variable) against Year (the x variable) and want to show that Score increases as Year increases. The scatter plot suggests this, but I would like a p-value from an appropriate hypothesis test on the slope. I believe that the linear regression t-test is not valid because my Year data are not normally distributed (they are discrete and uniform).
I therefore set up a simulation, and I want to know whether what I have done is a valid technique.
- I find the linear regression slope of the actual data, which I call slopeActual.
- I run a simulation (say 1000 iterations) in which, at each iteration, I permute the y values (Score) and calculate the regression slope. I store these slopes in a list called slopeList.
- I calculate: p-value = P(permuted slope > slopeActual | there is no association), estimated as the proportion of values in slopeList greater than slopeActual.
When I did this for the data below I got a p-value of 0.0087.
So, the question again: Is this method valid?
data1 <-
structure(list(Year = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
5L, 5L, 6L, 6L, 6L), Score = c(-5.2, -2, -1, -3, -5, 3.8, 0,
-3.2, 1.2, 2.2, 11.5, -4, 10.2, 2, 12, 6.5, 6, 6, 9.2, 4.2, 13,
0.8, 8.5, 4.5, 6, 6, 2.7, 8, -3.8, 6.7, 4.5)), .Names = c("Year",
"Score"), class = "data.frame", row.names = c(NA, -31L))
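For reference, the simulation steps described above can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact code I ran: the helper `slopeOf`, the variable `nSim`, and the seed are names and values chosen here for illustration, and the data frame is rebuilt inline (same values as the `dput` output) only so the block runs on its own.

```r
# Same values as the dput() output above, rebuilt so the sketch is self-contained.
data1 <- data.frame(
  Year  = rep(1:6, times = c(6, 7, 2, 7, 6, 3)),
  Score = c(-5.2, -2, -1, -3, -5, 3.8, 0, -3.2, 1.2, 2.2, 11.5, -4, 10.2,
            2, 12, 6.5, 6, 6, 9.2, 4.2, 13, 0.8, 8.5, 4.5, 6, 6, 2.7, 8,
            -3.8, 6.7, 4.5)
)

# Hypothetical helper: least-squares slope of y regressed on x.
slopeOf <- function(x, y) coef(lm(y ~ x))[[2]]

set.seed(1)   # arbitrary seed, only so a run can be repeated
nSim <- 1000  # number of permutations (assumed, "say 1000" in the text)

# Step 1: slope of the actual data.
slopeActual <- slopeOf(data1$Year, data1$Score)

# Step 2: permute Score while keeping Year fixed, and record each slope.
slopeList <- replicate(nSim, slopeOf(data1$Year, sample(data1$Score)))

# Step 3: one-sided p-value, the share of permuted slopes above the observed one.
pValue <- mean(slopeList > slopeActual)
pValue
```

Because the permutations are random, the p-value will vary a little from run to run and with the number of permutations.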