
I plot Score (y variable) against Year (x variable) and want to show that the score increases as the year increases. The scatter plot does indicate this, but I would like a p-value from an appropriate hypothesis test on the slope. I believe that a linear regression t Test is not valid as my data Year is not normally distributed (it being discrete and uniform).

I then decided to run a simulation, and I want to know whether what I have done is a valid technique.

  1. I found the linear regression slope of the actual data, which I called slopeActual.
  2. I ran a simulation (say 1000 times) in which, on each iteration, I permuted the y values (Score) and calculated a regression slope. I stored these values in a list I called slopeList.
  3. I calculated: p-value = P(slopeActual>0 | there is no association) = proportion of values in slopeList greater than slopeActual.

When I did this for the data below I got a p-value of 0.0087.

So, the question again: Is this method valid?

data1 <-
structure(list(Year = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 
5L, 5L, 6L, 6L, 6L), Score = c(-5.2, -2, -1, -3, -5, 3.8, 0, 
-3.2, 1.2, 2.2, 11.5, -4, 10.2, 2, 12, 6.5, 6, 6, 9.2, 4.2, 13, 
0.8, 8.5, 4.5, 6, 6, 2.7, 8, -3.8, 6.7, 4.5)), .Names = c("Year", 
"Score"), class = "data.frame", row.names = c(NA, -31L))
Tim
Geoff

1 Answer

  1. This premise of the question is misplaced: "I believe that a linear regression t Test is not valid as my data Year is not normally distributed (it being discrete and uniform)."

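    ... there's no such assumption in regression. Your x-variables are not assumed to have any particular distribution (indeed, there is no assumption about the marginal distribution of the y-variable either). The normality assumption behind the usual t-test concerns the errors, that is, the conditional distribution of y given x.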

  2. "p-value [...] = proportion of values in slopeList greater than slopeActual"

    Very nearly correct.

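    You should include the original sample's slope in your list, and then count the cases greater than or equal to that one. With N permutations this gives p = (1 + number of permuted slopes >= slopeActual) / (N + 1).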

    I'd also suggest doing it more than 1000 times; your estimate of the p-value will be pretty variable and if it happens to come out close to a boundary you may want it to be reasonably precise (it won't make much difference in this case unless you're doing a 1% test).
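The corrected procedure can be sketched in R roughly as follows. This is a minimal illustration rather than the code actually used for the answer: the helper `ols_slope`, the seed, and the tolerance `tol` are assumptions introduced here, and the data frame is rebuilt from the values in the question.

```r
# Data from the question (31 observations)
data1 <- data.frame(
  Year  = rep(1:6, times = c(6, 7, 2, 7, 6, 3)),
  Score = c(-5.2, -2, -1, -3, -5, 3.8, 0, -3.2, 1.2, 2.2, 11.5, -4, 10.2,
            2, 12, 6.5, 6, 6, 9.2, 4.2, 13, 0.8, 8.5, 4.5, 6, 6, 2.7, 8,
            -3.8, 6.7, 4.5)
)

# Least-squares slope of y on x (hypothetical helper);
# same value as coef(lm(y ~ x))[2], but much cheaper to compute
ols_slope <- function(x, y) cov(x, y) / var(x)

set.seed(1)       # arbitrary seed, only so the run is reproducible
n_perm <- 10000

slope_actual <- ols_slope(data1$Year, data1$Score)

# Slopes after shuffling Score, which breaks any association with Year
slope_perm <- replicate(n_perm, ols_slope(data1$Year, sample(data1$Score)))

# One-sided p-value: the observed arrangement is counted as one of the
# permutations (the +1 terms), and ties count as "at least as extreme".
# tol guards against floating-point rounding in the >= comparison.
tol <- 1e-12
p_value <- (1 + sum(slope_perm >= slope_actual - tol)) / (n_perm + 1)
p_value
```

Including the observed slope means the estimated p-value can never be exactly zero, which is the point of the correction.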

On your data, with the above modifications and 10000 permutations per run (each run took about 17 or 18 seconds on my laptop), I got p-values just below 0.0084, 0.0083, and 0.0092 across three runs.
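The spread between runs is about what a binomial standard error would predict. A rough sketch, assuming the true p-value is near 0.008 and treating the permutations as independent draws (`mc_se` and `p_guess` are names made up for this illustration):

```r
# Approximate Monte Carlo standard error of a simulated p-value,
# treating each permutation as an independent Bernoulli trial
mc_se <- function(p, n_perm) sqrt(p * (1 - p) / n_perm)

p_guess <- 0.008                     # assumed value, suggested by the runs above
n_perm  <- c(1000, 10000, 60000)     # numbers of permutations to compare

se <- mc_se(p_guess, n_perm)
data.frame(n_perm = n_perm, se = round(se, 5))
```

Under these assumptions the standard error is about 0.0028 with 1000 permutations, 0.0009 with 10000, and 0.0004 with 60000, so 1000 permutations leave the estimate uncertain in the third decimal place.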

(The one-tailed p-value under the ordinary regression assumptions is a bit below 0.0078; doing a lot more permutations seems to be getting us quite close to that -- after 60000 permutations I have a p-value of about 0.00797.)
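For comparison, the one-tailed p-value under the ordinary regression assumptions can be obtained along these lines. This is a sketch using base R's `lm`; the data frame is rebuilt from the question, and the names `fit`, `est`, and `p_one_sided` are chosen here for illustration.

```r
# Data from the question (31 observations)
data1 <- data.frame(
  Year  = rep(1:6, times = c(6, 7, 2, 7, 6, 3)),
  Score = c(-5.2, -2, -1, -3, -5, 3.8, 0, -3.2, 1.2, 2.2, 11.5, -4, 10.2,
            2, 12, 6.5, 6, 6, 9.2, 4.2, 13, 0.8, 8.5, 4.5, 6, 6, 2.7, 8,
            -3.8, 6.7, 4.5)
)

fit <- lm(Score ~ Year, data = data1)

# Row of the coefficient table for the slope:
# Estimate, Std. Error, t value, Pr(>|t|)
est <- summary(fit)$coefficients["Year", ]

# summary() reports a two-sided p-value; for the alternative "slope > 0"
# take the upper tail of the t distribution on the residual df
p_one_sided <- pt(est[["t value"]], df = df.residual(fit), lower.tail = FALSE)
p_one_sided
```

Because the estimated slope is positive, this equals half of the two-sided value in the `Pr(>|t|)` column.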

Glen_b
  • Thanks for the response. According to http://www.statisticssolutions.com/assumptions-of-linear-regression/ one of the assumptions for linear regression t test is multivariate normality; that's why my belief was as it was. – Geoff Aug 26 '17 at 09:13
  • That site does indeed say that: "*the linear regression analysis requires all variables to be multivariate normal*" -- unfortunately, whoever wrote that has *no damn clue what they're talking about*. With posts here, there's usually a degree of review -- you can check the votes and comments and you can check the reputation and other answers of people that respond to questions and get some idea of how much weight you want to put on what they say. By contrast, if you're reading a random website on the internet you could have *anything* -- it might be fine or it might be complete nonsense... ctd – Glen_b Aug 27 '17 at 01:10
  • ctd... so I'd start with the various posts that discuss regression assumptions here (there's some disagreement over what strictly counts as an *assumption* but not much actual disagreement over whether they apply or not). I will say that if you derive the regression results, it's obvious what you assumed when, and anyone who can't derive them - or at least show you where you can find the derivations - should be listened to with great caution. You don't know where they got it from. – Glen_b Aug 27 '17 at 01:14
  • Resources on regression assumptions: Normality in regression: **1.** https://stats.stackexchange.com/questions/12262/what-if-residuals-are-normally-distributed-but-y-is-not **2.** https://stats.stackexchange.com/questions/280189/linear-regression-and-assumptions-about-response-variable **3.** https://stats.stackexchange.com/questions/148803/how-does-linear-regression-use-the-normal-distribution **4.** https://stats.stackexchange.com/questions/130775/why-do-we-care-so-much-about-normally-distributed-error-terms-and-homoskedastic ...ctd – Glen_b Aug 27 '17 at 02:03
  • ctd... Some of the main threads on what the assumptions are. Read these with some care - if you take into account the various comments, you'll probably have a reasonable picture **1.** https://stats.stackexchange.com/questions/16381/what-is-a-complete-list-of-the-usual-assumptions-for-linear-regression **2.** https://stats.stackexchange.com/questions/32285/assumptions-of-generalised-linear-model **3.** https://stats.stackexchange.com/questions/86830/transformation-to-normality-of-the-dependent-variable-in-multiple-regression ...ctd – Glen_b Aug 27 '17 at 02:03
  • ctd... Where do the assumptions come from: https://stats.stackexchange.com/questions/55113/where-do-the-assumptions-for-linear-regression-come-from • Checking assumptions: https://stats.stackexchange.com/questions/45685/testing-assumptions-of-multiple-regression • Assumptions with categorical independent variables: https://stats.stackexchange.com/questions/226584/regression-assumptions-not-required-for-categorical-dummy-variables • illustration discussing two assumptions: https://stats.stackexchange.com/questions/96619/validity-of-regression-assumptions-on-residual-plot ...ctd – Glen_b Aug 27 '17 at 02:04
  • ctd... when is a problem suited to linear regression? https://stats.stackexchange.com/questions/177015/clues-that-a-problem-is-well-suited-for-linear-regression Why are diagnostics based on residuals: https://stats.stackexchange.com/questions/76163/why-are-diagnostics-based-on-residuals ... ctd – Glen_b Aug 27 '17 at 02:05
  • Finally, some web things: A more reliable source on assumptions: http://andrewgelman.com/2013/08/04/19470/ (my list would differ somewhat as I discuss [here](https://stats.stackexchange.com/questions/152567/independence-of-error-in-linear-regression/152579#152579) but it's a good list) There's also [Wikipedia](https://en.wikipedia.org/wiki/Linear_regression#Assumptions) which is more-or-less okay on this, but since articles may change at any time, a degree of caution is sometimes needed. – Glen_b Aug 27 '17 at 02:05