1

I have written a function (in R-language) to perform a Theil-Sen regression with large sample approximation (please see below). It seems to work well for the one-sided case as illustrated in Hollander & Wolfe (2nd Edition Ch. 9)

NP.lm <-function(dat, X, Y, alpha) {

  dat <- dat[order(dat[,X]),]
  n <- nrow(dat)    
  combos <- combn(n, 2)
  i.s <- combos[1,]
  j.s <- combos[2,]
  Y.num <- dat[j.s,Y] - dat[i.s,Y]
  X.dom <- dat[j.s,X] - dat[i.s,X]      
  Z.p <-  qnorm(alpha/2, lower.tail=F)
  N <- (n*(n-1))/2
  s <- (Y.num/X.dom)[X.dom != 0]
  C.stat <- sum(sign(s))  

  slope <- median(s, na.rm=TRUE)                 
  intercept <- median((dat[,Y] - slope*dat[,X]), na.rm=TRUE)     

  C.star <- C.stat / sqrt(n*(n-1)*((2*n)+5)/18)
  p.val <- pnorm(-abs(C.star))

  out <- data.frame(colnames(dat[X]), colnames(dat[Y]), slope, intercept, p.val)

  out

} 

Example usage

set.seed(123) # to make the example fully reproducible
sigma <- as.matrix(data.frame(c(1, .7), c(.7, 1)))
N <- 50
dat <- data.frame( matrix(rnorm(N*nrow(sigma)), N, nrow(sigma)) %*% 
    chol(sigma) ) # cor structure via Choleski Decomp.
colnames(dat) <- c("X", "Y")
summary(lm(Y~X, data=dat))
NP.lm(dat=dat, X=1, Y=2, alpha=0.05)

  colnames.dat.X.. colnames.dat.Y..     slope  intercept        p.val
1                X                Y 0.6719583 0.08337517 1.227402e-07

# This p-value can't be right

Could someone please tell me what I am doing wrong, and how to make the test 2-sided.

Thank you so much for your help!

COOLSerdash
  • 25,317
  • 8
  • 73
  • 123
Patrick
  • 1,381
  • 1
  • 15
  • 21
  • 1
    (1) Your basic problem is that you work too hard: almost all this code is repeated at least once. That leads to errors and makes them more difficult to debug. For instance, did you really intend to have two different formulas for `C.stat`? (I don't think either one of them is correct: you seem to assume the data are sorted by increasing `X`.) (2) What precisely do you mean by "make this function two-sided"? – whuber Jun 20 '13 at 18:40
  • whuber, your point is well taken re: repeated code. I will absolutely change this. – Patrick Jun 20 '13 at 18:59
  • What I mean by "make this function two-sided"? I didn't articulate this well, sorry. I mean, how do I make the p-value output that from a two-sided test. – Patrick Jun 20 '13 at 18:59
  • I believe that aspect of your question is answered at http://stats.stackexchange.com/questions/19195/explaining-two-tailed-tests. – whuber Jun 20 '13 at 19:09
  • 1
    Do this instead: `s – whuber Jun 20 '13 at 19:25
  • whuber; I simplified the function by removing the alternate slope & added 1st step by sorting ascending according to X. Thank you for the link, I will look at this. – Patrick Jun 20 '13 at 19:25
  • That makes sense; I'd like to flag this as the correct answer but it appears I can only say its useful. Cheers! – Patrick Jun 20 '13 at 19:30
  • Possibly redundant now since your question seems to be answered, but this may help make it a better question for others - given that most of us probably don't have Hollander & Wolfe on hand and it's often hard to figure out what people are trying to do from code ... can you explain in more detail what H&W do there? – Glen_b Jun 20 '13 at 21:25
  • Glen_b: an annotated version? – Patrick Jun 21 '13 at 01:09
  • I'm not sure what you're asking; I'm asking you to explain in a sentence or two (perhaps with a little algebraic detail for clarity) what it is they talk about there. One might guess from the context, but one might also guess wrong. I want an explicit explanation of what your code is trying to achieve, in order that we understand what it is actually meant to do rather than what it appears to do. With the intent of your code properly understood, it's clearer how it might be modified. – Glen_b Jun 21 '13 at 01:30
  • Hollander & Wolfe (Nonparametric Stat. Methods) describe & provide step by step procedures for classical NP methods. I am trying to write working code for the Theil-Sen regression, which basically utilizes the median value of slopes between all possible data points. I initially wanted help as my p-value calc. seemed incorrect & whuber indicated additional issues. I now think it might be easier to let the functions cor.test(stats) or Kendall(Kendall) do this for me for several reasons. (see next comment) – Patrick Jun 21 '13 at 18:01
  • 1) the C-stat of Theil-Sen is the Kendall's k-stat computed between x and Y-b0*x (giving equivalent statistic values and p-values), 2) cor.test & Kendall functions (mentioned above) perform exact calculations when sample sizes are too small for large sample approximations. That being said, the p-values from the code given above (with whubers corrections) do not give the same as that from Kendall() or cor.test. Am I misunderstanding something else here. – Patrick Jun 21 '13 at 20:58

0 Answers0