3

I have used the SIN link to estimate probabilities, mostly with Program MARK. However, I am not sure how the SIN link works. I know the SIN link enables parameter estimation on boundaries (when the probability is 0 or 1), while the logit link seems to have problems in those situations.

Given that the SIN link works in Program MARK, it seems to me that it should also work with logistic regression in glm in Program R since both are estimating proportions of successes out of n trials. However, there does not appear to be a built-in SIN link in R for glm. I asked a question about this on Stack Overflow when I thought it would be a simple programming issue and learned R does allow user-specified links in glm.

However, nobody posted the R code for the SIN link and I have encountered problems with the inverse of the link when trying to do it myself. Namely, there may be multiple solutions given the periodic nature of the SIN function. I imagine the search space must be constrained somehow to avoid multiple solutions.

The issue now seems more like a statistical problem than a programming problem and I am posting here hoping someone will either post the R code for the SIN link in glm or provide a description of how the SIN link works sufficient for me to codify it myself. The other option I considered was to post a bounty on my Stack Overflow question and I might still do that later.

Specifically, I am stuck on how to create the inverse of the link and whether the search space must be constrained to avoid multiple solutions resulting from the periodic nature of the SIN function. I am aware of the family statement in glm and the make.link statement and that user-specified links are possible in R, for example with the logexp link (?family in R).

Here is a link to my earlier Stack Overflow question:

https://stackoverflow.com/questions/25431923/using-the-sin-link-with-logistic-regression

The SIN link is:

(sin(X * Beta) + 1) / 2

where:

X = design matrix
Beta = parameter estimates

From my earlier post, note that sin(z) is bounded between -1 and +1 inclusive. Therefore, (sin(z) + 1) is bounded between 0 and 2 inclusive. Therefore, (sin(z) + 1)/2 is bounded between 0 and 1 inclusive. This inclusivity is probably what enables the link to return estimates of proportions even when probability is 0 or 1.

My Stack Overflow post also contains R code for an 'exhaustive search' further suggesting the link probably does work for estimation of proportions if the search space is somehow constrained.

Thank you for any help. Hopefully this is an acceptable use of cross-posting given the statistical focus of the present question and the programming focus of the earlier question.

I should point out that in Program MARK the SIN link seems to be reserved for models that contain only one entry per row in the design matrix, i.e., without covariates.

EDIT

The inverse of the SIN link, as near as I can tell, is: lin.pred <- asin((2*y) - 1) This returns the correct linear predictor as long as the linear predictor, z, falls in the range -1 <= z <= 1. If covariates are not allowed, as alluded to above, then perhaps this inverse function will always be acceptable. Of course, I am still going to assume for now that dummy variables are allowed in the linear predictor (at least if there is no intercept).

The above proposed inverse link might solve the immediate problem of the inverse and constraints on the search space. The next step is to incorporate this inverse link into a user-specified link function in glm in R along with any additional required information to estimate proportions.

Mark Miller
  • 639
  • 7
  • 12
  • You'll need to write a family and link-glm object. Check out ?family and ?make.link and the corresponding code. – Momo Aug 24 '14 at 13:30
  • @Momo Thank you. I got that far before posting today. I am stuck on how to write the inverse link and whether the search space needs to be constrained. – Mark Miller Aug 24 '14 at 13:33
  • This is good to know for a good answer. Perhaps edit the question to reflect that you need those elements. The constrained search space bit confuses me too; shouldnt the model allow all real values? – Momo Aug 24 '14 at 13:50
  • @Momo Thank you. I have added the fifth paragraph to emphasize those points. – Mark Miller Aug 24 '14 at 14:00

2 Answers2

1

I'm not clear on why you are starting a new topic on the subject, nor am I clear on the motivation for pursuing a $\sin$ link. A binary model with a $\sin$ link is no longer a logistic model. And there are no boundary problems with ordinary logistic regression. You seem to find something wrong with infinite parameter estimates in order to achieve probabilities of zero or one but in fact this does not present a problem for the logistic model.

If you wanted to fit a variance-stabilizing model you would use the $X\beta = \arcsin(\sqrt p)$ model, yielding $p = \sin^{2}(X\beta)$. This model looks very appealing (information matrix = $X'X$ times a constant) but when you plot the log-likelihood surface you find it is too flat to lead to efficient estimates of $\beta$.

Frank Harrell
  • 74,029
  • 5
  • 148
  • 322
  • Thank you. Primarily now I am seeking understanding of how the SIN link works, rather than how to program it (albeit so I can program it myself if need be). – Mark Miller Aug 24 '14 at 13:30
  • 1
    Do you not have any interest in the questions of why you would want to use it and why the ordinary logistic model already works? – Frank Harrell Aug 24 '14 at 13:32
  • Yes. Those topics do interest me. From the example code in my earlier post I was under the impression the logit link had trouble on boundaries. So, I need to study your response and return to that earlier code to understand better what you wrote and how the model works at boundaries. Thank you again for your time. – Mark Miller Aug 24 '14 at 13:38
  • 1
    Sounds good. Be sure to precisely describe the challenge you face, because the apparent problem may not really be a problem. Note that it is often the case that one chooses a simple interpretable model such as logistic and relaxes assumptions about functional form by expanding continuous predictors using regression splines. – Frank Harrell Aug 24 '14 at 15:34
  • Thank you, Dr. Harrell. I am honored that someone with your background would take the time to respond to my questions. And I will study your comments. Although, I also wish to continue to try to understand the SIN link. I have been using it for years in my research at the recommendation of others who know far more than me, and I suppose it is high time I figured it out. I cannot even seem to find it mentioned on the internet, let alone statistics texts. – Mark Miller Aug 24 '14 at 16:42
  • 1
    I have never before seen it used for this purpose. – Frank Harrell Aug 24 '14 at 18:58
  • http://warnercnr.colostate.edu/~gwhite/mark/markhelp/link_functions.htm This appears to be one of the few (~3000) mentions of the sine link function on all of Google. I still don't understand the motivation for using it, though. The whole document uses non-standard terminology that I'm unfamiliar with. – Sycorax Aug 26 '14 at 13:45
  • 1
    Mark, have you seen Section 4 of Chapter 6 in the MARK user's manual (available at http://www.phidot.org/software/mark/docs/book/)? It says, again and again and again, that usually you really don't want to use a sine link. Apparently its sole purpose is to deal with some numerical problems with an "identity" design matrix (which the example in your answer is not) to obtain "initial estimates" (see p. 6-23). In fact, I cannot find any clear explanation in the manual for why one would want to use a sine link in an actual analysis--I find only extensive explanations of the problems it can cause. – whuber Aug 26 '14 at 15:11
  • 1
    @whuber Thank you. I will look at that section. My understanding was that Program MARK more-or-less using the sine link as the default if there are no covariates, i.e., if there is only one entry per row in the design matrix (regardless of whether that matrix is an identity matrix). I know lots of problems will arise if you use the sine link along with covariates. Maybe I am wrong. – Mark Miller Aug 26 '14 at 15:21
  • @whuber I also posted my question this morning at the Program MARK help forum with a link back to this Cross Validated post. So, maybe Dr. White (who wrote Program MARK) or someone else who knows more about the sine link than me will see this post and provide feedback. – Mark Miller Aug 26 '14 at 15:24
  • @FrankHarrell I asked on the Program MARK forum why anyone would every want to use the sine link. The author of the program responded, 'The sine link is most useful when parameter estimates are at the boundary, because the 2nd derivative of the likelihood is non-zero for the sine link, whereas you get a zero (within numerical precision) for the links that are asymptotic at the boundary.'. – Mark Miller Aug 26 '14 at 17:45
  • @FrankHarrell 'This behavior shows up when computing the rank of the information matrix to determine the number of parameters that were actually estimated, with the sine link providing a correct answer much more often than the other link functions that constrain parameters between 0 and 1.' – Mark Miller Aug 26 '14 at 17:46
  • @FrankHarrell He also wrote 'MARK uses (sin(beta)+1)/2, but the original reference used sin(beta)^2. See Box, M. J. 1966. A Comparison of Several Current Optimization Methods, and the use of Transformations in Constrained Problems The Computer Journal 9:67–77.' – Mark Miller Aug 26 '14 at 17:48
1

Instead of trying to implement the SIN link into glm in R I decided to try to write the likelihood function for the logit link and solve with optim. Then I repeated that approach simply substituting in the SIN link. Below is the R code for both. Both give the correct point estimate to three decimal places. The logit link did a better job. I was expecting better performance from the SIN link than what I observed. Perhaps I made an error somewhere.

The code for the logit link is modified from that found here:

Calculate coefficients in a logistic regression with R

I will update my answer if I learn more.

my.data <- read.table(text='
   y  x
   0  1
   0  1
   1  1
   1  1
   1  1
   1  1
   1  1
   1  1
   1  1
   1  1
', header = TRUE)

# create the design matrices
vY = as.matrix(my.data$y)
mX = as.matrix(my.data$x)

logLikelihoodLogit = function(vBeta, mX, vY) {
  return( -sum( vY * log( (1/(1+exp(-(mX %*% vBeta)))) ) + (1-vY) * log(1 - (1/(1+exp(-(mX %*% vBeta)))) )))
}

logLikelihoodSin = function(vBeta, mX, vY) {
  return( -sum( vY * log( ((sin(mX * vBeta) + 1) / 2)  ) + (1-vY) * log(1 - ((sin(mX * vBeta) + 1) / 2 ) ))) 
}

# set initial parameter values
vBeta0 = c(0.5) # arbitrary starting parameters

# minimize the negative log-likelihood for logit link
optimLogit = optim(vBeta0, logLikelihoodLogit, mX = mX, vY = vY, method = 'BFGS', hessian=TRUE)
optimLogit$par
# [1] 1.386305

(1/(1+exp(-(optimLogit$par))))
# [1] 0.8000017

# minimize the negative log-likelihood for SIN link
optimSin   = optim(vBeta0, logLikelihoodSin  , mX = mX, vY = vY, method = 'BFGS', hessian=TRUE)
optimSin$par
# [1] 0.6439062

(sin(optimSin$par) + 1) / 2
# [1] 0.800162

EDIT - August 26, 2014

I forgot that when only estimating one parameter it is better to use method='Brent'.

When I do that I obtain better point estimates. Both links now return a point estimate matching the expected value:

It might be okay to use lower = 0, upper = 1 with optimSin.b.

optimLogit.b = optim(vBeta0, logLikelihoodLogit, mX = mX, vY = vY, method='Brent', lower = -20, upper = 20)
optimLogit.b

(1/(1+exp(-(optimLogit.b$par))))
# [1] 0.8

optimSin.b = optim(vBeta0, logLikelihoodSin, mX = mX, vY = vY, method='Brent', lower = -20, upper = 20)
optimSin.b

(sin(optimSin.b$par) + 1) / 2
# [1] 0.8
Mark Miller
  • 639
  • 7
  • 12
  • 2
    An example of how to implement a complicated link for a GLM in `R` appears in my answer at http://stats.stackexchange.com/a/64039. It avoids implementing the likelihood explicitly, leaving as much work as possible to code that's already incorporated in `R`. – whuber Aug 26 '14 at 14:40