
I have some data that look like this:

dist <- c(125,12,26,21,52,123)
exists <- c(0,1,0,0,1,0)
df <- data.frame(dist, exists)  # data.frame rather than cbind, which returns a matrix

Basically, it's a set of observations of whether connections exist at different distances. I'd like to fit some kind of probability function to these data that predicts the likelihood of a connection existing given the distance. However, I just can't figure out what to use! I'm sure this kind of thing is really obvious, but I don't know where to start looking. Does anyone have any ideas?


1 Answer

The most common approach would be binomial logistic regression:

> fit <- glm(exists ~ dist, family=binomial)
> summary(fit)

Call:
glm(formula = exists ~ dist, family = binomial)

Deviance Residuals: 
      1        2        3        4        5        6  
-0.3814   1.0765  -1.1318  -1.1846   1.5117  -0.3907  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.54218    1.43780   0.377    0.706
dist        -0.02501    0.02771  -0.903    0.367

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7.6382  on 5  degrees of freedom
Residual deviance: 6.4266  on 4  degrees of freedom
AIC: 10.427

Number of Fisher Scoring iterations: 5

So the predicted probability of a connection is $e^\eta/(1+e^\eta)$ where $\eta=0.54218-0.02501\times \mbox{dist}$.
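As a minimal sketch of how that formula could be applied in R, the snippet below refits the model on the toy data from the question and computes predicted probabilities two ways: by hand with the inverse logit (`plogis`) and with `predict(..., type = "response")`. The distances in `new_dist` are hypothetical values chosen only for illustration.

```r
# Toy data from the question
dist   <- c(125, 12, 26, 21, 52, 123)
exists <- c(0, 1, 0, 0, 1, 0)

fit <- glm(exists ~ dist, family = binomial)

# Hypothetical distances at which to predict (not part of the original data)
new_dist <- c(10, 50, 100)

# By hand: linear predictor eta, then inverse logit e^eta / (1 + e^eta)
eta      <- coef(fit)[1] + coef(fit)[2] * new_dist
p_manual <- plogis(eta)

# Same calculation via predict(); type = "response" returns probabilities
p_predict <- predict(fit, newdata = data.frame(dist = new_dist),
                     type = "response")

# The two approaches should agree
all.equal(unname(p_manual), unname(p_predict))
```

Because the estimated coefficient for `dist` is negative, the predicted probability falls as distance increases.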

However, from the data you give, there is no convincing evidence that the probability of a connection actually depends on distance (P=0.367). Comparing the Akaike Information Criterion (AIC) of the two models points the same way: the null model with just an intercept has a lower AIC than the logistic regression on distance (10.427), indicating a lower estimated information loss:

> AIC( glm(exists ~ 1, family=binomial) )
[1] 9.63817

So, with this limited amount of data to work with, you would likely be better off ignoring distance when making the prediction.
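One way to make this comparison explicit, sketched here on the same toy data, is to fit both models and compare them side by side. `AIC()` accepts several fitted models at once, and `anova(..., test = "Chisq")` gives a likelihood ratio test of the nested models; the test is an addition to the AIC comparison above rather than something the answer relies on. The name `fit0` is introduced only for this illustration.

```r
# Toy data from the question
dist   <- c(125, 12, 26, 21, 52, 123)
exists <- c(0, 1, 0, 0, 1, 0)

fit0 <- glm(exists ~ 1,    family = binomial)  # null model: intercept only
fit  <- glm(exists ~ dist, family = binomial)  # model with distance

# AIC for both models; the lower value is preferred
AIC(fit0, fit)

# Likelihood ratio test of the nested models
anova(fit0, fit, test = "Chisq")
```

With only six observations, neither comparison has much power, so a non-significant result here says little about what a larger data set would show.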

Gordon Smyth
  • Amazing, thanks for your answer! I just gave some toy data, with my real data I get a highly significant result (<2e-16)! Very helpful, thanks! – unknown Nov 02 '17 at 23:24
  • Hi, just a quick question about this. If I want to add a binary predictor (e.g. whether or not a connection existed at a previous timepoint), how would I go about doing that? glm(formula = exists ~ dist + paststate, family = binomial) always seems to give a p-value of almost 1 for paststate, even though the two are obviously highly correlated. – unknown Nov 06 '17 at 10:00
  • @DomBurns See my answer here: https://stats.stackexchange.com/questions/312137 – Gordon Smyth Nov 06 '17 at 12:05