
I have data that looks like the table below. I would like to create a model that can answer the question: "If I have a data point with distance x and y people signed up, how many should I expect to check in?" (subject to all the usual hedges like not extrapolating or being too confident in the results, of course).

    Signed Up   Checked In   Yield Rate   Distance (km)
          274          171       62.41%       0
          241           44       18.26%     475.9156416
          132           22       16.67%     342.732219
          123           53       43.09%     457.3099693
          116           20       17.24%     833.4106358
           41           20       48.78%      51.19124239
            1            0        0.00%    2833.297793
            1            0        0.00%     388.5309437
            1            0        0.00%    1069.432695
            1            1      100.00%     929.646838
            1            0        0.00%    1103.6347

Note that yield rate is just (Checked In) / (Signed Up). I tried a basic linear correlation, but for pretty obvious reasons that won't work. What should I do? I've heard of pretty much all the big technologies (R, Python, TensorFlow), but I have very little experience in this space. I'm open to learning, though!

Sorry about the poor tagging: I'm so lost with this problem that I'm not even sure what type of problem I'm trying to solve.

From your data, it looks like you want to predict [Yield] from [Signed Up] and [Distance]. You'll want to use a Logistic Regression model to do that, rather than linear regression. – Tim Apr 05 '17 at 02:16

I'm confused by your answer. Correct me if I'm wrong, but logistic regression requires categorical variables, and I don't have those. – Ben Cooper Apr 05 '17 at 07:25

If you want to give the _proportion_ of sign-ups that actually check in, you might want to use something similar to [this approach](http://stats.stackexchange.com/questions/89999/how-to-replicate-statas-robust-binomial-glm-for-proportion-data-in-r). – Tim Apr 05 '17 at 13:29

1 Answer


You should start by plotting your data, obtaining:

[Figure: yield rate plotted against distance, with plotting symbols sized by the square root of the number signed up]
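Such a plot can be drawn in base R. This is a minimal sketch, not necessarily the exact code behind the figure, assuming the data sits in a data frame `dat` with the column names listed at the end of this answer:

    ## Yield rate against distance; symbol size keyed to sqrt(signed up).
    ## Assumes dat has columns N (signed up), x (checked in) and dist (km).
    plot(x / N ~ dist, data = dat,
         cex  = sqrt(N) / 3,   # plotting-symbol size proportional to sqrt(N)
         pch  = 19,
         xlab = "Distance (km)",
         ylab = "Yield rate (checked in / signed up)")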

The size of the plotting symbols is proportional to the square root of the number signed up. We can safely disregard the smallest points, as they correspond to a sample size of 1. Apart from these, there is a clear falling tendency with distance. A logistic regression model can be used:

    ## Binomial GLM: the response is (successes, failures) = (checked in, not checked in)
    mod1 <- glm(cbind(x, N - x) ~ dist, family = binomial, data = dat)
    summary(mod1)

Call:
glm(formula = cbind(x, N - x) ~ dist, family = binomial, data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.3429  -0.7503  -0.3305   1.8544   3.9497  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.3005630  0.1093824   2.748    0.006 ** 
dist        -0.0028817  0.0002891  -9.969   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 170.235  on 10  degrees of freedom
Residual deviance:  54.917  on  9  degrees of freedom
AIC: 89.177

Number of Fisher Scoring iterations: 4

The residual deviance here is large relative to its degrees of freedom (54.9 on 9 df), which indicates overdispersion, so you should maybe replace the binomial family with quasibinomial, as sketched below. If your real data set is much larger, you could also represent dist with a spline function.
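A minimal sketch of that quasibinomial variant; the mean model is unchanged, so the coefficient estimates stay the same while the standard errors are scaled by the estimated dispersion:

    ## Quasibinomial fit: dispersion is estimated rather than fixed at 1
    mod2 <- glm(cbind(x, N - x) ~ dist, family = quasibinomial, data = dat)
    summary(mod2)   # same estimates as mod1, inflated standard errors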

For brevity, I replaced the names in your data table with

    names(dat)
    [1] "N"    "x"    "perc" "dist"
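To get back to your original question ("with distance x and y people signed up, how many should I expect to check in?"), multiply the predicted check-in probability by the number signed up. A sketch with hypothetical values (50 sign-ups at 300 km):

    ## Expected check-ins for a hypothetical event: 50 sign-ups, 300 km away
    p_hat <- predict(mod1, newdata = data.frame(dist = 300), type = "response")
    50 * p_hat    # expected number of check-ins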