Generate random data for logistic regression with a categorical independent variable

Question

I am trying to generate a data frame of fake data for exploratory purposes. Specifically, I am trying to produce data with a binary dependent variable (say, failure/success), and a categorical independent variable called 'picture' with 5 levels (pict1, pict2, etc.). I am following the answer provided here, which allows me to successfully generate the data. However, I need each level of 'picture' to occur the same number of times (i.e. 11 repetitions of each level = 55 total observations per subject).

Here is a reproducible example of what has worked so far (code from user: ocram):

library(dummies)

#------ parameters ------
n <- 1000 
beta0 <- 0.07
betaB <- 0.1
betaC <- -0.15
betaD <- -0.03
betaE <- 0.9
#------------------------

#------ initialisation ------
beta0Hat <- rep(NA, 1000)
betaBHat <- rep(NA, 1000)
betaCHat <- rep(NA, 1000)
betaDHat <- rep(NA, 1000)
betaEHat <- rep(NA, 1000)
#----------------------------

#------ simulations ------
for(i in 1:1000)
{
  #data generation
  x <- sample(x=c("pict1","pict2", "pict3", "pict4", "pict5"), 
              size=n, replace=TRUE, prob=rep(1/5, 5))  #(a)
  linpred <- cbind(1, dummy(x)[, -1]) %*% c(beta0, betaB, betaC, betaD, betaE)  #(b)
  pi <- exp(linpred) / (1 + exp(linpred))  #(c)
  y <- rbinom(n=n, size=1, prob=pi)  #(d)
  data <- data.frame(picture=x, choice=y)

  #fit the logistic model
  mod <- glm(choice ~ picture, family="binomial", data=data)

  #save the estimates
  beta0Hat[i] <- mod$coef[1]
  betaBHat[i] <- mod$coef[2]
  betaCHat[i] <- mod$coef[3]
  betaDHat[i] <- mod$coef[4]
  betaEHat[i] <- mod$coef[5]
}

However, as you can see from the output, each level of the factor 'picture' does not occur the same number of times (i.e. 200 times each).

> summary(data)
picture     choice     
pict1:200   Min.   :0.000  
pict2:207   1st Qu.:0.000  
pict3:217   Median :1.000  
pict4:163   Mean   :0.559  
pict5:213   3rd Qu.:1.000  
            Max.   :1.000

Moreover, it is not entirely clear to me how to manipulate the initial beta values as to determine the probability of success/failure for each level of 'picture'. I cannot comment the original question because I do not yet have the necessary reputation points.

I'm still working on this, just in case anybody has any more suggestions. — babylinguist, Sep 25 '14 at 15:28

score 3 · Answer 1 · edited Sep 22 '14 at 18:28

3

If you want 200 copies of each of 5 levels in a polytomous variable in random order, then do this instead:
```
x <- sample(rep(paste0('pict', 1:5), 200))
```

If you want to control for overall prevalence of a specific outcome, then you must choose which beta you will fudge. I usually do beta0.

MM         <- model.matrix(~x)
betas      <- rnorm(4)
prevTarget <- 0.3
prevDiff   <- function(beta0)  prevTarget - 
                               mean(binomial()$linkinv(MM%*%c(beta0, betas)))
beta0      <- uniroot(prevDiff, c(-100, 100))$root
mean(binomial()$linkinv(MM%*%c(beta0, betas)))

edited Sep 22 '14 at 18:28

gung - Reinstate Monica

132,789
81
357
650

answered Sep 22 '14 at 18:12

AdamO

52,330
5
104
209

Could you give a little more detail as to how I can use your answer in conjunction with the code I provided above? If it is not to be used with the code above, how exactly can I generate a data frame from your code? – babylinguist Sep 22 '14 at 18:38
@jvcasill not following your question. There's a lot of replication, e.g. using `model.matrix` instead of `dummy` and `binomial()$linkinv` instead of a by-hand expit. The argument to mean in the last line is the fitted values, whence you can call `rbinom`. May I suggest you run the two coding examples line-by-line and figure out what you're trying to do? – AdamO Sep 22 '14 at 18:48
I am trying to generate a data frame similar to the one the original code generates. However I want an equal amount of observations for each level of the factor 'picture'. I also want to be able to manipulate the probability of success/failure at each level. It is clear to me that your code does something, but I don't understand how to use it for what I am asking. Does that make sense? – babylinguist Sep 22 '14 at 18:49
Append `y = rbinom(1000,1,binomial()$linkinv(MM%*%c(beta0, betas)))`, and `data – AdamO Sep 22 '14 at 19:22
My apologies for being a hassle but your responses are not clear to me and I don't have a reproducible example that solves my question. Could you edit your original answer to include the appropriate changes? – babylinguist Sep 22 '14 at 19:36
1

@jvcasill Please look here. https://github.com/aomidpanah/simulations/blob/master/logregsim.r Given the work you are trying to do, I think it would be good for you to explore this in some depth and better understand how these functions are working. As I suggested, line-by-line analysis and scansion will be a very good way to do this. Also examine the output from `proc.time`. – AdamO Sep 22 '14 at 19:58
Thank you for your time. After taking a quick look, I believe this will be much more useful, as I am able to see what changes you made to the original code and where (append this and append that is rather opaque). I will take a more in depth look this afternoon, but as it stands your code still does not produce the data frame I am after. – babylinguist Sep 22 '14 at 20:08
Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/17363/discussion-between-jvcasill-and-adamo). – babylinguist Sep 22 '14 at 23:01

score 0 · Answer 2 · answered Nov 29 '14 at 16:11

I eventually found an acceptable answer to my question. This may or may not be the best/most sophisticated way to handle this, but it enabled me to get what I was after.

set.seed(1)

intercept1 = -6.0
beta1      = 2.75
xtest1     = rnorm(1300, 1, 3)
linpred1   = intercept1 + xtest1 * beta1
prob1      = exp(linpred1)/(1 + exp(linpred1))
runis1     = runif(1300, 0, 1)
ytest1     = ifelse(runis1 < prob1, 1, 0)

intercept2 = -7.0
beta2      = 2.75
xtest2     = rnorm(1300, 1, 3)
linpred2   = intercept2 + xtest2 * beta2
prob2      = exp(linpred2)/(1 + exp(linpred2))
runis2     = runif(1300, 0, 1)
ytest2     = ifelse(runis2 < prob2, 1, 0)

intercept3 = -7.0
beta3      = 2.75
xtest3     = rnorm(1300, 1, 3)
linpred3   = intercept3 + xtest3 * beta3
prob3      = exp(linpred3)/(1 + exp(linpred3))
runis3     = runif(1300, 0, 1)
ytest3     = ifelse(runis3 < prob3, 1, 0)


newdf1 <- data.frame(prop = ytest1, x = xtest1, week = "w1")
newdf1 <- newdf1[with(newdf1, order(week, x)), ]
newdf1$stim <- rep(seq(-60, 60, by = 10), each = 100)

newdf2 <- data.frame(prop = ytest2, x = xtest2, week = "w3")
newdf2 <- newdf2[with(newdf2, order(week, x)), ]
newdf2$stim <- rep(seq(-60, 60, by = 10), each = 100)

newdf3 <- data.frame(prop = ytest3, x = xtest3, week = "w6")
newdf3 <- newdf3[with(newdf3, order(week, x)), ]
newdf3$stim <- rep(seq(-60, 60, by = 10), each = 100)

temp <- rbind(newdf1, newdf2)
df <- rbind(temp, newdf3)

I am now able to manipulate the data frame, perform logistic regression, and produce plots. Thank you to everybody who provided input.

enter image description here

You don't need the `runif()` & `ifelse()` calls, you can just use `rbinom()`. You may also want to define a new function `lo2p = function(lo){ exp(lo)/(1+exp(lo)) }` for clearer code & to save you some typing. — gung - Reinstate Monica, Nov 30 '14 at 03:01

Generate random data for logistic regression with a categorical independent variable

2 Answers2