I would like to generate some data with the following relationships:
- $ y = x\beta + T\delta + \varepsilon $
- $ R_{x,y}^2 = a $, where $a$ is a number that I can choose when generating the data
- $ \delta = b*\sigma_y $, where $b$ is a number that I can choose when generating the data
- $T \in {\{0,1\}}$
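
For example, suppose that $a=0.5$ and $b=0.1$. I can generate synthetic data with R like this (in the code, `r` plays the role of $a$, and the variable `a` is the slope on $x$):

    set.seed(9782)
    n <- 1000
    r <- 0.5                              # intended R^2 of y on x
    a <- sqrt((1 - r) / (1 + r))          # slope on x
    x <- rnorm(n = n, mean = 5, sd = 1)
    e <- rnorm(n = n, mean = 0, sd = 1)
    t <- sample(x = c(0, 1), size = n, replace = TRUE)  # random treatment indicator
    y <- a * x + e
    y <- y + 0.1 * sd(y) * t              # add treatment effect b * sd(y), with b = 0.1

My guess is that there is a better way of doing this. Any suggestions?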
$T$ is random, and $\varepsilon$ is normally distributed and centered at 0. I think I have to leave the variance of $\varepsilon$ free to make this work. Similarly, I think I have to leave the value of $\beta$ free. If only one of these parameters has to be free, I would like to be able to choose $\beta$ and leave the variance of $\varepsilon$ free. In my code, the idea is that `r` is the proportion of the variation in $y$ explained by $x$ ($R^2_{x,y}$), but in practice that is not really working. I want $T$ and $x$ to be orthogonal, and I want $R^2_{x,y}$ to be the partial $R^2$ of $x$, that is, the squared partial correlation between $x$ and $y$ given $T$.
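
To make the target concrete, here is a minimal sketch of one way the constraints could be combined, not a definitive solution. It assumes that $T$ is drawn independently of $x$ with $P(T=1)=p$, so the two are orthogonal only in expectation, and that the partial $R^2$ is the share of the variance of $y$ not explained by $T$ that $x$ explains. Under those assumptions, $a = \beta^2\sigma_x^2 / (\beta^2\sigma_x^2 + \sigma_\varepsilon^2)$ gives $\sigma_\varepsilon = |\beta|\,\sigma_x\sqrt{(1-a)/a}$, and $\delta = b\,\sigma_y$ with $\sigma_y^2 = \beta^2\sigma_x^2/a + \delta^2 p(1-p)$ gives $\delta = b\,|\beta|\,\sigma_x / \sqrt{a\,(1 - b^2 p(1-p))}$. The names `simulate_data`, `treat`, `p`, `x_mean` and `x_sd` are made up for this illustration.

```r
# Sketch: simulate y = x * beta + T * delta + eps with a chosen partial R^2 (a)
# and a treatment effect delta = b * sd(y). All names are illustrative.
simulate_data <- function(n, beta, a, b, x_mean = 5, x_sd = 1, p = 0.5) {
  stopifnot(a > 0, a < 1, p > 0, p < 1, beta != 0)
  # Residual sd chosen so that beta^2 var(x) / (beta^2 var(x) + var(eps)) = a
  eps_sd <- abs(beta) * x_sd * sqrt((1 - a) / a)
  # delta solves delta^2 = b^2 * (beta^2 var(x) / a + delta^2 * p * (1 - p))
  delta <- b * abs(beta) * x_sd / sqrt(a * (1 - b^2 * p * (1 - p)))
  x <- rnorm(n, mean = x_mean, sd = x_sd)
  treat <- rbinom(n, size = 1, prob = p)  # drawn independently of x
  eps <- rnorm(n, mean = 0, sd = eps_sd)
  y <- x * beta + treat * delta + eps
  data.frame(y = y, x = x, treat = treat)
}

# Check the two targets on a large sample
set.seed(9782)
d <- simulate_data(n = 1e5, beta = 2, a = 0.5, b = 0.1)
fit_full <- lm(y ~ x + treat, data = d)
fit_treat <- lm(y ~ treat, data = d)
# Partial R^2 of x given treat: relative drop in residual sum of squares
partial_r2 <- 1 - sum(resid(fit_full)^2) / sum(resid(fit_treat)^2)
# Estimated treatment effect in units of sd(y)
b_hat <- unname(coef(fit_full)["treat"]) / sd(d$y)
```

With this parameterization $\beta$, $a$ and $b$ are chosen freely and the variance of $\varepsilon$ and $\delta$ follow from them. In a finite sample `partial_r2` and `b_hat` only approximate $a$ and $b$, since the targets hold for the population quantities.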