
I would like to generate some data with the following relationships:

  1. $ y = x\beta + T\delta + \varepsilon $
  2. $ R_{x,y}^2 = a $, where $a$ is a number that I can choose when generating the data
  3. $ \delta = b*\sigma_y $, where $b$ is a number that I can choose when generating the data
  4. $T \in {\{0,1\}}$
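
One way to read conditions 2 and 3 together, as a sketch only. It assumes (following the clarifications given further down, not the list itself) that $T$ is independent of $x$ and $\varepsilon$, that $R^2_{x,y}$ is the partial $R^2$ of $x$ given $T$, and that $\sigma_y$ is the standard deviation of $y$ within the $T=0$ group. The symbols $\sigma_x$ and $\sigma_\varepsilon$ are introduced here for the standard deviations of $x$ and $\varepsilon$. At the population level:

$$\operatorname{Var}(y \mid T) = \beta^2\sigma_x^2 + \sigma_\varepsilon^2$$

$$R^2_{x,y} = \frac{\beta^2\sigma_x^2}{\beta^2\sigma_x^2 + \sigma_\varepsilon^2} = a \quad\Longrightarrow\quad \sigma_\varepsilon^2 = \beta^2\sigma_x^2\,\frac{1-a}{a}$$

$$\sigma_y = \sqrt{\beta^2\sigma_x^2 + \sigma_\varepsilon^2} = \frac{|\beta|\,\sigma_x}{\sqrt{a}}, \qquad \delta = b\,\sigma_y = \frac{b\,|\beta|\,\sigma_x}{\sqrt{a}}$$

Under these assumptions, choosing $a$, $b$, $\beta$ and $\sigma_x$ pins down both $\sigma_\varepsilon$ and $\delta$.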

For example, suppose that $a=0.5$ and $b=0.1$. I can generate synthetic data with R like this:

    set.seed(9782)
    n <- 1000
    r <- 0.5                        # intended R^2 of y on x
    a <- sqrt((1 - r) / (1 + r))    # slope on x (not the target $a$ above)

    x <- rnorm(n = n, mean = 5, sd = 1)
    e <- rnorm(n = n, mean = 0, sd = 1)
    t <- sample(x = c(0, 1), size = n, replace = TRUE)

    y <- a * x + e
    y <- y + 0.1 * sd(y) * t        # effect of T: b = 0.1 times sd(y)

but my guess is that there is a better way of doing this. Any suggestions?

$T$ is random, and $\varepsilon$ is normally distributed with mean 0. I think I have to leave the variance of $\varepsilon$ free to make this work, and possibly the value of $\beta$ as well. If only one of these parameters has to be free, I would like to choose $\beta$ and leave the variance of $\varepsilon$ free. In my code, `r` is meant to be the proportion of the variation in $y$ explained by $x$ ($R^2_{x,y}$), but in practice that is not working. I want $T$ and $x$ to be orthogonal, and I want $R^2_{x,y}$ to be based on the partial correlation.
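
A minimal sketch of one possible approach, not a definitive answer. It assumes that $R^2_{x,y}$ means the partial $R^2$ of $x$ given $T$, that $\sigma_y$ is the standard deviation of $y$ in the $T=0$ group, and that the targets only need to hold in the population (sample values will be close but not exact). Under those assumptions the error standard deviation is $|\beta|\,\sigma_x\sqrt{(1-a)/a}$ and $\delta = b\,|\beta|\,\sigma_x/\sqrt{a}$. The function `gen_data` and its argument names are hypothetical, and the treatment is called `treat` to avoid masking R's `t()`.

```r
# Sketch: simulate y = x*beta + T*delta + eps with a target partial R^2 for x.
# gen_data and its arguments are illustrative names, not from the question.
gen_data <- function(n, a, b, beta = 1, mean_x = 5, sd_x = 1) {
  stopifnot(a > 0, a < 1, beta != 0)
  x     <- rnorm(n, mean = mean_x, sd = sd_x)
  treat <- rbinom(n, size = 1, prob = 0.5)   # T, drawn independently of x
  # Error SD chosen so that Var(x*beta) / (Var(x*beta) + Var(eps)) = a
  sd_eps <- abs(beta) * sd_x * sqrt((1 - a) / a)
  eps    <- rnorm(n, mean = 0, sd = sd_eps)
  # Population SD of y within the T = 0 group, then delta = b * sd_y0
  sd_y0 <- abs(beta) * sd_x / sqrt(a)
  delta <- b * sd_y0
  y <- beta * x + delta * treat + eps
  data.frame(y = y, x = x, treat = treat)
}

set.seed(9782)
d <- gen_data(n = 1e5, a = 0.5, b = 0.1)

# Partial R^2 of x given treat: share of the residual sum of squares of
# y ~ treat that is removed by adding x to the model
rss_reduced <- sum(resid(lm(y ~ treat, data = d))^2)
fit         <- lm(y ~ treat + x, data = d)
rss_full    <- sum(resid(fit)^2)
partial_r2  <- 1 - rss_full / rss_reduced

# Estimated treatment effect in units of sd(y) for the T = 0 group
delta_ratio <- unname(coef(fit)["treat"]) / sd(d$y[d$treat == 0])
```

With a large `n`, `partial_r2` should land near `a` and `delta_ratio` near `b`. By contrast, the slope `sqrt((1 - r)/(1 + r))` with unit-variance `x` and `e` gives a population $R^2$ of $(1-r)/2$, which is 0.25 rather than 0.5 when `r <- 0.5`.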

whuber
Ignacio
  • Could you please explain your notation? It is unclear which variables are fixed, which are user inputs, which ones are functions of others, which are intended to be random, how they might be interdependent, or what their distributions might be. What exactly do you mean by $R^2_{x,y}$? – whuber Dec 06 '16 at 16:36
  • $R^2_{x,y}$ is the proportion of the variation of the outcome explained by $x$. The only inputs are $a$, that is $R^2_{x,y}$, and $b$, that is the effect of $T$ on $y$ as a proportion of the standard deviation of $y$ for $T=0$. – Ignacio Dec 06 '16 at 17:01
  • Since the code doesn't appear to work correctly, it's difficult to rely on it to interpret what you want. Is $T$ intended to be random or an input parameter? What will you assume about the distribution of $\varepsilon$? What will determine the value of $\beta$? What values should $x$ take on? What is the purpose of `r` in the code? Your comment appears implicitly to assume that $T$ will be orthogonal to $x$; must that indeed be the case? If not, then is $R^2_{x,y}$ based on the *univariate* correlation between $x$ and $y$ or is it a *partial* correlation? – whuber Dec 06 '16 at 18:03
  • $T$ is random, and $\varepsilon$ is normally distributed with mean 0. I think I have to leave the variance of $\varepsilon$ free to make this work, and possibly the value of $\beta$ as well. If only one of these parameters has to be free, I would like to choose $\beta$ and leave the variance of $\varepsilon$ free. In my code, `r` is meant to be the proportion of the variation in $y$ explained by $x$ ($R^2_{x,y}$), but in practice that is not working. Yes, I want $T$ and $x$ to be orthogonal. I want $R^2_{x,y}$ to be based on the partial correlation. Thanks! – Ignacio Dec 06 '16 at 18:10
