Does anyone know if there are any packages in common statistical computing software (e.g. R) that can simulate realistic random data with statistical patterns?

It's quite straightforward to simulate random data that does not contain any statistical patterns, for example:

var_1 = rnorm(100,10,1)
var_2 = rnorm(100,5,5)
var_3 = rnorm(100,1,1)

id = 1:100

response <- c(0,1)

response_var <- as.factor(sample(response, 100, replace=TRUE, prob=c(0.5, 0.5)))

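# use the sampled factor; the bare `response` vector c(0,1) would only be recycled as 0,1,0,1,...
my_data = data.frame(id, var_1, var_2, var_3, response = response_var)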

       id     var_1    var_2     var_3 response
1  1 10.008459 4.698752 0.6666546        0
2  2 10.471192 6.710553 2.6666892        1
3  3  9.901345 3.899702 0.6533916        0
4  4 10.343638 5.633607 0.5578606        1
5  5  8.560387 0.662563 0.8842000        0
6  6 10.055957 1.522140 0.8124197        1

But are there any ways to simulate this kind of data so that there are "statistical patterns"? For example, response class "0" being more associated with larger values of var_1 and smaller values of var_2 and var_3? Or are there any general methods to simulate clustered data containing groups of statistically similar individuals?

Of course, if you spend enough time, you can figure out how to do this manually by simulating multiple datasets and combining them together - but are there any statistical packages that allow you to do something like this for datasets containing many variables?

Thanks!

Note: As an example, I included an example of "crescent shaped" data being simulated with random noise (using R) and a random forest model being used to predict this data - but the data itself is still quite simplistic and doesn't contain the type of statistical patterns/clusters that I want:

#load library
library(RSSL)
library(ggplot2)
library(mlr)

#generate first data
d <- generateCrescentMoon(1000,2,1)
d$c = ifelse(d$Class == "+", "1","0")
d$Class = NULL

#generate second data
c <- sample(0:1, 500, TRUE)

X1 <- runif(500, min=-5, max=0)
X2 <- runif(500, min=-10, max=10)

a = data.frame(X1,X2,c)
a$c = as.factor(a$c)

g = rbind(a,d)

ggplot(g, aes(x=X1, y=X2, color=c, shape=c)) +  geom_point()


#mlr

aa = makeClassifTask(data = g, target = "c")

#specify and train machine learning algorithms
learners = list(
    makeLearner("classif.svm", kernel = "linear"),
    makeLearner("classif.svm", kernel = "polynomial"),
    makeLearner("classif.svm", kernel = "radial"),
    "classif.rpart",
    "classif.randomForest",
    "classif.knn"
)

plotLearnerPrediction(learner = learners[[5]], task = aa)
plotLearnerPrediction(learner = learners[[4]], task = aa)


stats_noob
  • What kind of statistical clusters/patterns **do** you want to simulate? – Alexis Dec 26 '21 at 04:50
  • Related: [Datasets constructed for a purpose similar to that of Anscombe's quartet](https://stats.stackexchange.com/q/80196/1352), especially [gung's answer](https://stats.stackexchange.com/a/80198/1352). – Stephan Kolassa Dec 26 '21 at 07:10
  • Check my answer [here](https://stats.stackexchange.com/a/558172/11852), some of the application is exactly for medical records data. – usεr11852 Dec 26 '21 at 11:01

1 Answer

Since "statistical patterns" is an infinitely broad category, this is an overly broad question. In terms of the available software packages, this depends a great deal on what statistical model you wish to simulate from. Here are a few standard ones you might be interested in, and how to implement them in R.


Continuous data

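  • Multivariate Gaussian: To generate data from this distribution (with a specified mean vector and covariance matrix) you can use the rmvnorm function in the mvtnorm package. If you start from a correlation matrix, scale it by the standard deviations first to obtain the covariance matrix.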

  • Regression models: If you have already generated the explanatory variables for a regression model (e.g., with the multivariate Gaussian distribution), you can then generate the response variable directly from the model equation: compute the linear predictor, then either add a randomly generated error term (Gaussian regression) or draw the response from its conditional distribution (logistic regression and other GLMs).

  • Time-series models: To generate data from the stationary Gaussian ARMA model (with some specified parameters) you can use the rGARMA function in the ts.extend package (see also O'Neill 2021).
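
As a rough illustration of the first two bullets applied to the kind of data in the question, here is a minimal sketch. The means and standard deviations are borrowed from the question; the correlation matrix, the regression coefficients and the helper name draw_mvn are made up for the example. It uses mvtnorm::rmvnorm when that package is installed and otherwise falls back to a base-R Cholesky construction. The last line uses arima.sim from the stats package as a stand-in for simulating an ARMA series; it is not the rGARMA function mentioned above.

```r
set.seed(123)
n <- 1000

# Hypothetical targets: means/SDs from the question, correlations invented here
mu   <- c(var_1 = 10, var_2 = 5, var_3 = 1)
sds  <- c(1, 5, 1)
corr <- matrix(c(1.0, 0.3, 0.2,
                 0.3, 1.0, 0.5,
                 0.2, 0.5, 1.0), nrow = 3, byrow = TRUE)
sigma <- diag(sds) %*% corr %*% diag(sds)  # correlation -> covariance

# Hypothetical helper: correlated Gaussian draws
draw_mvn <- function(n, mu_vec, sigma) {
  if (requireNamespace("mvtnorm", quietly = TRUE)) {
    mvtnorm::rmvnorm(n, mean = mu_vec, sigma = sigma)
  } else {
    # Base-R fallback: if Z has identity covariance, Z %*% chol(sigma) has covariance sigma
    z <- matrix(rnorm(n * length(mu_vec)), nrow = n)
    sweep(z %*% chol(sigma), 2, mu_vec, "+")
  }
}

X <- draw_mvn(n, mu, sigma)
colnames(X) <- names(mu)

# Logistic model (invented coefficients): class 1 becomes less likely as var_1
# grows and more likely as var_2 and var_3 grow, so class 0 goes with
# larger var_1 and smaller var_2, var_3
eta <- -1.5 * (X[, "var_1"] - 10) +
        0.2 * (X[, "var_2"] - 5) +
        1.0 * (X[, "var_3"] - 1)
p <- plogis(eta)                              # inverse logit
response <- rbinom(n, size = 1, prob = p)     # draw from the conditional distribution

my_data <- data.frame(id = seq_len(n), X, response = factor(response))

# Sanity check: refitting the model should give slopes near -1.5, 0.2 and 1.0
fit <- glm(response ~ var_1 + var_2 + var_3, data = my_data, family = binomial)
round(coef(fit), 2)

# Stationary ARMA(1,1) series with invented parameters (base-R stand-in)
ar_ma_series <- arima.sim(model = list(ar = 0.6, ma = 0.3), n = 200)
```

Changing the coefficients in eta controls how strongly each variable is tied to the response; adding interaction terms or group-specific intercepts to eta is one way to make particular "cohorts" more homogeneous.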


Discrete data

  • Simple-random-sampling: To generate data using simple-random-sampling from a specified population of values you can use the sample function in the base package. This function can accommodate simple-random-sampling with or without replacement.

  • Balls-in-bins model: To generate data from the extended balls-in-bins model you can use the sample.ballbin function in the occupancy package.
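
A short sketch of the sample function from the first bullet, covering only base R (the occupancy package is not shown). The population of blood types and the probabilities are hypothetical values chosen for illustration.

```r
set.seed(42)

# Hypothetical population of categories
blood_types <- c("A", "B", "AB", "O")

# Without replacement: each value can be drawn at most once
draw_without <- sample(blood_types, size = 3, replace = FALSE)

# With replacement and unequal (made-up) probabilities
draw_with <- sample(blood_types, size = 100, replace = TRUE,
                    prob = c(0.40, 0.11, 0.04, 0.45))

table(draw_with)  # observed counts per category
```

A categorical column drawn this way can be bound to continuous variables in a data frame to build mixed-type records.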

Ben
  • @Ben: Thank you so much for your answer! I am trying to think how to explain... I want to simulate data that could have come from medical patients: each row corresponds to a medical patient, and each column corresponds to a measurement from that patient (e.g. height, age, weight, blood type, etc.). If possible, there should be some response variable (e.g. disease/no disease), and the response should be "more homogeneous" for certain "cohorts" (based on combinations of variables). Is this possible to simulate? Thank you so much! – stats_noob Dec 26 '21 at 04:53
  • I recommend you ask a new question with all the specifics of what you're trying to create. – Ben Dec 26 '21 at 06:20
  • No copulae? (I thought it would be another standard option; +1 anyway) – usεr11852 Dec 26 '21 at 10:58
  • @usεr11852: If you know the relevant `R` packages, you have my blessing to edit this answer to add simulation from copula (or any other relevant models). – Ben Dec 27 '21 at 00:06
  • Oh sweet! I know the `copula::mvdc` functionality but I am not an expert (still learning copula basics). Maybe in a few days. (unless you do it first) – usεr11852 Dec 27 '21 at 00:15
  • @usεr11852: I know even less than that, so I'll leave it to you. Edit away my man! – Ben Dec 27 '21 at 00:16
  • @Ben: As requested, here is a new question: https://stats.stackexchange.com/questions/558408/simulating-conditional-responses Thanks! – stats_noob Dec 27 '21 at 05:57