Does anyone know if there are any packages in common statistical computing software (e.g. R) that can simulate realistic random data containing statistical patterns?
It's quite straightforward to simulate random data that does not contain any statistical patterns, for example:
var_1 = rnorm(100,10,1)
var_2 = rnorm(100,5,5)
var_3 = rnorm(100,1,1)
id = 1:100
response <- c(0,1)
response_var <- as.factor(sample(response, 100, replace=TRUE, prob=c(0.5, 0.5)))
#use the sampled factor (response_var), not the two-element vector c(0,1)
my_data = data.frame(id, var_1, var_2, var_3, response = response_var)
  id     var_1    var_2     var_3 response
1  1 10.008459 4.698752 0.6666546        0
2  2 10.471192 6.710553 2.6666892        1
3  3  9.901345 3.899702 0.6533916        0
4  4 10.343638 5.633607 0.5578606        1
5  5  8.560387 0.662563 0.8842000        0
6  6 10.055957 1.522140 0.8124197        1
But are there any ways to simulate this kind of data so that it contains "statistical patterns"? For example, so that response class "0" is associated with larger values of var_1 and smaller values of var_2 and var_3? Or are there any general methods to simulate clustered data containing groups of statistically similar individuals?
Of course, if you spend enough time, you can figure out how to do this manually by simulating multiple datasets and combining them together - but are there any statistical packages that allow you to do something like this for datasets containing many variables?
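To make the manual approach concrete, here is a minimal sketch of what I mean: simulate each response class separately with its own means, then stack the pieces. The group means and the names `class_0`, `class_1` and `patterned_data` are arbitrary choices for illustration only, and this quickly becomes tedious with many variables or clusters:

```r
# Manual sketch: simulate each response class separately, then combine.
# All means/sds below are made-up illustrative values.
set.seed(42)
n_per_class <- 50

# Class "0": larger var_1, smaller var_2 and var_3
class_0 <- data.frame(
  var_1 = rnorm(n_per_class, mean = 12, sd = 1),
  var_2 = rnorm(n_per_class, mean = 2, sd = 5),
  var_3 = rnorm(n_per_class, mean = 0, sd = 1),
  response = "0"
)

# Class "1": smaller var_1, larger var_2 and var_3
class_1 <- data.frame(
  var_1 = rnorm(n_per_class, mean = 8, sd = 1),
  var_2 = rnorm(n_per_class, mean = 8, sd = 5),
  var_3 = rnorm(n_per_class, mean = 2, sd = 1),
  response = "1"
)

# Stack the two groups, shuffle the rows, and add an id column
patterned_data <- rbind(class_0, class_1)
patterned_data$response <- as.factor(patterned_data$response)
patterned_data <- patterned_data[sample(nrow(patterned_data)), ]
patterned_data$id <- seq_len(nrow(patterned_data))

# Compare the per-class means to confirm the pattern is present
aggregate(cbind(var_1, var_2, var_3) ~ response, data = patterned_data, FUN = mean)
```

This works for two classes and three variables, but every extra variable or cluster needs another hand-written block, which is why I am looking for a package that does this in a more general way.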
Thanks!
Note: As an example, I have included code below that simulates "crescent-shaped" data with random noise (using R) and uses a random forest model to predict it. However, this data is still quite simplistic and doesn't contain the type of statistical patterns/clusters that I want:
#load library
library(RSSL)
library(ggplot2)
library(mlr)
#generate first data
d <- generateCrescentMoon(1000,2,1)
d$c = ifelse(d$Class == "+", "1","0")
d$Class = NULL
#generate second data
c <- sample(0:1, 500, TRUE)
X1 <- runif(500, min=-5, max=0)
X2 <- runif(500, min=-10, max=10)
a = data.frame(X1,X2,c)
a$c = as.factor(a$c)
g = rbind(a,d)
ggplot(g, aes(x=X1, y=X2, color=c, shape=c)) + geom_point()
#mlr
aa = makeClassifTask(data = g, target = "c")
#specify and train machine learning algorithms
learners = list(
makeLearner("classif.svm", kernel = "linear"),
makeLearner("classif.svm", kernel = "polynomial"),
makeLearner("classif.svm", kernel = "radial"),
"classif.rpart",
"classif.randomForest",
"classif.knn"
)
#plot the predictions of the random forest (5th learner) and the rpart decision tree (4th learner)
plotLearnerPrediction(learner = learners[[5]], task = aa)
plotLearnerPrediction(learner = learners[[4]], task = aa)