Questions tagged [synthetic-data]
74 questions
28
votes
2 answers
What are some standard practices for creating synthetic data sets?
As context: When working with a very large data set, I am sometimes asked if we can create a synthetic data set where we "know" the relationship between predictors and the response variable, or relationships among predictors.
Over the years, I…

Iterator
23
votes
10 answers
Best term for made-up data?
I'm writing an example and have made up some data. I want it to be clear to the reader this is not real data, but I also don't want to give the impression of malice, since it just serves as an example.
There is no (pseudo)random component to this…

Frans Rodenburg
12
votes
1 answer
Creating an Imbalanced Dataset
I would like to have my trained model tested on an imbalanced dataset. Are there any algorithms available to generate synthetic data from a balanced, labelled dataset (spam/non-spam)?

Stuart Peterson
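One simple way to get an imbalanced test set from balanced labelled data is to subsample one class. Note that this is subsampling of existing rows, not synthesis of new ones. The sketch below is only an illustration: the function name `make_imbalanced`, the `(text, label)` row layout, and the 5% target fraction are all assumptions, not something stated in the question.

```python
import random


def make_imbalanced(rows, label_of, minority_label, minority_fraction, seed=0):
    """Keep every other-class row and subsample `minority_label` rows so that
    they make up roughly `minority_fraction` of the returned data."""
    if not 0 < minority_fraction < 1:
        raise ValueError("minority_fraction must lie strictly between 0 and 1")
    rng = random.Random(seed)
    minority = [r for r in rows if label_of(r) == minority_label]
    majority = [r for r in rows if label_of(r) != minority_label]
    # Solve k / (k + len(majority)) = minority_fraction for k.
    k = round(minority_fraction * len(majority) / (1 - minority_fraction))
    k = min(k, len(minority))  # cannot draw more rows than exist
    sample = majority + rng.sample(minority, k)
    rng.shuffle(sample)
    return sample


# Hypothetical balanced data: 500 "spam" and 500 "ham" rows.
balanced = [("msg%d" % i, "spam" if i % 2 else "ham") for i in range(1000)]
imbalanced = make_imbalanced(balanced, lambda r: r[1], "spam", 0.05)
n_spam = sum(1 for r in imbalanced if r[1] == "spam")
print(len(imbalanced), n_spam)
```

Because only existing rows are reused, the class-conditional distributions are unchanged; only the class prior differs from the training data.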
8
votes
2 answers
Approaches for generating synthetic survey data with dependent answers?
I would like to produce synthetic survey data. At the moment I produce independent answers between questions according to an arbitrary discrete distribution as in this question.
I want to randomly and independently generate answers to 2 different…

Vass
5
votes
0 answers
Is it possible to determine if a dataset is real or randomly generated?
I've been tasked with developing regression and classification models for time series data. For each observation I have a continuous target for regression and a discrete target for classification.
I've fitted a LASSO model which works stupendously…

rubik
5
votes
2 answers
Synthetic Control Method
I came across this journal article, http://www.hks.harvard.edu/fs/aabadie/ccsp.pdf, which uses the Synthetic Control Method (SCM) to estimate the difference between the impact on a variable when an event happens versus when it does not happen (well at…

hana
4
votes
0 answers
Why do I get 100% error rate in unsupervised random forest, and how do unsupervised patterns work in "randomForest" R package
I tried to use random forest to classify microarray data. Based on the research of L. Breiman and Tao Shi, I constructed a synthetic data set using bootstrap methods (assuming a matrix with samples in rows and genes in columns, for each gene in…

ccshao
3
votes
1 answer
Generating non-homogeneous spatial Gaussian data
I want to generate spatial data following a multivariate Gaussian distribution.
However, I don't want it to be homogeneous, meaning I don't want the correlation/covariance to be homogeneous. I want it to be heterogeneous.
Any suggestions on how to…

user31820
3
votes
1 answer
Synthetic data set generation for binary classification based on paper (interpretation problem)
I'm reading a research paper about fraud detection (unbalanced binary classification) where the authors use synthetic data to evaluate their methods. I want to reproduce their synthetic data, but its description is not entirely clear to me.…

Fredrik
3
votes
1 answer
Does it make sense to use the KL-divergence between joint distributions of synthetic and real data as an evaluation metric?
The KL-divergence is defined as:
$D_{KL}(p(x_1)∥q(x_1))=\sum p(x_1)\, \log \Big( \dfrac{p(x_1)}{q(x_1)} \Big)$
I consider the Kullback-Leibler (KL) divergence as a performance metric for data synthesis.
Several studies used the KL divergence as a…

Eui-Jin Kim
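The KL-divergence formula quoted in the question above can be evaluated on empirical frequencies. The following is a minimal sketch for discrete data, not a recommendation of the metric. The helper names, the toy samples, and the additive smoothing constant `alpha` (used so that no category has probability zero under $q$) are assumptions. For a joint distribution, each sample could be a tuple of values instead of a single value.

```python
import math
from collections import Counter


def empirical_pmf(samples, support, alpha=1.0):
    """Relative frequencies over `support`, with additive smoothing `alpha`
    (an assumption of this sketch) so that every probability is positive."""
    counts = Counter(samples)
    total = len(samples) + alpha * len(support)
    return {v: (counts[v] + alpha) / total for v in support}


def kl_divergence(p, q):
    """D_KL(p || q) = sum over x of p(x) * log(p(x) / q(x)), in nats."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p if p[v] > 0)


# Hypothetical "real" and "synthetic" samples of one categorical variable.
real = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
synthetic = ["a"] * 40 + ["b"] * 40 + ["c"] * 20
support = sorted(set(real) | set(synthetic))

p = empirical_pmf(real, support)
q = empirical_pmf(synthetic, support)
print(kl_divergence(p, q), kl_divergence(q, p))
```

The two printed values differ because the KL divergence is not symmetric, which is one thing to keep in mind when using it to compare real and synthetic data.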
3
votes
1 answer
Synthetic time series generation
I have a linear model (with seasonal dummy variables) that produces monthly
forecasts. I'm using R together with the 'forecast' package:
require(forecast)
model = tslm(waterflow ~ rainfall + season, data = model.df, lambda = lambda)
forec =…

Fernando
3
votes
0 answers
Is it possible to use a time series to construct a synthetic control variable?
Imagine I want to reconstruct a counterfactual euro-dollar exchange rate in 2010 using a synthetic control variable to assess the impact of some policy. Could I use, for example, the exchange rates from 2000 to 2009 as a sample to create my SC?
More…

raving-bandit
3
votes
1 answer
Create synthetic data with a given intraclass correlation coefficient (ICC)?
I want to generate some synthetic data with $I$ observations across $J$ clusters. Additionally, I want the intraclass correlation coefficient ($ICC$) to be an input of my data generation process. So, in the end, I want to end up with a data frame…

Ignacio
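One common way to make the ICC an input is a random-intercept model, $y_{ij} = u_j + e_{ij}$, where the between-cluster variance is $ICC \cdot \sigma^2$ and the within-cluster variance is $(1 - ICC) \cdot \sigma^2$. The sketch below assumes normal errors, a balanced design, and a total variance of 1; the function names and the one-way ANOVA check are illustrative choices rather than anything specified in the question.

```python
import random


def simulate_clustered(n_clusters, n_per_cluster, icc, total_var=1.0, seed=0):
    """Draw (cluster, y) rows from y_ij = u_j + e_ij with a target ICC."""
    if not 0 <= icc <= 1:
        raise ValueError("icc must lie in [0, 1]")
    rng = random.Random(seed)
    sd_between = (icc * total_var) ** 0.5        # sd of cluster effects u_j
    sd_within = ((1 - icc) * total_var) ** 0.5   # sd of residuals e_ij
    rows = []
    for j in range(n_clusters):
        u_j = rng.gauss(0.0, sd_between)
        for _ in range(n_per_cluster):
            rows.append((j, u_j + rng.gauss(0.0, sd_within)))
    return rows


def anova_icc(rows):
    """One-way ANOVA estimate of the ICC; assumes equal cluster sizes."""
    groups = {}
    for j, y in rows:
        groups.setdefault(j, []).append(y)
    k = len(next(iter(groups.values())))  # observations per cluster
    n_groups = len(groups)
    grand_mean = sum(y for _, y in rows) / len(rows)
    means = {j: sum(ys) / len(ys) for j, ys in groups.items()}
    msb = k * sum((m - grand_mean) ** 2 for m in means.values()) / (n_groups - 1)
    msw = sum((y - means[j]) ** 2 for j, y in rows) / (n_groups * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


rows = simulate_clustered(n_clusters=200, n_per_cluster=20, icc=0.3)
estimated_icc = anova_icc(rows)
print(round(estimated_icc, 3))
```

The estimate will not equal the target exactly in any single draw; it only fluctuates around it, more tightly as the number of clusters grows.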
3
votes
0 answers
How to generate synthetic data with a given $R_{x,y}^2$
I would like to generate some data with the following relationships:
$ y = x\beta + T\delta + \varepsilon $
$ R_{x,y}^2 = a $, where $a$ is a number that I can choose when generating the data
$ \delta = b*\sigma_y $, where $b$ is a number that I…

Ignacio
3
votes
1 answer
SMOTE using unbalanced package in R fails on simple simulated data
SMOTE is a popular method to generate synthetic examples of the minority class in an unbalanced-class data set.
I am trying out SMOTE in the "unbalanced" package in R. I am generating a simple simulated data set, but SMOTE seems to fail on it. Not sure…

Krrr
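For readers unfamiliar with the idea, SMOTE creates a synthetic minority example by interpolating between a minority point and one of its nearest minority-class neighbours. The sketch below is a brute-force illustration of that interpolation step only; it is not the API or the behaviour of the R "unbalanced" package, and the function name, the toy points, and the default `k=5` are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate `n_new` points, each on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Brute-force nearest neighbours among the other minority points.
        others = [p for idx, p in enumerate(minority) if idx != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Hypothetical two-dimensional minority class.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = smote_sketch(minority_points, n_new=10, k=3)
print(new_points[:2])
```

Since every new point is a convex combination of two existing minority points, the synthetic examples never leave the region already spanned by the minority class.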