Questions tagged [synthetic-data]

74 questions
28
votes
2 answers

What are some standard practices for creating synthetic data sets?

As context: When working with a very large data set, I am sometimes asked if we can create a synthetic data set where we "know" the relationship between predictors and the response variable, or relationships among predictors. Over the years, I…
Iterator
23
votes
10 answers

Best term for made-up data?

I'm writing an example and have made up some data. I want it to be clear to the reader this is not real data, but I also don't want to give the impression of malice, since it just serves as an example. There is no (pseudo)random component to this…
Frans Rodenburg
12
votes
1 answer

Creating an Imbalanced Dataset

I would like to test my trained model on an imbalanced dataset. Are there any algorithms available to generate synthetic data from a balanced labelled dataset (spam/non-spam)?
Stuart Peterson
8
votes
2 answers

Approaches for generating synthetic survey data with dependent answers?

I would like to produce synthetic survey data. At the moment I produce independent answers between questions according to an arbitrary discrete distribution, as in this question. I want to randomly and independently generate answers to 2 different…
Vass
5
votes
0 answers

Is it possible to determine if a dataset is real or randomly generated?

I've been tasked with developing regression and classification models for time series data. For each observation I have a continuous target for regression and a discrete target for classification. I've fitted a LASSO model which works stupendously…
rubik
5
votes
2 answers

Synthetic Control Method

I came across this journal article http://www.hks.harvard.edu/fs/aabadie/ccsp.pdf which basically uses the Synthetic Control Method (SCM) to estimate the difference between the impact on a variable when an event happens versus when it does not happen (well at…
hana
4
votes
0 answers

Why do I get a 100% error rate in unsupervised random forest, and how do unsupervised patterns work in the "randomForest" R package?

I tried to use random forest to classify microarray data. Based on the research of L. Breiman and Tao Shi, I constructed a synthetic data set using bootstrap methods (assuming it is a matrix with samples in rows and genes in columns, for each gene in…
ccshao
3
votes
1 answer

Generating non-homogeneous spatial Gaussian data

I want to generate spatial data following a multivariate Gaussian distribution. However, I don't want it to be homogeneous, meaning I don't want the correlation/covariance to be homogeneous. I want it to be heterogeneous. Any suggestions on how to…
user31820
3
votes
1 answer

Synthetic data set generation for binary classification based on paper (interpretation problem)

I'm reading a research paper about fraud detection (unbalanced binary classification) where the authors use synthetic data to evaluate their methods. I want to reproduce their synthetic data, but its description is not entirely clear to me.…
Fredrik
3
votes
1 answer

Does it make sense to use the KL-divergence between joint distributions of synthetic and real data as an evaluation metric?

The KL-divergence is defined as: $D_{KL}(p(x_1) \,\|\, q(x_1)) = \sum_{x_1} p(x_1) \log \Big( \dfrac{p(x_1)}{q(x_1)} \Big)$ I consider the Kullback-Leibler (KL) divergence as a performance metric for data synthesis. Several studies used the KL divergence as a…
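For readers who want to see the definition above in action, here is a minimal sketch of the discrete KL divergence for a single categorical variable. The function name `kl_divergence` and the two example distributions are hypothetical, chosen only for illustration; the conventions for zero probabilities are the usual ones, not something stated in the question.

```python
import math


def kl_divergence(p, q):
    """Return D_KL(p || q) in nats for two discrete distributions.

    p and q are sequences of probabilities over the same outcomes.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # usual convention: 0 * log(0 / q) = 0
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical marginals of one categorical variable x_1 (made-up numbers)
real = [0.5, 0.3, 0.2]
synthetic = [0.4, 0.4, 0.2]

print(kl_divergence(real, synthetic))
print(kl_divergence(synthetic, real))
```

The two printed values differ, which is a reminder that the KL divergence is not symmetric, so the direction (real versus synthetic) has to be chosen deliberately when it is used as a metric.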
3
votes
1 answer

Synthetic time series generation

I have a linear model (with seasonal dummy variables) that produces monthly forecasts. I'm using R together with the 'forecast' package: require(forecast) model = tslm(waterflow ~ rainfall + season, data = model.df, lambda = lambda) forec =…
Fernando
3
votes
0 answers

Is it possible to use a time series to construct a synthetic control variable?

Imagine I want to reconstruct a counterfactual euro-dollar exchange rate in 2010 using a synthetic control variable to assess the impact of some policy. Could I use, for example, the exchange rates from 2000 to 2009 as a sample to create my SC? More…
3
votes
1 answer

Create synthetic data with a given intraclass correlation coefficient (ICC)?

I want to generate some synthetic data with $I$ observations across $J$ clusters. Additionally, I want the intraclass correlation coefficient ($ICC$) to be an input of my data generation process. So, in the end, I want to end up with a data frame…
Ignacio
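One common way to approach the ICC question above is a random-intercept model: draw one effect per cluster with variance equal to the ICC and add individual noise with variance one minus the ICC, so that the between-cluster share of the total variance is the ICC. The sketch below is only an illustration under those assumptions (balanced clusters, normal errors, total variance fixed at 1); the function names `simulate_clustered` and `anova_icc` are hypothetical and are not taken from the question.

```python
import random


def simulate_clustered(n_clusters, n_per_cluster, icc, seed=0):
    """Simulate rows (cluster_id, y) with total variance 1 and the given ICC.

    Assumes a balanced design and a normal random-intercept model.
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie in [0, 1]")
    rng = random.Random(seed)
    sd_between = icc ** 0.5          # cluster-level standard deviation
    sd_within = (1.0 - icc) ** 0.5   # individual-level standard deviation
    rows = []
    for j in range(n_clusters):
        u_j = rng.gauss(0.0, sd_between)  # shared by everyone in cluster j
        for _ in range(n_per_cluster):
            rows.append((j, u_j + rng.gauss(0.0, sd_within)))
    return rows


def anova_icc(rows, n_per_cluster):
    """One-way ANOVA estimate of the ICC for balanced clusters."""
    groups = {}
    for j, y in rows:
        groups.setdefault(j, []).append(y)
    k = len(groups)
    grand = sum(y for _, y in rows) / len(rows)
    means = {j: sum(ys) / len(ys) for j, ys in groups.items()}
    msb = n_per_cluster * sum((m - grand) ** 2 for m in means.values()) / (k - 1)
    msw = sum((y - means[j]) ** 2 for j, y in rows) / (k * (n_per_cluster - 1))
    return (msb - msw) / (msb + (n_per_cluster - 1) * msw)


rows = simulate_clustered(n_clusters=500, n_per_cluster=20, icc=0.3, seed=1)
print(anova_icc(rows, 20))  # should land near the requested 0.3
```

The estimate only approaches the requested value as the number of clusters grows, so a small simulated data set can show a noticeably different sample ICC.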
3
votes
0 answers

How to generate synthetic data with a given $R_{x,y}^2$

I would like to generate some data with the following relationships: $ y = x\beta + T\delta + \varepsilon $ $ R_{x,y}^2 = a $, where $a$ is a number that I can choose when generating the data $ \delta = b*\sigma_y $, where $b$ is a number that I…
Ignacio
3
votes
1 answer

SMOTE using unbalanced package in R fails on simple simulated data

SMOTE is a popular method to generate synthetic examples of the minority class in an unbalanced-class data set. I am trying out SMOTE in the "unbalanced" package in R. I am generating simple simulated data, but SMOTE seems to fail on it. Not sure…
Krrr
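As background for the SMOTE question above, the core idea can be sketched in a few lines: pick a minority point, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. This is a simplified illustration only, not the implementation in the "unbalanced" R package; the function name `smote_sketch`, its parameters, and the toy points are all hypothetical, and the base point is chosen at random rather than generating a fixed number of samples per minority point.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating toward near neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Sort the other minority points by squared Euclidean distance to base
        others = [p for idx, p in enumerate(minority) if idx != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()                  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority class: the four corners of the unit square (made-up data)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=10)
print(len(new_points))
```

Because every synthetic point is a convex combination of two existing minority points, the new points never leave the region spanned by the minority class, which is worth keeping in mind when the simulated minority class is very small or degenerate.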