Questions tagged [data-generating-process]
18 questions
18
votes
1 answer
Why do we use term “population” instead of “Data-generating process”?
I have always been confused about the use of the term “population” in statistics. In my first statistics course I was taught that we need a sample, because surveying the whole population is too costly. So there is the whole population and there is…

Moysey Abramowitz
- 284
- 1
- 7
14
votes
2 answers
What if there is no true data-generating process?
I've been engaging in a number of forecasting efforts recently, and have rediscovered a well-known truth: That combinations of different forecasts are generally better than the forecasts themselves. In particular, the unweighted mean of forecasts is…

andrewH
- 2,587
- 14
- 27
7
votes
1 answer
What's the DGP in causal inference?
This question comes from this discussion (Under which assumptions a regression can be interpreted causally?). That discussion touches on too many arguments, and it is not possible to address everything there. So I pose the question here, and I give my answer too.

markowitz
- 3,964
- 1
- 13
- 28
5
votes
1 answer
Population vs. Data-Generating Process
A lot of elementary statistical and econometric literature is based on so-called "population models". An example is the econometrics handbook "Introductory Econometrics: A Modern Approach" by J. M. Wooldridge, or the very influential paper of P. W. Holland…

cure
- 1,666
- 1
- 7
- 19
2
votes
0 answers
Gonzalo's cointegration DGP
I am trying to reprogram the data generating process from Haug (1996) which follows this equation:
Parameter a1 can take values (0,1); a2 is fixed at value 1, and their purpose is to set exogeneity or endogeneity. How can I implement the second…

totnan
- 29
- 3
2
votes
0 answers
Statistical model of a multi-class classifier vs. binary classifiers
Say, a balanced training set containing images that depict either a cat, dog, horse, or panda is given. One trains a machine learning model (e.g., a neural network) to classify whether an image depicts a cat. Then, one wants a model that…

Ronquam
- 31
- 1
2
votes
1 answer
How to calculate expected Average Treatment Effect on the Treated (ATT) from a data generating process?
I'm running comparisons of different counterfactual modeling methodologies (exact matching, propensity score matching, regression, etc.) on simulated data in order to see which methods produce the most precise estimates of the "true" population…

RobertF
- 4,380
- 6
- 29
- 46
1
vote
1 answer
A Data Generating Process Implying Homogeneous Individual Treatment Effects
I want to find a data generating process implying homogeneous individual treatment effects.
Specifically, consider two potential outcomes $y_i^1$ and $y_i^0$.
The first one is the individual $i$'s (potential) outcome if she took a treatment.
and the…

QWEQWE
- 471
- 1
- 9
1
vote
0 answers
Generating realistic mock data for a real-world scenario
I'm a final-year math student doing a project for a company. They want to do a cup-and-grid water drop test with a firefighting helicopter. The basic idea is: a grid of 1000 cups is set up on an airfield, small groups of 9 cups are placed strategically…

kr1s
- 11
- 2
1
vote
0 answers
Change of the data generating process (DGP) vs. occurrence of unseen observations
Say, one trains a machine learning model to classify emails as spam or normal. Then, the adversary (or the collection of all adversaries) represents the data generating process (DGP) that generates the distribution of spam. When the adversary…

Ronquam
- 31
- 1
1
vote
0 answers
Generate data that matches a frequency distribution while preserving the original spatial structure
I am dealing with a 3D array containing values representing the "importance" of each voxel. For my analysis, I would like to synthesize n new arrays from my original array to have a comparison condition. The values in the 3D array can be clustered…

Johannes Wiesner
- 141
- 5
1
vote
0 answers
Interpretation of intercept term in ECM
Suppose two $I(1)$ series $x_t, y_t$ are cointegrated. Therefore $\mu_t$ in the following equation is stationary:
\begin{align}
y_t = \beta_0 + \beta_1x_t + \mu_t \tag{1}
\end{align}
Now consider the ECM representation:
\begin{align}
\Delta y_t =…

Dayne
- 2,113
- 1
- 7
- 24
1
vote
1 answer
Econometrics meaning of structural versus regression model
I want to make sure my understanding is correct. Particularly in econometrics, when authors write down a model:
$Y_i = \beta_0 + \beta_1 X_i + \epsilon$
Can I think of this as a 'structural model'- or a linear approximation to the true underlying…

Steve
- 385
- 3
- 10
1
vote
0 answers
Generate data for significance testing
I want to generate a data set with a pre-specified significance level.
Let's say we have 2 covariates x1, x2, and an outcome variable y.
We fit a linear regression model as follows:
create_data <- function(beta_1 = 0.01,
…

Rasel Biswas
- 11
- 4
0
votes
0 answers
Synthetic Dataset Generation for (Hierarchical) Reinforcement Learning
I've been looking at creating an implementation of the MAXQ framework for offline batch hierarchical RL and am in search of a data generator for reinforcement learning. I've seen scikit's (and similar) data generation methods but they don't seem…

Scorks
- 1
- 1