
I have a dataset with repeated measures, in which for each unit $i$ we observe multiple observations of the same variable, $y_{ij}$, and multiple $x_{ik}$, but we don't know which $y_{ij}$ observations correspond to which $x_{ik}$ observations (i.e. the pairing between $j$ and $k$ is unknown). In fact, I usually observe 3 $y_{ij}$ and 10 $x_{ik}$ for each $i=1 \ldots N$, $N \approx 1000$. Written differently, my data look like $\{ \{y_{1a}, y_{1b}, y_{1c}; x_{1,1}, \ldots x_{1,10}\}, \ldots, \{y_{Na}, y_{Nb}, y_{Nc}; x_{N,1}, \ldots x_{N,10}\}\}$.

The goal of the analysis is to run a regression of $y$ on $x$. My assumption is that the true model (were the pairing observed) is $y_{ij}=\alpha + \beta x_{ij}+\epsilon_{ij}$. I can't run this regression since I don't know the pairing between $j$ and $k$. For now I average the observations for each unit and regress $y_{i\cdot}\sim x_{i\cdot}$. I believe this leads to an unbiased/consistent estimator only if $K\to \infty$ and $J\to \infty$, which is not my case (for me $J=3$, $K=10$).
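For concreteness, here is a minimal sketch of what I currently do, on simulated data (the data-generating process and all names below are purely illustrative, not my actual data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
N, J, K = 1000, 3, 10
alpha, beta = 1.0, 2.0  # illustrative "true" parameters

# Each unit i has a latent level mu_i; its K x-observations scatter around it.
mu = rng.normal(size=N)
x = mu[:, None] + rng.normal(size=(N, K))                        # shape (N, K)

# The J y's are generated from J of the K x's, but the pairing is then discarded,
# mimicking the fact that the j-k correspondence is unobserved.
idx = np.array([rng.choice(K, size=J, replace=False) for _ in range(N)])
x_paired = np.take_along_axis(x, idx, axis=1)                    # shape (N, J)
y = alpha + beta * x_paired + rng.normal(scale=0.5, size=(N, J))

# Current approach: average within each unit and regress the unit means.
xbar, ybar = x.mean(axis=1), y.mean(axis=1)
print(sm.OLS(ybar, sm.add_constant(xbar)).fit().params)          # (intercept, slope)
```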

My questions:

  1. Is there a more specific name for this kind of unpaired/unmatched data?

  2. More importantly, is there a literature associated with this? Can I do better than averaging my $x_i$ and $y_i$? One idea could be to match quantiles (over $i$) of the $y_{ij}$ to those of the $x_{ik}$ (a rough sketch of one version of this is given below), or to run the regression for many combinations of $y_{ij}$ and $x_{ik}$... Or to do some clustering?
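To make the quantile idea a bit more concrete, here is one possible (purely illustrative) reading of it: within each unit, pair the sorted $y$'s with that unit's $x$-quantiles at matching levels, then pool the pairs and run OLS. The function name and the choice of quantile levels are mine, it matches within each unit rather than over $i$, and it implicitly assumes a monotone increasing relationship:

```python
import numpy as np
import statsmodels.api as sm

def quantile_match_ols(x, y):
    """x: (N, K) array, y: (N, J) array with J <= K; the j-k pairing is unknown."""
    J = y.shape[1]
    y_sorted = np.sort(y, axis=1)                 # within-unit order statistics of y
    q = (np.arange(J) + 0.5) / J                  # quantile levels to match (one per y)
    x_matched = np.quantile(x, q, axis=1).T       # (N, J): within-unit x-quantiles
    return sm.OLS(y_sorted.ravel(),
                  sm.add_constant(x_matched.ravel())).fit()
```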

kjetil b halvorsen
Matifou
  • Quick question: Is it true that the $x$'s and the $y$'s are coming from the same "generators" (i.e. $x_{ik} \sim X$ for every $i$ and $k$, and $y_{ij} \sim Y$ for every $i$ and $j$)? If not, is it at least true that $x_{ik} \sim X_k$ for every $k$ and $y_{ij} \sim Y_j$ for every $j$? – Vasilis Vasileiou Nov 29 '17 at 18:07
  • I didn't think too much about this... I guess I would rather assume that $x_{ik}\sim f(\mu_i)$ and $y_{ij}\sim g(\mu_i)$. But clearly, I'll need to assume some true model (were the pairing observed), like $y_{ij}=\alpha + \beta x_{ij}+\epsilon_{ij}$ (possibly with $\alpha_i$). What are the implications of your question, @VasilisVasileiou? – Matifou Nov 29 '17 at 18:52
  • Just making sure - does $\beta$ depend on $j$? – Yair Daon Dec 04 '17 at 17:31
  • Maybe related: https://stats.stackexchange.com/questions/25941/t-test-for-partially-paired-and-partially-unpaired-data – kjetil b halvorsen Sep 05 '19 at 21:06
  • Good hint, thanks @kjetilbhalvorsen! – Matifou Sep 06 '19 at 17:53

3 Answers


For each Y_i randomly select a j, and from the corresponding X_i randomly select a k. Now fit the model Y ~ mX + c. Repeat the above for 1000 iterations, selecting new random values of j and k for each i every time. At the end, average the models and provide a confidence interval.
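A minimal sketch of this procedure, assuming the data are stored as an (N, K) array of x's and an (N, J) array of y's (the function name and the percentile summary are just illustrative choices):

```python
import numpy as np
import statsmodels.api as sm

def random_pairing_fits(x, y, n_iter=1000, seed=0):
    """Repeatedly pair one random y with one random x per unit and fit OLS."""
    rng = np.random.default_rng(seed)
    N, K = x.shape
    J = y.shape[1]
    rows = np.arange(N)
    coefs = np.empty((n_iter, 2))
    for it in range(n_iter):
        xs = x[rows, rng.integers(K, size=N)]     # one randomly chosen x per unit
        ys = y[rows, rng.integers(J, size=N)]     # one randomly chosen y per unit
        coefs[it] = sm.OLS(ys, sm.add_constant(xs)).fit().params
    estimate = coefs.mean(axis=0)                           # average over random pairings
    spread = np.percentile(coefs, [2.5, 97.5], axis=0)      # spread across pairings
    return estimate, spread
```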

show_stopper
  • Thanks for the answer. Can you clarify what you mean by "for each y_ij, fit a linear model with each x_ik"? If it's for each y_ij, then I'll have only one observation? And if I were to take all y_ij for a single fit, the issue is that the coefficients should be different (i.e. for $y_{1a}$ say $\beta_2$ should be positive, while for $y_{1b}$ it might be $\beta_1$ instead). Which underlying model do you have in mind? – Matifou Nov 29 '17 at 18:57
  • Given that you observe 3 y for each i and 10 x for each i, I am assuming that each of the 3 y is a different attribute, and each of the 10 x is a different attribute. For simplicity of notation, Y_j is the set of y observations over all individuals for a given j. Now you have Y_1, Y_2, Y_3. Do the same thing by defining X_1, X_2, ..., X_10 from x_ik. Now fit each Y_j using a given X_k, and select the X_k which has the most significant coefficient. – show_stopper Nov 29 '17 at 20:03
  • Oh, sorry for the confusion: these are repeated measures of a single variable, not single measures of multiple variables. – Matifou Nov 29 '17 at 20:23
  • OK, got it. Is there any other factor involved, such as time? I.e. are all observations of x and y taken at the same time, all taken at different times, or are all the observations of y taken at one time while all the observations of x are taken at another? – show_stopper Nov 29 '17 at 21:00
  • Time is not involved, all taken at the same time here. – Matifou Nov 30 '17 at 18:46
  • In this case you can do the following. For each Y_i randomly select a j, and from the corresponding X_i randomly select a k. Now fit the model Y ~ mX + c. Now repeat the above for 1000 iterations by randomly selecting values of j and k for each i. At the end, average the models and provide a confidence interval. Let me update the above answer with this. – show_stopper Nov 30 '17 at 19:00
  • I doubt this sensibly takes into account the variance of the measurements. – Neuneck Dec 05 '17 at 09:48

For every specimen $i$ you can compute means and error bars for your observations:

$\bar y_i \pm \sigma_{y,i}$ and $\bar x_i \pm \sigma_{x,i}$

It is then common (as implemented, for example, in the Python statsmodels package's weighted least squares (WLS) class) to perform a regression of the $\bar y_i$ on the $\bar x_i$ in which the contribution of each specimen to the error function is weighted inversely by the standard deviation of that specimen's observations.

This way the uncertainty in the $\bar x_i, \bar y_i$ is taken into account when regressing, the result is reproducible, and the limits for $K \to \infty, J \to \infty$ are sensible. Moreover, this method obviously scales to more specimens and more observations.
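A minimal sketch of such a weighted fit with statsmodels' WLS class, here weighting each specimen by the inverse of the estimated variance of its $\bar y$ (the function name and this particular weight definition are illustrative choices, not prescriptive):

```python
import numpy as np
import statsmodels.api as sm

def wls_on_means(x, y):
    """x: (N, K) array, y: (N, J) array of per-specimen observations."""
    xbar, ybar = x.mean(axis=1), y.mean(axis=1)
    var_ybar = y.var(axis=1, ddof=1) / y.shape[1]   # estimated variance of each specimen's y-mean
    weights = 1.0 / var_ybar                        # inverse-variance weights (assumes nonzero variance)
    return sm.WLS(ybar, sm.add_constant(xbar), weights=weights).fit()
```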

Neuneck

Using averages is just the between estimator from the literature on panel data analysis.

Using averages is also the main justification for variance weights, e.g. the explanation of aweights in Stata (as in Neuneck's answer).
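For concreteness, a minimal sketch of that weighting idea (illustrative only): regress the unit means with weights equal to the number of observations behind each mean, the analogue of Stata's aweights. With the question's constant $J=3$ the weights are all equal, so this collapses to plain OLS on the means.

```python
import numpy as np
import statsmodels.api as sm

def between_with_aweights(xbar, ybar, n_obs):
    """Regress unit means of y on unit means of x, weighting each unit by the
    number of observations behind its y-mean (analogue of Stata's aweights)."""
    X = sm.add_constant(np.asarray(xbar))
    return sm.WLS(np.asarray(ybar), X, weights=np.asarray(n_obs)).fit()
```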

Another related issue is estimation with aggregates, e.g. if you only observe group totals, where groups could be states, countries or industries.

Consistency is obtained as $N$ goes to infinity; $K$ and $J$ can be fixed and finite.

My guess is that you cannot do better under your assumption that the linear model is correctly specified, provided you have enough variation of $x$ across units. Without within-unit variation it would be a problem of unobserved heterogeneity, e.g. if there is a unit-specific effect, or if there is not enough between information, e.g. everyone has the same number of repeat observations for each level of a categorical variable.

Josef