
Background: I have two datasets coming from two different sources. Dataset A has m features (e.g. v1, v2, v3, v4) and roughly 1 million instances. Dataset B has n features, some of which are the same as in A plus some additional ones (e.g. v1, v2, v3, v4, v5, v6, v7, v8, v9), and 250,000 instances. Although B is the smaller dataset, it is more diverse than A.

Problem: I want to estimate, for dataset A, what the remaining features, i.e. v5, v6, v7, v8, and v9, would look like.

Approach: I intend to select a subset of B that is representative of A on the shared features (v1, v2, v3, and v4) in terms of range, mean, standard deviation, and distribution. How can I pick the best subset of B that mimics A?

I tried selecting the subset based on the range of each of the shared features in B, but I see that the resulting distributions of several features differ from those of A.

What is this type of problem called so I can do more research on this topic?

earthlink
  • Because you have asserted "m is a subset of n"--and it would seem that should be interpreted as meaning A is a subset of B--then A would be the best subset of B that mimics A. Something's obviously wrong with this interpretation. Could you edit this post to clear up that confusion? – whuber Jun 19 '15 at 14:04
  • Thank you for the edits. I still cannot come up with an alternative interpretation. You now show explicitly how A is a subset of B and you still ask to select a subset from all of B that is most like A. Again, A obviously is the answer. – whuber Jun 19 '15 at 14:29
  • @whuber thanks. The problem is that A does not come from B. I want to select a subset from B that is representative of A and then evaluate features v5, v6, v7, v8, v9. – earthlink Jun 19 '15 at 15:31
  • I'm starting to follow, but your notation is not helping. Are you saying that A consists of $m$ features--let's call them $A_1, A_2, \ldots, A_m$--and B consists of $n\gt m$ possibly different features $B_1,B_2,\ldots,B_n$? Then you wish to select an $m$-subset of $\{1,2,\ldots,n\}$, say $\{i_1,i_2,\ldots,i_m\}$, for which $B_{i_1}, B_{i_2}, \ldots, B_{i_m}$ best "mimics" A? If so, please adjust your notation to make this evident. At the same time, give us some details concerning what it really means for one set to "mimic" or "be representative of" another one. How is that to be measured? – whuber Jun 19 '15 at 16:01

3 Answers


This problem seems to be the same as selecting control groups for case-control studies. In the famous Doll and Hill study of smoking and lung cancer, the authors identified patients with lung cancer (the cases) and then examined a control group "deliberately selected to be closely comparable in age and sex with the carcinoma of the lung patients." The higher incidence of smoking, particularly heavy smoking, among the lung cancer cases provided important evidence of the relation between that behavior and the disease.

So in your situation, group A represents the cases, and group B is the pool from which you will choose your controls.

This is not an easy problem in general, as it requires careful definition of what you mean by "representative." This Cross Validated page provides a good entry into algorithms for making such choices.
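To make "choosing controls" concrete, here is a minimal sketch of the crudest such algorithm, greedy 1-nearest-neighbor matching on the shared features. The array names and random data are hypothetical stand-ins for your datasets; the purpose-built methods discussed on that page are far more careful (calipers, matching without replacement, balance diagnostics, and so on).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
A_shared = rng.normal(size=(1000, 4))        # stand-in: v1..v4 for the "cases" (dataset A)
B_shared = rng.normal(size=(500, 4)) + 0.3   # stand-in: v1..v4 for the control pool (dataset B)

# Standardize so no single feature dominates the distance metric.
scaler = StandardScaler().fit(B_shared)
A_s, B_s = scaler.transform(A_shared), scaler.transform(B_shared)

# For each case in A, find its nearest neighbor in B; the union of
# matched rows is a crude "control group" drawn from B.
nn = NearestNeighbors(n_neighbors=1).fit(B_s)
_, idx = nn.kneighbors(A_s)
control_rows = np.unique(idx.ravel())        # row indices into B
```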

EdM

EdM's answer seems like a good match, and you should absolutely look into that literature. Here's a possible alternative approach to explore.

One reasonable way of measuring distances between distributions is known as the maximum mean discrepancy (MMD); a thorough overview is given by Gretton, Borgwardt, Rasch, Schölkopf, and Smola, A Kernel Two-Sample Test, JMLR 2012. The basic idea is to embed distributions into a reproducing kernel Hilbert space (RKHS), and then compute distances between those embeddings in the RKHS. If the kernel of the RKHS is $k(x, y) = \langle \varphi(x), \varphi(y) \rangle$, you can estimate the distance between the distributions as (letting $n = |A|$, $m = |B|$): $$\Big\| \frac1n \sum_{i=1}^n \varphi(A_i) - \frac1m \sum_{j=1}^m \varphi(B_j)\Big\|.$$

You can compute this using only $k$ via the kernel trick, but for large-scale problems it's easier to use a finite-dimensional approximation $z$ with $z(x) \in \mathbb{R}^{2D}$ such that $z(x)^T z(y) \approx \langle \varphi(x), \varphi(y) \rangle$. A popular example of such an approximation was given by Rahimi and Recht, Random Features for Large-Scale Kernel Machines, NIPS 2007. If you use the very popular Gaussian kernel $k(x, y) = \exp\left( - \frac{1}{2 \sigma^2} \lVert x - y \rVert^2 \right)$, then $k(x, y) \approx z(x)^T z(y)$ where $$ z(x) = \frac{1}{\sqrt{D}} \begin{bmatrix} \sin(\omega_1^T x) & \cos(\omega_1^T x) & \dots & \sin(\omega_D^T x) & \cos(\omega_D^T x) \end{bmatrix} $$ for $\omega_i \sim N(0, \frac{1}{\sigma^2} I)$ (using the same $\omega$ values for each $x$). (This isn't exactly the version given in the linked version of the paper, but this version is better.) The estimate of the MMD is thus as above, except with $z$ instead of $\varphi$.

You'll need to pick an embedding dimension $D$; for this scale of data, maybe 5000 is good. You'll also need to pick a $\sigma$; a reasonable rule of thumb is maybe the median of the pairwise distances between elements of $B$.
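Here's a minimal numpy sketch of that estimator. The arrays `A` and `B`, their sizes, and the small $D$ are stand-ins (the text above suggests $D \approx 5000$ for data at your scale):

```python
import numpy as np

def rff(X, omegas):
    """Random Fourier features z(X): sin/cos pairs scaled by 1/sqrt(D)."""
    proj = X @ omegas.T                              # (num_points, D)
    D = omegas.shape[0]
    return np.hstack([np.sin(proj), np.cos(proj)]) / np.sqrt(D)

rng = np.random.default_rng(0)
A = rng.normal(size=(5000, 4))                       # stand-in for A's v1..v4
B = rng.normal(size=(1250, 4)) + 0.3                 # stand-in for B's v1..v4

# Median heuristic for sigma, on a subsample of B to keep it cheap.
sub = B[rng.choice(len(B), size=500, replace=False)]
pairwise = np.sqrt(((sub[:, None, :] - sub[None, :, :]) ** 2).sum(-1))
sigma = np.median(pairwise[pairwise > 0])

D = 1000                                             # number of random frequencies
omegas = rng.normal(scale=1.0 / sigma, size=(D, A.shape[1]))  # omega ~ N(0, I / sigma^2)

zA, zB = rff(A, omegas), rff(B, omegas)
mmd_hat = np.linalg.norm(zA.mean(axis=0) - zB.mean(axis=0))
print(f"estimated MMD between A and B: {mmd_hat:.4f}")
```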

Now, you want to find the subset of $B$ whose distribution is closest to that of $A$. We can formalize the problem by defining a vector $\alpha \in \{0, 1\}^m$, where $\alpha_i$ determines whether $B_i$ is included:

\begin{align}\DeclareMathOperator*{\argmin}{\arg\min} \alpha^* &= \argmin_{\alpha \in \{0, 1\}^m} \Big\| \frac{1}{n} \sum_{i=1}^n z(A_i) - \frac{1}{1^T \alpha} \sum_{i=1}^m \alpha_i z(B_i) \Big\|^2 \end{align} where we treat the objective as infinite when $\alpha = 0$. Now, let $Y_i = z(B_i) - \frac{1}{n} \sum_{j=1}^n z(A_j)$, i.e. $Y_i$ is the offset of $B_i$'s embedding from our target vector, the mean embedding of $A$. Because the weights $\alpha_i / (1^T \alpha)$ sum to one, the objective above is exactly the squared norm of the weighted mean offset, so we want an assignment $\alpha$ that minimizes it: \begin{align} \alpha^* &= \argmin_{\alpha \in \{0, 1\}^m} \Big\| \frac{1}{1^T \alpha} \sum_{i=1}^m \alpha_i Y_i \Big\|^2 = \argmin_{\alpha \in \{0, 1\}^m} \left\lVert \frac{Y \alpha}{1^T \alpha} \right\rVert^2 \end{align} where $Y = \begin{bmatrix} Y_1 & \dots & Y_m \end{bmatrix}$.

The one-dimensional, integer-valued version of this problem is pretty close to subset sum, so it's probably NP-hard; we'll have to approximate. If we fix $1^T \alpha$, it becomes a binary quadratic program. There's been some work on solving these approximately, seemingly mostly (from a quick search) in the computer vision community, but it still seems hard to solve.

But: in your case, I think you'd be happy with weights assigned to the elements of $B$, yes? You could then do weighted density estimation (or whatever you like) to estimate the distribution of the unknown components of $A$. In that case, relax $\alpha$ to be continuous and fix $1^T \alpha = 1$. The problem becomes $$ \argmin_{\alpha \in [0, 1]^m} \left\lVert Y \alpha \right\rVert^2 \quad\text{s.t.}\; 1^T \alpha = 1 .$$ This is a simple quadratic program with linear constraints, for which there are many solvers.
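For instance, here is a sketch of that QP with the cvxpy modeling library; the embedded arrays are random stand-ins with the shapes the random-features sketch above would produce:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the embedded data (n x 2D and m x 2D feature matrices).
zA = rng.normal(size=(1000, 200))
zB = rng.normal(size=(500, 200)) + 0.1
Y = zB.T - zA.mean(axis=0)[:, None]      # columns are Y_i = z(B_i) - mean_j z(A_j)

m = Y.shape[1]
alpha = cp.Variable(m, nonneg=True)      # alpha >= 0; with sum == 1 this implies alpha <= 1
prob = cp.Problem(cp.Minimize(cp.sum_squares(Y @ alpha)),
                  [cp.sum(alpha) == 1])
prob.solve()
weights = alpha.value                    # weights on the rows of B, summing to 1
```

Rows of $B$ with weight near zero are effectively dropped, and the remaining weights can feed directly into a weighted density estimate of $v_5, \dots, v_9$.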


I actually just remembered that Hino and Murata, Information estimators for weighted observations, Neural Networks 2013 did something similar in their section 5.2, "Application to distribution-preserving data compression" using their nearest-neighbor-type estimator of the KL divergence. The problem there is more difficult both computationally and in terms of effort to code it up, though. (They don't say this, but it's actually a convex maximization, so they're not even guaranteed to get the global optimum.)


Another approach entirely is to think of this as a regression problem. Train a predictor for $(v_5, \dots, v_9)$ using $(v_1, \dots, v_4)$ as features on the training set $B$, and then apply the predictor to each data point from $A$. This is in some sense "more work" than what you're trying to do, but it might be more useful in the end. You could use transductive approaches to do so; Quadrianto, Patterson, and Smola, Distribution Matching for Transduction, NIPS 2009 even used an approach similar to the distribution matching scheme above to try to solve this problem.
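For concreteness, a minimal scikit-learn sketch of the plain (non-transductive) version of this idea; the array names and the choice of a random forest are just assumptions, and any multi-output regressor would do:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Stand-ins: in B we observe v1..v4 and v5..v9; in A only v1..v4.
B_shared = rng.normal(size=(500, 4))     # v1..v4 in B
B_rest = rng.normal(size=(500, 5))       # v5..v9 in B
A_shared = rng.normal(size=(1000, 4))    # v1..v4 in A

# RandomForestRegressor handles multi-output targets natively.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(B_shared, B_rest)
A_rest_pred = model.predict(A_shared)    # predicted v5..v9 for each row of A
```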

Danica

Thanks to EdM: this type of problem is called matching. I found an excellent R package called Matching for this purpose, as well as an excellent resource in the related document "Genetic Matching for Estimating Causal Effects".

earthlink