EdM's answer seems like a good match, and you should absolutely look into that literature. Here's a possible alternative approach to explore.
One reasonable way of measuring distances between distributions is known as the maximum mean discrepancy (MMD); a thorough overview is given by Gretton, Borgwardt, Rasch, Schölkopf, and Smola, A Kernel Two-Sample Test, JMLR 2012. The basic idea is to embed distributions into a reproducing kernel Hilbert space (RKHS), and then get distances between those distributions in that RKHS. If the kernel of the RKHS is $k(x, y) = \langle \varphi(x), \varphi(y) \rangle$, you can estimate the distance between the distributions as (letting $n = |A|$, $m = |B|$):
$$\Big\| \frac1n \sum_{i=1}^n \varphi(A_i) - \frac1m \sum_{j=1}^m \varphi(B_j)\Big\|.$$
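(Expanding the square of this norm shows where the kernel trick comes in: everything reduces to kernel evaluations,
$$
\Big\| \frac1n \sum_{i=1}^n \varphi(A_i) - \frac1m \sum_{j=1}^m \varphi(B_j)\Big\|^2
= \frac{1}{n^2} \sum_{i, i'=1}^{n} k(A_i, A_{i'})
- \frac{2}{n m} \sum_{i=1}^{n} \sum_{j=1}^{m} k(A_i, B_j)
+ \frac{1}{m^2} \sum_{j, j'=1}^{m} k(B_j, B_{j'}).)
$$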
So you can compute this using only $k$ via the kernel trick,
but for large-scale problems it's easier to use a finite-dimensional approximation $z$ with $z(x) \in \mathbb{R}^D$
such that $z(x)^T z(y) \approx \langle \varphi(x), \varphi(y) \rangle$.
A popular example of such an approximation was given by Rahimi and Recht, Random Features for Large-Scale Kernel Machines, NIPS 2007. If you use the very popular Gaussian kernel $k(x, y) = \exp\left( - \frac{1}{2 \sigma^2} \lVert x - y \rVert^2 \right)$,
then
$k(x, y) \approx z(x)^T z(y)$
where
$$
z(x) = \sqrt{\tfrac{2}{D}} \begin{bmatrix} \sin(\omega_1^T x) & \cos(\omega_1^T x) & \dots & \sin(\omega_{D/2}^T x) & \cos(\omega_{D/2}^T x) \end{bmatrix}
$$
for $\omega_i \sim N\left(0, \frac{1}{\sigma^2} I\right)$ (using the same $\omega$ values for each $x$).
(This isn't exactly the form given in the linked version of the paper, but this variant works better.)
The estimate of the MMD is thus as above, except with $z$ instead of $\varphi$.
You'll need to pick an embedding dimension; for this scale of data, maybe 5000 is good.
You'll also need to pick a $\sigma$; a reasonable rule of thumb is the median of the pairwise distances between elements of $B$ (the so-called median heuristic).
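Here's a minimal numpy sketch of this embedding step; the `A` and `B` arrays below are just placeholders for your real data, and the subsample size used for the median heuristic is arbitrary:

```python
import numpy as np

def fourier_features(X, omegas):
    # Random Fourier features: z(x)^T z(y) approximates the Gaussian kernel.
    # (All sines then all cosines; the ordering doesn't affect inner products.)
    proj = X @ omegas.T                                   # (num_points, D/2)
    D = 2 * omegas.shape[0]
    return np.sqrt(2.0 / D) * np.hstack([np.sin(proj), np.cos(proj)])

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 4))        # placeholders -- substitute your actual A and B
B = rng.normal(size=(5000, 4))

# Median heuristic for sigma, computed on a subsample of B to keep it cheap.
sub = B[rng.choice(len(B), size=500, replace=False)]
pairwise = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
sigma = np.median(pairwise[pairwise > 0])

D = 5000                                                  # embedding dimension
omegas = rng.normal(scale=1.0 / sigma, size=(D // 2, A.shape[1]))

z_A = fourier_features(A, omegas)                         # (n, D)
z_B = fourier_features(B, omegas)                         # (m, D)
mmd_estimate = np.linalg.norm(z_A.mean(axis=0) - z_B.mean(axis=0))
```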
Now, you want to find the subset of $B$ whose distribution is closest to that of $A$. We can write the problem by defining a vector $\alpha \in \{0, 1\}^m$ where $\alpha_i$ determines if $B_i$ is included or not:
\begin{align}\DeclareMathOperator*{\argmin}{\arg\min}
\alpha^*
&= \argmin_{\alpha \in \{0, 1\}^m} \Big\| \frac{1}{n} \sum_{i=1}^n z(A_i) - \frac{1}{1^T \alpha} \sum_{i=1}^m \alpha_i z(B_i) \Big\|^2
\end{align}
where we treat the objective as infinite when $\alpha = 0$, so that the empty subset is never chosen.
Now, let $Y_i = z(B_i) - \frac{1}{n} \sum_{j=1}^n z(A_j)$,
i.e. $Y_i$ is the offset of $B_i$'s embedding from our target mean embedding,
and we want to find an assignment $\alpha$ that minimizes the norm of the mean of the selected $Y_i$:
\begin{align}
\alpha^*
&= \argmin_{\alpha \in \{0, 1\}^m} \Big\| \frac{1}{1^T \alpha} \sum_{i=1}^m \alpha_i Y_i \Big\|^2
= \argmin_{\alpha \in \{0, 1\}^m} \left\lVert \frac{Y \alpha}{1^T \alpha} \right\rVert^2
\end{align}
where $Y = \begin{bmatrix} Y_1 & \dots & Y_m \end{bmatrix}$.
The 1d, integer-valued data version of this problem is pretty close to subset sum, so it's probably NP-hard. We'll have to approximate.
If we fix $1^T \alpha$, then this becomes a binary quadratic program. There's been some work on solving these approximately, seemingly mostly (in a quick search) from the computer vision community. It still seems hard to solve, though.
But: in your case, I think you'd be happy with weights assigned to $B$, yes? You could then do weighted density estimation or whatever to try to figure out the distribution of the unknown components of $A$.
In that case, relax $\alpha$ to take fractional values in $[0, 1]$, and fix the normalization $1^T \alpha = 1$.
Our problem becomes
$$
\argmin_{\alpha \in [0, 1]^m}
\left\lVert Y \alpha \right\rVert^2
\quad\text{s.t.}\;
1^T \alpha = 1
.$$
This is a simple quadratic program with linear constraints, for which there are many solvers.
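For instance, here's a minimal sketch with cvxpy (any QP solver would do), reusing the `z_A` and `z_B` arrays from the sketch above:

```python
import cvxpy as cp

# Columns of Y are the offsets Y_i = z(B_i) - mean_j z(A_j), as defined above.
Y = (z_B - z_A.mean(axis=0)).T                # (D, m)
m = Y.shape[1]

alpha = cp.Variable(m, nonneg=True)           # alpha_i <= 1 is implied by
problem = cp.Problem(                         # nonnegativity plus the sum constraint
    cp.Minimize(cp.sum_squares(Y @ alpha)),
    [cp.sum(alpha) == 1],
)
problem.solve()                               # for very large m you may need to subsample
weights = alpha.value                         # weights on B, summing to 1
```

The resulting `weights` can then go into a weighted density estimate; for example, `scipy.stats.gaussian_kde` accepts a `weights` argument.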
I actually just remembered that Hino and Murata, Information estimators for weighted observations, Neural Networks 2013 did something similar in their section 5.2, "Application to distribution-preserving data compression" using their nearest-neighbor-type estimator of the KL divergence. The problem there is more difficult both computationally and in terms of effort to code it up, though. (They don't say this, but it's actually a convex maximization, so they're not even guaranteed to get the global optimum.)
Another approach entirely is to think of this as a regression problem. Train a predictor for $(v_5, \dots, v_9)$ using $(v_1, \dots, v_4)$ as features on the training set $B$, and then apply the predictor to each data point from $A$. This is in some sense "more work" than what you're trying to do, but it might be more useful in the end. You could use transductive approaches to do so; Quadrianto, Petterson, and Smola, Distribution Matching for Transduction, NIPS 2009 even used an approach similar to the distribution matching scheme above to try to solve this problem.
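As a rough sketch of the plain (non-transductive) regression version, assuming $B$ is stored as an $(m, 9)$ array with columns $v_1, \dots, v_9$ and `A_obs` holds $A$'s observed columns $v_1, \dots, v_4$ (both arrays below are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
B = rng.normal(size=(5000, 9))            # placeholder for the fully observed sample
A_obs = rng.normal(size=(1000, 4))        # placeholder for A's observed columns v_1..v_4

# Multi-output regression: learn (v_5, ..., v_9) from (v_1, ..., v_4) on B,
# then plug in A's observed columns to predict its missing ones.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(B[:, :4], B[:, 4:])
A_predicted = model.predict(A_obs)        # shape (n, 5)
```

The transductive approaches mentioned above additionally use the unlabeled $A$ points while fitting, rather than training on $B$ alone.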