What are state of the art methods for creating embeddings for sets?

Question

I want to create embeddings in $R^D$ for sets. So I want a function (probably a neural network) that takes in a set $ S = \{ s_1, \dots, s_n \} $ (ideally of any size, so the number of elements may vary, but anything is good) and produces a vector. Ideally, the set embedding function is order-invariant (the way sets are), so a plain LSTM isn't quite what I want (since that's for sequences), unless it is modified, ideally in a way that is referenced in some published paper.

$$ f_{\theta}(S) = e_S \in R^D$$

What are the state-of-the-art (SOTA) methods for this task?

The silliest method I know is to just embed each element separately and then take the sum, so:

$$ f_{\theta}(S) = \sum_i g(s_i) $$

or perhaps better with some sort of attention:

$$ f_{\theta}(S) = \sum_i \alpha_i(S) g(s_i) $$
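To make the attention idea concrete, here is a minimal sketch in plain Python. It assumes the attention weight is per element, $\alpha_i(S)$, obtained as a softmax over dot-product scores between each embedding $g(s_i)$ and a query vector. The embedding `g`, the weight matrix `w`, and `query` are hypothetical placeholders (in practice they would be learned), not something taken from a particular paper.

```python
import math

def g(x, w):
    """Hypothetical per-element embedding: tanh of a linear map (no bias)."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]

def attention_pool(elements, w, query):
    """Attention-weighted sum of element embeddings.

    Assumes a non-empty set. The weights alpha_i are a softmax over the
    scores query . g(s_i), so they depend on the whole set but not on the
    order in which the elements are listed.
    """
    embs = [g(x, w) for x in elements]
    scores = [sum(q * e for q, e in zip(query, emb)) for emb in embs]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(query)
    return [sum(a * emb[d] for a, emb in zip(alphas, embs)) for d in range(dim)]

# Illustrative (made-up) parameters: 2-D inputs, 3-D embeddings.
w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.7]]
query = [1.0, -1.0, 0.5]

s = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
e_s = attention_pool(s, w, query)
e_s_shuffled = attention_pool([s[2], s[0], s[1]], w, query)
```

Up to floating-point rounding, `e_s` and `e_s_shuffled` agree, and the output dimension does not depend on the number of elements.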

but ideally if something is already a paper then it's already been tested better than my random idea...

BTW, the only thing I am aware of is this paper: https://arxiv.org/abs/1606.04080, but it seems rather old (2016), and as of writing this question it is 2020.

score 3 · Answer 1 · answered Mar 15 '20 at 13:24

Indeed, in the last two to three years there have been some important publications on this topic. I do not know all of them, and cannot give a complete survey of the current status. One important paper, though, is Deep Sets, which presented a canonical architecture for dealing with such problems.

The main problem is to find an architecture that is able to deal with an input sequence of variable length, whose ordering is irrelevant, and the output of the network should be invariant to permutations of it.

It is shown that one can achieve this with a set function of the form

$$ f(\mathcal{X}) = \rho\left( \sum_{x \in \mathcal{X}} \phi(x) \right) $$

The idea is that we embed each set element with a feature mapping $\phi$, and aggregate the embeddings into an invariant, fixed-size description. In this case the aggregation is a sum, but in general it could be any other permutation-invariant operation. Then a final function $\rho$ processes the aggregate.

This is represented in the following figure
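The $\rho\left(\sum_x \phi(x)\right)$ structure can also be sketched in a few lines of plain Python. This is only an illustration under assumptions: `DeepSetEncoder`, the single-layer `phi` and `rho`, and the random untrained weights are hypothetical choices made here, not the architecture or code from the Deep Sets paper.

```python
import math
import random

def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) of a dense layer with random values in [-1, 1]."""
    w = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [rng.uniform(-1, 1) for _ in range(out_dim)]
    return w, b

def linear(layer, x):
    """Apply a dense layer: w @ x + b."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

class DeepSetEncoder:
    """Hypothetical minimal encoder of the form rho(sum_x phi(x))."""

    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.phi_layer = make_linear(in_dim, hidden_dim, rng)
        self.rho_layer = make_linear(hidden_dim, out_dim, rng)

    def phi(self, x):
        # Per-element feature map: one dense layer followed by tanh.
        return [math.tanh(t) for t in linear(self.phi_layer, x)]

    def rho(self, z):
        # Post-aggregation map: one dense layer.
        return linear(self.rho_layer, z)

    def __call__(self, elements):
        # Sum pooling: independent of element order and of set size.
        pooled = [0.0] * self.hidden_dim
        for x in elements:
            pooled = [p + h for p, h in zip(pooled, self.phi(x))]
        return self.rho(pooled)

enc = DeepSetEncoder(in_dim=2, hidden_dim=4, out_dim=3)
s = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
e_s = enc(s)
e_s_reversed = enc(list(reversed(s)))
e_small = enc([[0.1, 0.2]])
```

Here `e_s` and `e_s_reversed` agree up to floating-point rounding, and `e_small` has the same dimension as `e_s` even though the set has a different size. Replacing the sum with a mean or an element-wise max would keep the permutation invariance.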

Thanks a lot, that's helpful to me. But the reasoning in that paper seems to have a problem. First, in the continuum-case it made the restriction that the # of elements is fixed and finite. This is an acceptable restriction. Then it proves that any invariant function must be of the form of the formula you cited. But in practice, when the theorem is applied, $\rho$ would be FIXED and then there'd be no guarantee that such a family of functions is dense in the function space desired. In other words, once $\rho$ is fixed then there may be some invariant functions that can't be approximated. — Yan King Yin, Mar 27 '20 at 03:53
do you know how that compares to transformers (without positional encodings)? seems they have tried it! https://arxiv.org/abs/1810.00825 — Charlie Parker, Mar 08 '21 at 01:00
