I'm trying to design a Mixture of Experts where only one expert network is active at a time. Suppose that we have 10 experts. I want to train the MoE such that only one of the experts is active for a given feature vector.
How should I design this? Off the top of my head: perhaps one way is to have a normal gating mechanism, use that gating mechanism to assign probabilities to each expert, and then pick the expert with the highest probability.
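To make the idea concrete, here's a rough sketch of what I mean (PyTorch, with placeholder expert/gate architectures -- not something I've settled on):

```python
import torch
import torch.nn as nn

class HardRoutedMoE(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_experts=10):
        super().__init__()
        # Gating network: produces one logit per expert.
        self.gate = nn.Linear(input_dim, num_experts)
        # Placeholder experts: small MLPs, just for illustration.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, output_dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        probs = torch.softmax(self.gate(x), dim=-1)   # per-expert probabilities
        chosen = probs.argmax(dim=-1)                 # hard pick: one expert per input
        out = torch.stack([self.experts[chosen[i]](x[i]) for i in range(x.size(0))])
        return out, chosen
```

(The argmax pick is non-differentiable, which is part of why I'm unsure this is the right design.)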
However, the downside of this approach is that the gating mechanism isn't trained to pick only one expert -- the experts were trained to work in cooperation with each other. So if I used my proposed approach, I would get bad predictions.
TLDR: How do I design a Mixture of Experts that only has one expert active for a given input?