I'm trying to design a Mixture of Experts where only one expert network is active at a time. Suppose that we have 10 experts. I want to train the MoE such that only one of the experts is active for a given feature vector.
How should I design this? Off the top of my head: perhaps one way is to have a normal gating mechanism, use that gating mechanism to assign probabilities to each expert, and then pick the expert with the highest probability.
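To make the idea concrete, here's a rough sketch of what I mean (PyTorch, with placeholder expert/gate architectures -- not something I've settled on):

```python
import torch
import torch.nn as nn

class HardRoutedMoE(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_experts=10):
        super().__init__()
        # Gating network: produces one logit per expert.
        self.gate = nn.Linear(input_dim, num_experts)
        # Placeholder experts: small MLPs, just for illustration.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, output_dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        probs = torch.softmax(self.gate(x), dim=-1)   # per-expert probabilities
        chosen = probs.argmax(dim=-1)                 # hard pick: one expert per input
        out = torch.stack([self.experts[chosen[i]](x[i]) for i in range(x.size(0))])
        return out, chosen
```

(The argmax pick is non-differentiable, which is part of why I'm unsure this is the right design.)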
However, the downside of this approach is that the gating mechanism isn't trained to pick only one expert -- the experts were trained to work in cooperation with each other. So if I used my proposed approach, I would get bad predictions.
TLDR: How do I design a Mixture of Experts that only has one expert active for a given input?