
I'm training a fully connected feed-forward neural network for regression. Given one training example $(x_i, y_i)$, I need to convert the raw representation $x_i$ into an invariant representation $x_i'$ of higher dimension by some non-linear transformation $$x_i' = f_{\theta}(x_i),$$ where $\theta$ is the hyperparameter that determines the transformation. The dimensionalities of $x_i$ and $x_i'$ are about $150$ and $4500$, respectively.

This transformation step cannot be accomplished by a neural network layer, because the original representation is not invariant under translation. The original representation has to be transformed by a map controlled by $\theta$, and that map is constructed from domain knowledge.
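For concreteness, here is a toy sketch of the kind of map I mean (the pairwise-distance expansion, the body count, and the cutoff radius below are only illustrative, not my actual map):

```python
import numpy as np
from scipy.spatial.distance import pdist

def invariant_features(x, theta, n_basis=4, r_max=10.0):
    """Toy f_theta: flattened coordinates of n bodies -> translation-invariant
    features. Pairwise distances ignore the center of mass; each distance is
    expanded in Gaussian basis functions whose width is set by theta."""
    coords = x.reshape(-1, 3)                  # n bodies in 3D
    d = pdist(coords)                          # n*(n-1)/2 pairwise distances
    centers = np.linspace(0.0, r_max, n_basis)
    return np.exp(-(d[:, None] - centers) ** 2 / theta).ravel()

x = np.random.rand(150)   # 50 bodies -> 1225 distances * 4 basis = 4900 features
print(invariant_features(x, theta=0.5).shape)
```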

Choosing a good hyperparameter $\theta$ is not easy, so I wonder whether dimensionality reduction can help me make a good choice.

My idea is:

If a resulting representation $\{x_i'\}_{i=1}^{N}$ preserves much of the topological structure of the raw representation $\{x_i\}_{i=1}^{N}$, then it is reasonable to believe it will be good for regression. The topological structure of the raw input can be probed with domain knowledge; to probe the structure of the resulting representation, I decided to use t-SNE because of its non-linear nature.

However, by playing with t-SNE on this beautiful site, I found that the result t-SNE gives is quite sensitive to the choice of perplexity, number of iterations, and learning rate, and it seems that only qualitative conclusions can be drawn by plotting the result.
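The same sensitivity is easy to reproduce locally with scikit-learn (random data below as a stand-in for my $x_i'$):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((500, 4500))    # stand-in for the transformed representations x_i'

# The same data can produce visibly different embeddings as perplexity varies,
# which makes quantitative comparisons between choices of theta difficult.
for perplexity in (5, 30, 100):
    emb = TSNE(n_components=2, perplexity=perplexity,
               learning_rate=200.0, random_state=0).fit_transform(X)
    print(perplexity, emb.min(axis=0), emb.max(axis=0))
```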

Is my idea reasonable? Is there a better choice for a quick dimensionality reduction?

meTchaikovsky
    Could you clarify your "need" to convert the raw representation into something lower dimensional? A neural network will do exactly that, and likely better than tSNE as the training is supervised. If you're trying to cheaply and significantly widen your feature space, try a gradient-boosted-random-forest instead. – Alex R. Jan 11 '19 at 00:29
  • @Alex R I have explained why I need to convert the representation in the post, the reason is the representation has to be made invariant under translation, then the converted representation is taken to be the neural network input. – meTchaikovsky Jan 11 '19 at 00:40
  • Why can’t the neural network use convolutional layers, which absolutely would be translation invariant? – Alex R. Jan 11 '19 at 01:56
  • @Alex R. I don't know too much about CNNs, but I guess those are two different kinds of translation. The original data vectors here are Cartesian coordinates of the $n$ bodies that constitute some system, so the translation I'm referring to is translating the center of mass of the system. – meTchaikovsky Jan 11 '19 at 02:04
  • Can you say more about what it means to 'probe the topological structure'? How exactly do you want to use the t-SNE outputs? – user20160 Jan 11 '19 at 15:44
  • I think the question: *[Why is t-SNE not used as a dimensionality reduction technique for clustering or classification?](https://stats.stackexchange.com/questions/340175)* is very relevant to this question. – usεr11852 Jan 12 '19 at 23:11

1 Answer


No, I'm afraid your idea is not reasonable, for two reasons. Firstly, t-SNE is a nice tool for decorative pictures, but it doesn't give you any reliable information to base decisions on. As you already noticed, it is very sensitive to the choice of parameters. Different parameters lead to different images, which in turn lead to different conclusions regarding virtually all properties you might want to read off such a plot, and you have no way of knowing which of these parameter choices leads to a correct conclusion (if any). You don't have to take my word for it; you can find plenty of very simple examples here.

Secondly, and more constructively, think about why you need to evaluate the choice of $\theta$ in a separate step at all. Your mapping goes from a lower-dimensional space into a higher-dimensional one, so it is plausible to assume that you're not losing information, i.e., that different values of $x_i$ lead to different values of $x_i'$ (of course, you have to check that for your concrete mapping).
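A crude spot-check might look like this (the map below is a placeholder for your actual $f_\theta$):

```python
import numpy as np
from scipy.spatial.distance import pdist

def f_theta(x, theta):
    """Placeholder for the actual domain-specific map; substitute your own."""
    return np.tanh(theta * x).repeat(30)   # toy 150 -> 4500 expansion

# Verify that distinct inputs don't collapse to (nearly) the same representation.
rng = np.random.default_rng(0)
X = rng.random((200, 150))
Xp = np.stack([f_theta(x, theta=0.5) for x in X])
print("smallest distance between mapped inputs:", pdist(Xp).min())
```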

This means that you might even be able to choose an arbitrary value for $\theta$ and simply trust that your feed-forward network will figure out the meaning of the resulting representation. So, in all likelihood, we're not talking about a situation where you have to find one of very few values of $\theta$ to make your regression work at all. It is much more likely that the choice of $\theta$ has a gradual influence on the outcome: with some choices, the neural network will perform better than with others.

Of course, choosing just one arbitrary value is only a thought experiment. In practice, I would recommend not choosing $\theta$ in a separate step before fitting the neural network, but rather treating it as an additional hyperparameter to be optimized along with the network's other hyperparameters (such as the number of layers).
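A minimal sketch of that joint search (plain random search, with placeholder data, map, and ranges; any hyperparameter-tuning framework would do):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

def f_theta(x, theta):
    """Placeholder for the domain-specific invariant map."""
    return np.tanh(theta * x).repeat(30)       # toy 150 -> 4500 expansion

rng = np.random.default_rng(0)
X_raw, y = rng.random((300, 150)), rng.random(300)   # stand-in data

best_score, best_cfg = -np.inf, None
for _ in range(20):                     # search theta and network width jointly
    theta = 10 ** rng.uniform(-2, 1)    # search range is a guess
    hidden = int(rng.choice([64, 128, 256]))
    Xp = np.stack([f_theta(x, theta) for x in X_raw])
    score = cross_val_score(MLPRegressor(hidden_layer_sizes=(hidden,),
                                         max_iter=300),
                            Xp, y, cv=3).mean()
    if score > best_score:
        best_score, best_cfg = score, (theta, hidden)
print("best (theta, hidden units):", best_cfg)
```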

MightyCurious