Variable selection with tree-structured covariates?

Question

Suppose I want to fit a regression in which one categorical predictor has an inherent tree structure.

Using an example from my field, linguistics: suppose I'm trying to predict a binary response (say, passive vs. active voice), and one of the predictors is the semantics of the agent (is it human or nonhuman, animate or inanimate, etc.). Instead of using known factors from the literature (the traditional way linguists approach this problem), I'd like to infer the relevant semantic information from WordNet, a database that stores semantic relations. The hypernymy/hyponymy relations in the database would give a tree structure. I'd like to know the optimal level of granularity in the agent's lexical semantics for predicting whether a sentence uses the active or the passive, so I'm trying to do variable selection. For example, it might turn out that I need a hugely complicated model where every single agent affects the log-odds of passivity in some way, or that all I need is a binary animate/inanimate distinction, or perhaps an animate/inanimate distinction followed by a human/animal distinction under animate.*

How should I approach this problem? One thing that immediately comes to mind is that I would have to remove all the 'only children' in the tree, since they would be perfectly collinear with their parent node. Then perhaps, instead of one-hot encoding, I could do the dummy coding in a way that reflects the tree structure. (For example, if there's something shaped like this (A is not the root; this is a subtree):

      A
    /   \
   B    C
  /|\  
 D E F

my variables would be I(word is under A), I(word is under B), I(word is under D), I(word is under E).) Finally, after fitting the model with the LASSO or elastic net to select variables, I would take the dummy variables that remain and construct a decision tree for the effect of agent semantics (for better interpretability). (I'm not sure whether this is the best way, though.)
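To make the coding concrete, here is a minimal sketch of what I have in mind, using only the toy subtree from the diagram. The child-to-parent dictionary and the function names (`ancestors_or_self`, `indicator_nodes`, `encode`) are made up for illustration; none of this is WordNet's API, and choosing the last child of each node as the reference level is just an assumption.

```python
# Hypothetical toy tree, child -> parent; "A" is the root of the subtree.
PARENT = {"B": "A", "C": "A", "D": "B", "E": "B", "F": "B"}


def ancestors_or_self(node, parent=PARENT):
    """Return the node and every node above it, up to the subtree root."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def indicator_nodes(parent=PARENT):
    """Nodes that get an indicator: the root, plus all but one child per parent."""
    children = {}
    for child, par in parent.items():  # dicts preserve insertion order
        children.setdefault(par, []).append(child)
    roots = [p for p in children if p not in parent]
    columns = list(roots)
    for kids in children.values():
        # The last child is the reference level (an arbitrary choice).
        # An only child is therefore dropped, which avoids the perfect
        # collinearity with its parent.
        columns.extend(kids[:-1])
    return columns


def encode(word, columns, parent=PARENT):
    """0/1 row: entry j is 1 when `word` is at or under columns[j]."""
    above = set(ancestors_or_self(word, parent))
    return [int(c in above) for c in columns]


COLUMNS = indicator_nodes()
ROWS = {leaf: encode(leaf, COLUMNS) for leaf in ["D", "E", "F", "C"]}
```

For the toy subtree, `COLUMNS` is `["A", "B", "D", "E"]`, so C and F act as reference levels, and a word under D is coded `[1, 1, 1, 0]`. The rows in `ROWS` would then form the agent-semantics block of the design matrix passed to the penalised regression.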

Both theoretical books/papers and actual case studies of similar problems (not necessarily linguistic) would be much appreciated. Thanks!

*That's a simplification of how WordNet works, of course; it's not a structuralist dictionary of binary features but a database of lexical relations.

See my answer [here](https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values/277302#277302) — kjetil b halvorsen, Mar 16 '19 at 22:35
@kjetilbhalvorsen Thanks! I've put off this project for a bit, but I'll read your answer when I return to it. — WavesWashSands, Mar 18 '19 at 15:52
Does this answer your question? [Preprocess categorical variables with many values](https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values) — brazofuerte, Apr 29 '20 at 21:03
