I ask a group of people for the ages of their 10 best friends and will try to predict some output variable on the basis of those ages. I don't ask them to rank these friends in any way, so each person I ask gives rise to an unordered set of 10 input values, together with an output value. But since a feature vector is intrinsically ordered, I wonder what the best way is to map this unordered set into a feature vector.

I could sort the values, but the fundamental problem remains: nothing intrinsic distinguishes, say, the third column from the fourth column, so treating them as distinct dimensions seems questionable. Moreover, if the input values had been categories rather than numerical values, there wouldn't necessarily be any natural order to sort them by.

I could of course use summary statistics (mean, median, min, max, standard deviation, etc.) instead of the values themselves, but summary statistics inevitably lead to a loss of information.
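To make the summary-statistics option concrete, here is a minimal sketch using only the Python standard library. The function name `summary_features` and the choice of exactly these five statistics are assumptions made for illustration. The point it shows is that the resulting vector does not depend on the order in which the ages are listed.

```python
import statistics

def summary_features(ages):
    """Map an unordered collection of ages to a fixed-length feature vector.

    Every statistic used here ignores the order of the inputs,
    so any permutation of `ages` yields the same vector.
    """
    return [
        statistics.mean(ages),
        statistics.median(ages),
        min(ages),
        max(ages),
        statistics.stdev(ages),  # sample standard deviation, needs at least 2 values
    ]

# Two listings of the same friends' ages in different orders
a = summary_features([30, 25, 41, 19, 33])
b = summary_features([19, 33, 30, 41, 25])
print(a == b)
```

Ten ages are compressed into five numbers here, which is exactly the loss of information I am worried about.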

So my question is: Is there an established way of constructing a feature vector when the input values are naturally unordered?

gung - Reinstate Monica
matthiash
  • Some related questions: https://stats.stackexchange.com/questions/63460/regression-mapping-an-unordered-set-of-tuples-to-a-scalar/335801, https://datascience.stackexchange.com/questions/30918/neural-nets-unordered-sets-of-ordered-tuples-as-features-of-data – kjetil b halvorsen Dec 03 '19 at 02:09
  • You could represent the variables as empirical distributions, then maybe look at https://stats.stackexchange.com/questions/tagged/functional-data-analysis – kjetil b halvorsen Dec 03 '19 at 02:12
  • Could you add (a link to) the data, so we can experiment? – kjetil b halvorsen Dec 03 '19 at 05:00
  • Yes, I think letting the feature dimensions be values or value ranges of an empirical distribution is the way to go. Similar to the bag-of-words approach in NLP. – matthiash Dec 03 '19 at 15:03
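
The empirical-distribution idea from the comments can be sketched roughly as follows, using only the standard library. The bin edges in `AGE_BIN_EDGES`, the function names, and the example vocabulary are all hypothetical choices for illustration, not something taken from the question. Each feature dimension is the count of values falling in one range (or equal to one category), much like a bag-of-words vector.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical age ranges: [0,20), [20,30), [30,40), [40,50), [50,120)
AGE_BIN_EDGES = [0, 20, 30, 40, 50, 120]

def histogram_features(ages, bin_edges=AGE_BIN_EDGES):
    """Count how many ages fall in each half-open bin [edge_i, edge_i+1)."""
    counts = [0] * (len(bin_edges) - 1)
    for age in ages:
        idx = bisect_right(bin_edges, age) - 1
        # Clamp values outside the edges into the first or last bin
        idx = min(max(idx, 0), len(counts) - 1)
        counts[idx] += 1
    return counts

def category_count_features(values, vocabulary):
    """Bag-of-words style counts for categorical inputs with no natural order."""
    tally = Counter(values)
    return [tally[v] for v in vocabulary]

ages = [25, 31, 19, 25, 44, 38, 27, 52, 33, 29]
print(histogram_features(ages))  # [1, 4, 3, 1, 1]
print(category_count_features(["red", "blue", "red"], ["red", "green", "blue"]))  # [2, 0, 1]
```

Both functions return the same vector for any reordering of their input, and the categorical variant needs no sort order at all. How much information is lost depends on how fine the bins are.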

0 Answers