
I want to build a model of human behavior. Basically, it should answer whether a particular user will or will not agree to some action. The feature list is: user_id, interacting_user_id, some_normalized_value, some_enumerated_value, ...

Supposing I use a NN, how should I standardize user_id and interacting_user_id?

This model is meant to be used in a system where the number of possible values for user_id and interacting_user_id will grow over time, possibly to quite large numbers. Is there any better option than creating a separate pair of input-layer neurons for each user?

Reason for putting user ids into features:

Let's suppose that this is about a trading service and we want to predict the probability of a pair of users doing business under certain conditions (date, price, ...).

I don't have much information about any particular user, so it would be hard to get features like age, sex, or others, especially since it's hard to determine how relevant they are to the result. I suspect that there is a large variance between users. I also suspect there are social interactions between users, so specific pairs of users can produce results significantly different from the average due to their particular relationship. Most importantly, I want to somehow combine knowledge about population behavior with knowledge about a specific user's behavior. For example, if nothing is known about a specific user, the model predicts from population knowledge; as soon as some behavior history is recorded, it starts to rely more on user-specific data.
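To make the last point concrete, here is a minimal sketch of the behaviour I have in mind. It is not a neural network, just additive smoothing of a per-user agreement rate towards the population rate; the function name `blended_rate` and the `prior_strength` parameter are hypothetical and only for illustration.

```python
def blended_rate(user_agreed, user_total, population_rate, prior_strength=10.0):
    """Blend a user's observed agreement rate with the population rate.

    prior_strength is an assumed tuning knob: the number of
    pseudo-observations the population rate is worth.
    """
    return (user_agreed + prior_strength * population_rate) / (
        user_total + prior_strength
    )


population_rate = 0.3  # assumed share of "agree" outcomes across all users

# No history: the estimate equals the population rate.
no_history = blended_rate(0, 0, population_rate)

# A short history pulls the estimate only part of the way to the user's own rate.
some_history = blended_rate(4, 5, population_rate)

# A long history is dominated by the user's own behaviour.
long_history = blended_rate(80, 100, population_rate)
```

With no observations the result is the population rate, and it moves towards the user's own rate as the history grows, which is the transition I would like the model to learn on its own.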

piotrpo
  • The user id should play no role in predictions. This is absurd! Features describing a user are what your neural network is trained on. – Arun Jose Oct 04 '16 at 12:13
  • That's not necessarily true, @Arun. If there's information contained in the user id (e.g. location or age of the user), and that information is not yet captured by any features, you could certainly use the user id. – blacksite Oct 04 '16 at 13:46
  • I explained why there is the user id. – piotrpo Oct 04 '16 at 14:37

1 Answer


Basically, you want to use both categorical variables (the ids) and some continuous standardized variables. This similar question gives insight into this issue. But more importantly, why do you want to include the id? Do you expect multiple observations per id? And do you expect the user id to have any predictive value?
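As a rough illustration of mixing the two kinds of input, here is a minimal sketch that one-hot encodes both id columns and standardizes one continuous column. The toy rows and the helper names `one_hot` and `standardize` are assumptions made up for this example, not something taken from the question.

```python
from statistics import mean, pstdev


def one_hot(value, vocabulary):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if v == value else 0.0 for v in vocabulary]


def standardize(column):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0  # avoid dividing by zero for a constant column
    return [(x - mu) / sigma for x in column]


# Hypothetical observations: (user_id, interacting_user_id, price)
rows = [(10, 20, 5.0), (10, 30, 7.0), (20, 30, 9.0)]

# One shared vocabulary of every id seen in either column.
user_ids = sorted({r[0] for r in rows} | {r[1] for r in rows})
prices = standardize([r[2] for r in rows])

# Each input vector: one-hot user, one-hot interacting user, standardized price.
features = [
    one_hot(u, user_ids) + one_hot(v, user_ids) + [p]
    for (u, v, _), p in zip(rows, prices)
]

width = len(features[0])  # 2 * number of known ids + 1 continuous feature
```

Note that the input width is twice the number of known ids plus the continuous features, so it grows with every new user, and an id that was not in `user_ids` at training time maps to an all-zero block.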

Ivo
  • I explained why there is the user id. – piotrpo Oct 04 '16 at 14:37
  • So, do you expect to have multiple observations of the same id? If not, there is no value in having the id in your model. It's a categorical thing and there is no information in the category itself unless the user shows similar behavior in multiple observations. Also, how do you expect your model to behave when a new user (with an id unknown to the model) enters the model? – Ivo Oct 04 '16 at 14:41
  • Yes, I expect to have multiple observations of the same id. However, I'm not sure there will be any observations at all for some specific ids, and it's certain that there will not be enough observations for a specific id to cover "all" situations represented by the rest of the features. – piotrpo Oct 04 '16 at 14:53
  • In that case, check the link in my answer. – Ivo Oct 04 '16 at 14:58
  • Thanks, I just wondered if there is a way to avoid creating a 1M-feature-wide model. – piotrpo Oct 04 '16 at 15:07
  • Not if you want to include the id number, I think. If you have ideas about how groups of people differ in their behaviour, you could try to classify them by it (with some classification model). But that makes use of the same variables as the NN, so I don't see any added value in that. – Ivo Oct 04 '16 at 15:10
  • @piotrpo: that's the same point I raised in the earlier comment. The ID simply acts as a switch. If you want user-id-level information you will end up with an enormous matrix. Not impossible, but extremely impractical and expensive if you have another competing technique that can do without it. – Arun Jose Oct 05 '16 at 05:25
  • @ArunJose Sure, I want to find another approach. The problem is that in most cases the "personalized" data will be poor, so creating a lot of models has no point. On the other hand, there is still high variance, as we all have different habits, likes, dislikes etc., so there is a need to use this data in a person-specific context. – piotrpo Oct 05 '16 at 07:17