
I am working on an anomaly detection application that uses keystroke dynamics.

This is the pool of features that I have at my disposal:

  1. hold time = R(i) - P(i)
  2. key-up to key-down = P(i+1) - R(i)
  3. key-up to key-up = R(i+1) - R(i)
  4. key-down to key-down = P(i+1) - P(i)
  5. key-down to key-up = R(i+1) - P(i)

Where,

  • P(i) is the press time of the current key
  • R(i) is the release time of the current key
  • R(i+1) is the release time of the next key
  • P(i+1) is the press time of the next key
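To make the definitions concrete, here is a minimal sketch of how the five features could be computed from two parallel lists of timestamps. The function name `keystroke_features`, the dictionary keys, and the sample timestamps are hypothetical and only serve as illustration.

```python
from typing import Dict, List, Sequence


def keystroke_features(
    press: Sequence[float], release: Sequence[float]
) -> Dict[str, List[float]]:
    """Compute the five timing features from per-key press/release times."""
    if len(press) != len(release):
        raise ValueError("press and release must have the same length")
    n = len(press)
    return {
        # 1. hold time = R(i) - P(i), defined for every key
        "hold": [release[i] - press[i] for i in range(n)],
        # 2. key-up to key-down = P(i+1) - R(i), defined for every key pair
        "up_down": [press[i + 1] - release[i] for i in range(n - 1)],
        # 3. key-up to key-up = R(i+1) - R(i)
        "up_up": [release[i + 1] - release[i] for i in range(n - 1)],
        # 4. key-down to key-down = P(i+1) - P(i)
        "down_down": [press[i + 1] - press[i] for i in range(n - 1)],
        # 5. key-down to key-up = R(i+1) - P(i)
        "down_up": [release[i + 1] - press[i] for i in range(n - 1)],
    }


# Hypothetical timestamps in milliseconds for three keystrokes.
press_ms = [0, 150, 320]
release_ms = [90, 260, 400]
features = keystroke_features(press_ms, release_ms)
```

Note that the hold time yields one value per key, while the other four features yield one value per pair of consecutive keys.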

I am aware that the "best" features will be the ones with high variance.

What statistical method(s) can I employ for selecting the "best" features?

TMK
  • If you believe that the best features are the ones with the highest variance, you can just calculate their variances and sort them. But I don't understand why you believe this. – mkt Apr 06 '18 at 04:43
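The ranking suggested in the comment above could be sketched as follows, using only the standard library. The helper `rank_by_variance` and the sample values are hypothetical, and the sketch takes the "highest variance is best" premise as given rather than endorsing it.

```python
from statistics import variance
from typing import Dict, List, Sequence, Tuple


def rank_by_variance(
    features: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (feature name, sample variance) pairs, largest variance first."""
    variances = {name: variance(values) for name, values in features.items()}
    return sorted(variances.items(), key=lambda item: item[1], reverse=True)


# Hypothetical feature samples in milliseconds.
samples = {
    "hold": [90, 110, 80, 100],
    "up_down": [60, 60, 65, 55],
    "down_down": [150, 170, 120, 200],
}
ranking = rank_by_variance(samples)
ranked_names = [name for name, _ in ranking]
```

With these made-up numbers, `down_down` ranks first and `up_down` last.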

1 Answer


You could employ a GLM that internally calculates the importance of each feature, for example lasso or ridge. The most common one, as far as I'm aware, is lasso.

You might also want to read this, which explains the procedure in more detail, or this excellent tutorial, which explains both lasso and ridge and how they are used in Python.
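As a rough illustration of how lasso performs selection, here is a minimal pure-Python sketch of lasso fitted by coordinate descent. It assumes something the question does not provide: a numeric target `y` (for example a label distinguishing genuine from impostor typing) and features that are already centered and scaled. The function names, the toy data, and the penalty `alpha = 0.5` are all hypothetical; in practice a library implementation would be the usual choice.

```python
from typing import List, Sequence


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(
    X: Sequence[Sequence[float]],
    y: Sequence[float],
    alpha: float,
    n_iter: int = 100,
) -> List[float]:
    """Minimize (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # sum of squares of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
    return w


# Hypothetical toy data: two centered, scaled features and a target that
# depends strongly on the first feature and only weakly on the second.
X = [[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
y = [2.0 * a + 0.2 * b for a, b in X]
weights = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
```

Features whose coefficient is driven exactly to zero are the ones lasso discards. Ridge, by contrast, shrinks coefficients without setting them to zero, so it ranks features rather than selecting them.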

Djib2011