I am trying to solve the problem of finding anomalies/outliers using event security logs of an individual system. Please find the details below:
Problem Statement: Find anomalies/outliers using event security logs in an unsupervised learning environment. The basic use case is to find any suspicious activity by the user/group that deviates from a trend that the algorithm has learned.
Input Data: The data is derived from the log file and stored as a .csv file, with 600 rows in the training set and 400 rows in the test set. (Both the training set and the test set were collected over a period of one week, i.e. 7 days.)
The top features out of the 30 features are:
- user_name
- domain_name
- logon_type
- event_id
- logon_date_time
- computer
- ip_address
Train Set: This is an unsupervised learning problem: the data is unlabeled, so there is no known "normal" training set for the model to learn from.
Example of anomalies:
- User A suddenly accessing from a different ip_address
- An excess of unusual logon_type values
- Number of accesses for a given (user, key) pair suddenly going up
- Increased access on a generally quiet long weekend
- Increased access on a Thursday (compared to previous Thursdays)
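To make the first example concrete, here is a minimal sketch of a "new ip_address for a known user" check. It assumes each log row has already been read into a dict keyed by the column names above (for instance with csv.DictReader); the function name flag_new_ips and the sample rows are hypothetical and only for illustration.

```python
from collections import defaultdict

def flag_new_ips(train_rows, test_rows):
    """Return test rows whose ip_address was never seen for that user in training."""
    seen = defaultdict(set)
    for row in train_rows:
        seen[row["user_name"]].add(row["ip_address"])

    flagged = []
    for row in test_rows:
        known = seen.get(row["user_name"])
        # Users absent from training are skipped: there is no baseline to compare against.
        if known is not None and row["ip_address"] not in known:
            flagged.append(row)
    return flagged

# Made-up rows, not taken from the real logs.
train = [
    {"user_name": "A", "ip_address": "10.0.0.1"},
    {"user_name": "A", "ip_address": "10.0.0.1"},
    {"user_name": "B", "ip_address": "10.0.0.2"},
]
test = [
    {"user_name": "A", "ip_address": "10.0.0.1"},
    {"user_name": "A", "ip_address": "10.0.0.9"},
    {"user_name": "C", "ip_address": "10.0.0.3"},
]
new_ip_events = flag_new_ips(train, test)
```

This is a plain set-membership rule rather than a learned model, so it only covers the "different ip_address" case; the count-based examples need some notion of an expected rate.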
Based on my research, I think I need time-series novelty detection: a model that uses the date/time feature to learn the average number of logons in each hour of the day and on each day of the week, and then flags activity that deviates from that baseline.
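A rough sketch of the baseline idea, under a few assumptions: logon_date_time has already been parsed into datetime objects (the log's timestamp format is not shown here, so parsing is left out), and a simple z-score on hourly logon counts stands in for whatever novelty detector is eventually used. Because one week of training data contains each (weekday, hour) slot only once, the sketch groups by hour of day only; a day-of-week baseline would need several weeks. The names fit_baseline and score, the threshold, and the min_std floor are all hypothetical choices.

```python
from collections import Counter
from datetime import datetime
from statistics import mean, pstdev

def hourly_counts(timestamps):
    """Count logons per (date, hour) bucket."""
    return Counter((ts.date(), ts.hour) for ts in timestamps)

def fit_baseline(train_timestamps):
    """Mean and population std of logon counts per hour of day across training days."""
    counts = hourly_counts(train_timestamps)
    # Only days with at least one logon are known; fully silent days are not counted.
    days = sorted({ts.date() for ts in train_timestamps})
    baseline = {}
    for hour in range(24):
        per_day = [counts.get((day, hour), 0) for day in days]
        baseline[hour] = (mean(per_day), pstdev(per_day))
    return baseline

def score(test_timestamps, baseline, threshold=3.0, min_std=1.0):
    """Return (date, hour, count, z) for buckets with unusually many logons."""
    flagged = []
    for (day, hour), count in sorted(hourly_counts(test_timestamps).items()):
        mu, sigma = baseline[hour]
        # min_std avoids dividing by zero when an hour had the same count every day.
        z = (count - mu) / max(sigma, min_std)
        if z > threshold:
            flagged.append((day, hour, count, z))
    return flagged

# Made-up data: two logons at 09:00 on each of 7 training days,
# then a test day with the usual 09:00 logons plus a burst at 03:00.
train = [datetime(2024, 1, d, 9, m) for d in range(1, 8) for m in (0, 30)]
test = [datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 9, 30)]
test += [datetime(2024, 1, 8, 3, m) for m in range(10)]

flagged = score(test, fit_baseline(train))
```

With this toy data only the 03:00 burst is flagged. The sketch scores only hours that contain at least one test logon, so it detects spikes but not unusual drops in activity, and it counts all users together; per-user baselines would key the counts by user_name as well.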
Any feedback on algorithms/techniques suitable for the above use cases would be highly appreciated. Thanks.