Anomaly/Outlier detection based on Windows event security logs (logons) using Machine Learning (in Python)

Question

I am trying to solve the problem of finding anomalies/outliers using event security logs of an individual system. Please find the details below:

Problem Statement: Find anomalies/outliers using event security logs in an unsupervised learning environment. The basic use case is to find any suspicious activity by the user/group that deviates from a trend that the algorithm has learned.

Input Data: The data is created from the log file and exported as a .csv file, with 600 rows in the training set and 400 rows in the test set. (Both the training set and the test set were collected over one week, i.e. 7 days.)

The top features out of the 30 features are:

user_name, domain_name, logon_type, event_id, logon_date_time, computer, ip_address

Train Set: This is an unsupervised learning problem, so there is no labeled "normal" training set for the model to learn from.

Example of anomalies:

User A suddenly accessing from a different ip_address
An excess of an unusual logon_type
Number of accesses going up suddenly for a given (user, key) pair
Increased access on a generally quiet long weekend
Increased access on a Thursday (compared to previous Thursdays)

Based on my research, I would need time-series novelty detection, which uses the date/time feature to learn the average number of logons that take place in each hour of the day and on each day of the week.

Any feedback on the algorithm/technique used for the above use cases would be highly appreciated. Thanks.

score 0 · Accepted Answer · answered May 23 '18 at 16:25

It seems that your question is closely related to research I have done over the last few years.

First of all, please have a look at my paper, where I described outlier detection for such data.

Besides that, here are my thoughts on the topic.

You have categorical (textual) data: IP address, username, etc. To process it, you can use one of the following approaches:
- Create custom numeric metrics/features based on your knowledge of the data, for example "number of failed logon events per user per X hours" or "number of failed logon events per user per server per day". You can then more or less directly apply any existing outlier detection method to them, e.g. a k-means-based one. Another example would be kNN-based anomaly detection. That algorithm is now available in RapidMiner, but the implementation is not very efficient and may not work on larger data sets. Of course, such outlier detection is rather basic and can only detect anomalies in the combination of custom features you created. For example, if your custom features are just "number of failed logons per xxx", you will only be able to detect anomalies in the number of failed logon events, but not other anomalies such as an unusually high number of successful logon events.
- Convert all categorical data into numeric form (one-hot encoding, which is also called conversion to the vector space model). An outlier detection algorithm that uses this technique is described in my paper. It is more generic and can detect some anomalies that the data analyst does not expect to see, but it may need more processing power and may be less effective.
- Apply outlier detection that works directly on categorical features, for example k-modes-based outlier detection. However, a k-modes implementation available online has some problems that need to be fixed. This is also a generic approach, but it may likewise have higher computational complexity and lower effectiveness.
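As an illustration of the one-hot route, here is a minimal sketch that encodes the categorical columns with pandas and scores each row by its mean distance to its k nearest neighbours using scikit-learn. This is a generic kNN-distance score, not the algorithm from the paper mentioned above. The function name `knn_outlier_scores` and the default `k=3` are assumptions made for the example; the column names come from the question.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def knn_outlier_scores(df, categorical_cols, k=3):
    """Score each row by its mean distance to its k nearest neighbours.

    Hypothetical helper for illustration; higher score = more unusual row.
    """
    # One-hot encode: every distinct category value becomes a 0/1 column.
    encoded = pd.get_dummies(df[categorical_cols].astype(str)).to_numpy(dtype=float)
    # Ask for k + 1 neighbours because each row is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(encoded)
    distances, _ = nn.kneighbors(encoded)
    # Drop the first column (distance 0 to itself) and average the rest.
    return pd.Series(distances[:, 1:].mean(axis=1), index=df.index)
```

Rows with the highest scores are the candidates to inspect first. Note that the number of one-hot columns grows with the number of distinct users and IP addresses, which is where the extra processing cost comes from.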
Your data set is quite small (1000 records in total), so the outlier detection may suffer from high bias.
If you already know the types of outliers that you want to capture, then there is probably no need for outlier/novelty detection at all. Just capture such cases with signatures or rules, or apply machine-learning-based misuse detection (e.g., label your data first and then train a supervised classifier, such as Random Forest, on the labeled data).
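A rule for a known case can be very short. The sketch below covers the "User A suddenly accessing from a different ip_address" example from the question: it flags logons whose (user, IP) pair never appeared in a reference period. The function name `flag_new_ip_logons` is hypothetical, and treating the training week as the reference period is an assumption.

```python
import pandas as pd


def flag_new_ip_logons(history, new_events):
    """Return the rows of new_events whose (user_name, ip_address) pair
    does not occur anywhere in history. Hypothetical helper for illustration.
    """
    # Every (user, IP) combination seen during the reference period.
    known = set(zip(history["user_name"], history["ip_address"]))
    pairs = zip(new_events["user_name"], new_events["ip_address"])
    # Boolean mask: True where the pair was never seen before.
    mask = [pair not in known for pair in pairs]
    return new_events[mask]
```

With only one week of history, a rule like this will also fire on legitimate but infrequent locations, so the result is better treated as a review list than as an alert.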

All in all, I would suggest that you start by creating custom numeric features. Next, you can apply a basic clustering-based outlier detection algorithm to them and see what you get.
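A minimal sketch of that starting point, assuming the column names from the question: build per-user, per-hour counts, fit k-means on the training week, and score new hours by their distance to the nearest centroid. The helper names, the three features chosen, and `n_clusters=2` are all assumptions made for the example, not a recommended configuration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def hourly_logon_features(df):
    """Count logons, distinct IPs and distinct logon types per user per hour."""
    hour = pd.to_datetime(df["logon_date_time"]).dt.floor("h")
    grouped = df.assign(hour=hour).groupby(["user_name", "hour"])
    return grouped.agg(
        logons=("event_id", "size"),
        distinct_ips=("ip_address", "nunique"),
        distinct_logon_types=("logon_type", "nunique"),
    )


def fit_kmeans(train_features, n_clusters=2, random_state=0):
    """Fit a scaler and a k-means model on the reference (training) features."""
    scaler = StandardScaler().fit(train_features)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    model.fit(scaler.transform(train_features))
    return scaler, model


def outlier_scores(scaler, model, features):
    """Distance of each row to its nearest centroid; larger = more unusual."""
    # KMeans.transform returns the distance to every centroid.
    distances = model.transform(scaler.transform(features)).min(axis=1)
    return pd.Series(distances, index=features.index)
```

The model is fitted on one period and scored on another because k-means tends to give an isolated outlier its own centroid when the outlier is part of the data being clustered, which would leave it with a distance of zero.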

If the results do not satisfy you, you can then try more complicated algorithms.
