Insider Threat Detection using kNN on Session Feature Vectors
Preliminary
Session Feature Vectors
A session is a feature array recording the behaviors performed by a user in a day. A session starts when the user logs in, or when the user performs a first action (a session without a logon); it ends when the user logs out, or at the end of the day (12 AM, midnight). We do not apply a timeout to sessions; instead, we close every session at the end of the day and generate a new session for the next day. Note that if the previous session did not end with a user logoff, the session generated for the next day inherits the features of the previous session and starts at 12 AM. For example, if a user logs in on 3/6 at 6:30 AM and logs out on 3/8 at 6:10 PM, three sessions are generated for the user: 3/6 6:30 AM-12 AM, 3/7 12 AM-12 AM, and 3/8 12 AM-6:10 PM.
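The day-boundary rule can be illustrated with a minimal Java sketch. The class `DailySessionSplitter`, its `Window` type, and the method `split` are hypothetical names introduced for illustration only; they are not taken from the project's source code, and the year used in the example is arbitrary.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: cuts one logon-to-logoff interval into per-day windows. */
final class DailySessionSplitter {

    /** A time window [start, end] that never crosses midnight. */
    static final class Window {
        final LocalDateTime start;
        final LocalDateTime end;

        Window(LocalDateTime start, LocalDateTime end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return start + " to " + end;
        }
    }

    /** Splits the interval at every midnight between logon and logoff. */
    static List<Window> split(LocalDateTime logon, LocalDateTime logoff) {
        if (logoff.isBefore(logon)) {
            throw new IllegalArgumentException("logoff precedes logon");
        }
        List<Window> windows = new ArrayList<>();
        LocalDateTime start = logon;
        while (true) {
            // Midnight that closes the calendar day containing 'start'.
            LocalDateTime nextMidnight = start.toLocalDate().plusDays(1).atStartOfDay();
            if (!logoff.isAfter(nextMidnight)) {
                windows.add(new Window(start, logoff));
                return windows;
            }
            windows.add(new Window(start, nextMidnight));
            start = nextMidnight; // the next session starts at 12 AM
        }
    }

    public static void main(String[] args) {
        // The example from the text: logon on 3/6 at 6:30 AM, logoff on 3/8 at 6:10 PM.
        List<Window> windows = split(
                LocalDateTime.of(2010, 3, 6, 6, 30),
                LocalDateTime.of(2010, 3, 8, 18, 10));
        for (Window w : windows) {
            System.out.println(w);
        }
    }
}
```

The sketch only computes the time windows. Carrying the features of an unfinished session over to the next day, as described above, would be an additional step on top of it.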
k-Nearest Neighbor
<math>F_S(s_i, s_j)</math> is a similarity metric between any two elements <math>s_i</math> and <math>s_j</math>, where <math>s_i, s_j \in \real^d</math> and <math>d</math> is the dimension. A smaller value of <math>F_S</math> means that the two elements are more similar, so <math>F_S</math> plays the role of a distance. Suppose that we have observed <math>\eta</math> elements, <math>{\cal D} = \{s_1, s_2, \cdots, s_{\eta}\}</math>, and we monitor a new element <math>s'</math>. Let <math>d_i = F_S(s_i, s')</math> be the measurement between <math>s_i</math> and <math>s'</math>. Let <math>NN_{\nu}(s', {\cal D})</math> be the <math>\nu</math> nearest neighbors of <math>s'</math> in <math>{\cal D}</math>; namely, the <math>\nu</math> elements in <math>{\cal D}</math> that are most similar to <math>s'</math>. <math>NN_{\nu}(s', {\cal D})</math> can be expressed as follows:
<math>NN_{\nu}(s', {\cal D}) \subseteq {\cal D}, \quad \vert NN_{\nu}(s', {\cal D}) \vert = \nu, \quad F_S(s^*, s') \le F_S(s, s') \;\; \forall s^* \in NN_{\nu}(s', {\cal D}), \; \forall s \in {\cal D} \setminus NN_{\nu}(s', {\cal D})</math>
<math>kNN(s', {\cal D})</math> is the average distance between <math>s'</math> and its <math>k</math> nearest neighbors in <math>{\cal D}</math>, that is, with <math>\nu = k</math>:
<math>kNN(s', {\cal D}) = \frac{1}{k} \sum_{s \in NN_{k}(s', {\cal D})} F_S(s, s').</math>
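As an illustration, the average distance to the k nearest neighbors can be computed as in the following minimal Java sketch. It assumes that <math>F_S</math> is the Euclidean distance, which this page does not state; the class and method names are hypothetical and do not come from the project's source code.

```java
import java.util.Arrays;

/** Hypothetical sketch of the kNN average-distance score. */
final class KnnScore {

    /** Euclidean distance, used here as an assumed choice for F_S. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Average distance between the query and its k nearest elements of data. */
    static double averageKnnDistance(double[] query, double[][] data, int k) {
        if (k < 1 || k > data.length) {
            throw new IllegalArgumentException("k must be in [1, data.length]");
        }
        double[] d = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            d[i] = distance(data[i], query); // d_i = F_S(s_i, s')
        }
        Arrays.sort(d); // the k smallest distances belong to the k nearest neighbors
        double sum = 0.0;
        for (int i = 0; i < k; i++) {
            sum += d[i];
        }
        return sum / k;
    }

    public static void main(String[] args) {
        double[][] observed = {{0, 0}, {3, 4}, {6, 8}, {100, 100}};
        double[] query = {0, 0};
        System.out.println(averageKnnDistance(query, observed, 2));
    }
}
```

Sorting all distances costs O(η log η) per query, which is adequate for a sketch; a bounded heap or a spatial index would be a natural replacement for large session histories.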
Distance-Based Stochastic Estimation
Borne proposed an algorithm for outlier detection based on the distribution of k-nearest-neighbor distances. Given a new data element <math>x</math>, we first calculate <math>F_x \in \real^k</math>, the distances between <math>x</math> and each of its <math>k</math> nearest neighbors <math>NN_k(x, {\cal D})</math>; we also calculate <math>F_k \in \real^{\frac{k(k-1)}{2}}</math>, the pairwise distances among those <math>k</math> nearest neighbors. The cumulative distribution functions <math>C_x</math> and <math>C_k</math> are then calculated for <math>F_x</math> and <math>F_k</math>.
Let <math>Dist(F_x, F_k) \in [0, 1]</math> be the result of a two-sample test on <math>F_x</math> and <math>F_k</math>. We take <math>1-Dist(F_x, F_k)</math> as the anomaly score of the new element <math>x</math>. A two-sample test is a statistical test that compares two samples to decide whether they are drawn from the same probability distribution. The Kolmogorov–Smirnov test, the Mann-Whitney U test, and Student's t-test are popular examples of two-sample tests.
This algorithm compares the two cumulative distributions and measures how similar they are. It considers the distribution of distances among the data elements instead of looking at a single distance value alone. Even if a data element lies far from the other elements, its anomaly score can be low when all the elements are widely dispersed in the space.
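The two distance samples and their comparison can be sketched in Java as follows. The sketch assumes Euclidean distance and stops at the two-sample Kolmogorov–Smirnov statistic, the largest gap between the two empirical cumulative distributions. Converting that statistic into the test result used for the anomaly score is left to a statistics library and is not shown. All class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch: builds F_x and F_k and compares their empirical CDFs. */
final class DistanceDistributionSketch {

    /** Euclidean distance, an assumed choice of distance function. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** F_x: the k distances between x and each of its k nearest neighbors. */
    static double[] distancesToQuery(double[] x, double[][] neighbors) {
        double[] out = new double[neighbors.length];
        for (int i = 0; i < neighbors.length; i++) {
            out[i] = euclidean(x, neighbors[i]);
        }
        return out;
    }

    /** F_k: the k(k-1)/2 pairwise distances among the k nearest neighbors. */
    static double[] pairwiseDistances(double[][] neighbors) {
        int k = neighbors.length;
        double[] out = new double[k * (k - 1) / 2];
        int idx = 0;
        for (int i = 0; i < k; i++) {
            for (int j = i + 1; j < k; j++) {
                out[idx++] = euclidean(neighbors[i], neighbors[j]);
            }
        }
        return out;
    }

    /** Two-sample KS statistic: maximum gap between the two empirical CDFs. */
    static double ksStatistic(double[] a, double[] b) {
        if (a.length == 0 || b.length == 0) {
            throw new IllegalArgumentException("samples must be non-empty");
        }
        double[] x = a.clone();
        double[] y = b.clone();
        Arrays.sort(x);
        Arrays.sort(y);
        int i = 0;
        int j = 0;
        double maxGap = 0.0;
        while (i < x.length && j < y.length) {
            double v = Math.min(x[i], y[j]);
            // Advance both samples past v so that ties are handled together.
            while (i < x.length && x[i] <= v) i++;
            while (j < y.length && y[j] <= v) j++;
            double gap = Math.abs((double) i / x.length - (double) j / y.length);
            maxGap = Math.max(maxGap, gap);
        }
        return maxGap;
    }

    public static void main(String[] args) {
        double[][] neighbors = {{0, 0}, {1, 0}, {0, 1}, {1, 1}};
        double[] x = {10, 10};
        double[] fx = distancesToQuery(x, neighbors);
        double[] fk = pairwiseDistances(neighbors);
        System.out.println(ksStatistic(fx, fk));
    }
}
```

A large statistic indicates that the distances from <math>x</math> to its neighbors are distributed differently from the distances among the neighbors themselves, which is the situation the algorithm treats as anomalous.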
Dataset
r2 Dataset
From our observation, the log files are not sorted by timestamp. We first sort each file by the timestamp of its events. We then partition each file in the r2 dataset by role and by user_id (looked up from LDAP), and sort each per-user or per-role file in increasing order of timestamp.
file name | 1st column | 2nd column | 3rd column | 4th column | 5th column |
---|---|---|---|---|---|
LDAP | employee_name | user_id | role | ||
logon.csv | id | time | user_id | pc | activity(Logon/Logoff) |
device.csv | id | time | user_id | pc | activity(Insert/Delete) |
email.csv | id | date | to | from | |
http.csv | id | time | user_id | pc | url |
Table 1 presents the attribute columns of each file in the r2 dataset. The LDAP files record employee information, with one LDAP file per month. From LDAP we can look up each employee's name, id, email address, and role in the company for each month. "logon.csv" records logon and logoff behavior, from which we get the logon and logoff times of each user on each pc. We use the logon and logoff events in this file as the beginning and the end of a session (sessions are discussed in the next section). "device.csv" records file creation and deletion for each user on each pc. For r2, we do not know which file is created or deleted, nor whether the user performs a copy or a write. The email files, "email.csv" and "email-suplemental.csv", record the sender and receiver of each email. "http.csv" records the URLs visited by each user.
Session Feature Vectors
The following are the features we extract from the r2 dataset:
- session id
- user id
- pc
- the hour when the session begins
- the duration of the session
- number of file creations
- number of file deletions
- number of emails sent out
- number of urls visited
We extract the session id and the hour of the user's logon from "logon.csv", and we keep the logon time in memory. When we observe a "Logoff" for the user on the pc, we calculate the duration of the session. Every session that has not yet been ended by a Logoff is called active. We keep active sessions in a HashMap, with the user-pc pair as the key. Note that each session belongs to one user on one pc: if a user uses multiple pcs at the same time, a separate session is generated for the user's behavior on each pc.
The numbers of file creations and deletions are counted from "device.csv", and the number of URLs visited is counted from "http.csv". Emails sent are handled differently, because the email dataset contains no pc information: we look up all active sessions of the user and attribute the sent emails to the first one.
We generate 375,678 sessions for 1,000 users from the r2 dataset. r2 records logs from January 2010 to May 2011. "ONS0995" is the only malicious insider in the dataset.
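The bookkeeping of active sessions can be sketched in Java as follows. The names `ActiveSessionTracker` and `Session` and the field names are hypothetical. The sketch uses a `LinkedHashMap` rather than a plain `HashMap` so that "the first session" has a defined meaning, taken here to be the earliest-opened active session; that reading is an assumption. Sessions that begin without a logon event are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of active-session bookkeeping keyed by the user-pc pair. */
final class ActiveSessionTracker {

    /** Per-session counters; field names are illustrative only. */
    static final class Session {
        final String userId;
        final String pc;
        final long logonEpochSeconds;
        long durationSeconds = -1; // stays -1 while the session is active
        int emailsSent;
        int urlsVisited;

        Session(String userId, String pc, long logonEpochSeconds) {
            this.userId = userId;
            this.pc = pc;
            this.logonEpochSeconds = logonEpochSeconds;
        }
    }

    // Insertion-ordered, so iteration visits the earliest-opened session first.
    private final Map<String, Session> active = new LinkedHashMap<>();

    private static String key(String userId, String pc) {
        return userId + "|" + pc;
    }

    /** Opens a session for the user on the pc unless one is already active. */
    void logon(String userId, String pc, long epochSeconds) {
        active.putIfAbsent(key(userId, pc), new Session(userId, pc, epochSeconds));
    }

    /** Closes the session and fills in its duration; returns null if none is active. */
    Session logoff(String userId, String pc, long epochSeconds) {
        Session s = active.remove(key(userId, pc));
        if (s != null) {
            s.durationSeconds = epochSeconds - s.logonEpochSeconds;
        }
        return s;
    }

    /** Counts a visited URL on the session of this user on this pc. */
    void urlVisited(String userId, String pc) {
        Session s = active.get(key(userId, pc));
        if (s != null) {
            s.urlsVisited++;
        }
    }

    /** Email events carry no pc, so the first active session of the user is charged. */
    boolean emailSent(String userId) {
        for (Session s : active.values()) {
            if (s.userId.equals(userId)) {
                s.emailsSent++;
                return true;
            }
        }
        return false; // no active session for this user
    }

    public static void main(String[] args) {
        ActiveSessionTracker tracker = new ActiveSessionTracker();
        tracker.logon("u1", "pcA", 100L);
        tracker.emailSent("u1");
        Session closed = tracker.logoff("u1", "pcA", 160L);
        System.out.println(closed.durationSeconds + " s, emails: " + closed.emailsSent);
    }
}
```

File creation and deletion counters would follow the same pattern as `urlVisited`, since "device.csv" also carries the pc.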
r3.1 Dataset
file name | 1st column | 2nd column | 3rd column | 4th column | 5th column | 6th column | 7th column | 8th column |
---|---|---|---|---|---|---|---|---|
LDAP | employee_name | user_id | role | business | functional_unit | department | team | supervisor |
logon.csv | id | date | user_id | pc | activity (Logon/Logoff) | | | |
device.csv | id | date | user_id | pc | activity (connect/disconnect) | | | |
email.csv | id | date | to | from | size | attachment_count | content | |
http.csv | id | date | user_id | pc | url | content | | |
file.csv | id | date | user_id | pc | filename | content | | |
Session Feature Vectors
Software Architecture
Fig. 1 shows the software architecture of our anomaly detection model. Our system absorbs multiple log streams and outputs anomalous sessions. The essential components are described as follows:
- Event Consolidator: This component absorbs multiple log streams. It merges and sorts them by timestamp, and generates a single composite event stream.
- Event Partitioner: This component divides the consolidated stream by category. Possible categories include the user id, the user role, the user group, and so on.
- Stream Segmentation: This component cuts each stream into logical chunks so that we can analyze their normal behavioral patterns. We call each chunk a "Session".
- Streaming Machine Learning and Streaming Anomaly Detector: These components learn the normal pattern and detect anomalies from the session streams.
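The first two components can be illustrated with a minimal batch-style Java sketch. The `Event` type, its fields, and the method names are hypothetical and are not taken from the project's source code. The sketch also loads all events into memory, whereas the system described here operates on streams.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the Event Consolidator and Event Partitioner steps. */
final class EventPipelineSketch {

    /** Minimal event record; the fields are illustrative only. */
    static final class Event {
        final long timestamp;
        final String userId;
        final String source; // e.g. "logon", "http"

        Event(long timestamp, String userId, String source) {
            this.timestamp = timestamp;
            this.userId = userId;
            this.source = source;
        }
    }

    /** Consolidator: merges all log sources and sorts them by timestamp. */
    static List<Event> consolidate(List<List<Event>> sources) {
        List<Event> merged = new ArrayList<>();
        for (List<Event> source : sources) {
            merged.addAll(source);
        }
        // List.sort is stable, so events with equal timestamps keep their source order.
        merged.sort(Comparator.comparingLong((Event e) -> e.timestamp));
        return merged;
    }

    /** Partitioner: splits the consolidated stream by user id, preserving time order. */
    static Map<String, List<Event>> partitionByUser(List<Event> merged) {
        Map<String, List<Event>> byUser = new LinkedHashMap<>();
        for (Event e : merged) {
            byUser.computeIfAbsent(e.userId, k -> new ArrayList<>()).add(e);
        }
        return byUser;
    }

    public static void main(String[] args) {
        List<Event> logon = Arrays.asList(
                new Event(30L, "u1", "logon"), new Event(10L, "u2", "logon"));
        List<Event> http = Arrays.asList(new Event(20L, "u1", "http"));
        Map<String, List<Event>> byUser =
                partitionByUser(consolidate(Arrays.asList(logon, http)));
        for (Map.Entry<String, List<Event>> entry : byUser.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue().size() + " events");
        }
    }
}
```

Partitioning by role or by group would only change the key passed to `computeIfAbsent`. Each per-user list is then ready to be cut into sessions by the segmentation step.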
Learning Method
kNN
We take the anomaly score of a session to be the average distance to its k nearest neighbors. If the anomaly score is greater than a threshold, an anomaly alert is reported.
Distance-Based Stochastic Estimation
Because it is hard to determine the threshold in the previous method, we take advantage of a two-sample test to convert the distance values into a probability. Details can be seen in the section above.
Java Implementation
The project is implemented in Java. The source code is in a GitHub repository [Link]. The essential functions and components are described as follows:
- Event Consolidator and Event Partitioner:
  - "SplitByUserGroup": Absorbs multiple log sources, merges and sorts them, and splits the stream by user id and user role.
- Stream Segmentation:
  - "SessionNode": Stores the events in a session and outputs the feature vector of the session.
  - "CERTLog": An interface describing how event logs are parsed into a Session.
  - "Activity": An Enum describing all the possible activities.
  - "R6_2CERTLog": An example implementation of CERTLog.
- Streaming Machine Learning:
  - "FindKNN": A component that calculates the kNN for a session.
  - "KNNSet": The result produced by FindKNN.
  - "TwoSampleTest": A component that tests whether two populations are similar, implementing KSTest, MWWTest, and t Test.
- Streaming Anomaly Detector:
  - "SessionAnomalyCalculator": Calculates the anomaly score and produces a list of anomalous sessions.