
Insider Threat Dataset



The CERT Division, in partnership with ExactData, LLC, and under sponsorship from DARPA I2O, has generated a collection of synthetic insider threat test datasets. These datasets provide both synthetic background data and data from synthetic malicious actors.

For more background on this data, please see the paper, Bridging the Gap: A Pragmatic Approach to Generating Insider Threat Data.


Datasets are organized according to the data generator release that created them. Most releases include multiple datasets (e.g., r3.1 and r3.2). Generally, later releases include a superset of the data generation functionality of earlier releases. Each dataset file contains a readme file that provides detailed notes about the features of that release.

The answer key file answers.tar.bz2 contains the details of the malicious activity included in each dataset, including descriptions of the scenarios enacted and the identifiers of the synthetic users involved.

Malicious Scenario

The dataset describes five malicious scenarios:

  1. User who did not previously use removable drives or work after hours begins logging in after hours, using a removable drive, and uploading data to wikileaks.org. Leaves the organization shortly thereafter.
  2. User begins surfing job websites and soliciting employment from a competitor. Before leaving the company, they use a thumb drive (at markedly higher rates than their previous activity) to steal data.
  3. System administrator becomes disgruntled. Downloads a keylogger and uses a thumb drive to transfer it to his supervisor's machine. The next day, he uses the collected keylogs to log in as his supervisor and send out an alarming mass email, causing panic in the organization. He leaves the organization immediately.
  4. A user logs into another user's machine and searches for interesting files, emailing them to their home email address. This behavior occurs more and more frequently over a 3-month period.
  5. A member of a group decimated by layoffs uploads documents to Dropbox, planning to use them for personal gain.

Files Inside Datasets

Each dataset stores its logs and supporting information in the following directories and files.


The LDAP directory contains files recording the list of employees for each month. Each file contains four columns: employee_name, user_id, email, and role.
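
Since several scenarios end with the user leaving the organization, one possible use of the monthly employee files is to compare two consecutive months. The sketch below assumes each file has a header row with the four column names listed above; the function name `departed_users` and the sample rows are hypothetical.

```python
import csv
import io

def departed_users(earlier_file, later_file):
    """Return user_ids listed in the earlier month but not in the later one."""
    earlier = {row["user_id"] for row in csv.DictReader(earlier_file)}
    later = {row["user_id"] for row in csv.DictReader(later_file)}
    return earlier - later

# Hypothetical sample data; real names, ids, and roles will differ.
header = "employee_name,user_id,email,role\n"
month_1 = io.StringIO(header
    + "Ann Lee,AL001,ann.lee@example.com,Engineer\n"
    + "Bob Ray,BR002,bob.ray@example.com,Administrator\n")
month_2 = io.StringIO(header
    + "Ann Lee,AL001,ann.lee@example.com,Engineer\n")

print(departed_users(month_1, month_2))
```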


This file records device activity on each PC. It has five columns: id, date, user, pc, and activity. The activity column takes only two values: Insert and Remove.
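
As a rough sketch of how this log might be consumed, the following counts Insert events per user with the standard csv module. The column names come from the description above; the presence of a header row, the function name `count_inserts`, and the sample rows are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

def count_inserts(device_file):
    """Count Insert events per user in a device log that has a header row."""
    counts = Counter()
    for row in csv.DictReader(device_file):
        if row["activity"] == "Insert":
            counts[row["user"]] += 1
    return counts

# Hypothetical sample rows; the date format is also an assumption.
sample = io.StringIO(
    "id,date,user,pc,activity\n"
    "1,01/04/2010 08:00:00,U1,PC-1,Insert\n"
    "2,01/04/2010 08:30:00,U1,PC-1,Remove\n"
    "3,01/05/2010 22:10:00,U2,PC-2,Insert\n"
    "4,01/05/2010 23:00:00,U2,PC-2,Remove\n"
    "5,01/06/2010 09:00:00,U1,PC-1,Insert\n"
)
print(count_inserts(sample))
```

A sudden rise in a user's count relative to their own history is the kind of signal the first two scenarios describe.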

email.csv & email-supplemental.csv

These files record the email transactions between employees. The columns include id, date, to, and from. Note that the email addresses in to and from may not appear in LDAP; such addresses may belong to people in other branches.
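
The note about addresses missing from LDAP can be checked mechanically. The following is a minimal sketch, assuming a header row and assuming that a field holding several addresses separates them with semicolons; the function name `addresses_not_in_ldap` and the sample data are hypothetical.

```python
import csv
import io

def addresses_not_in_ldap(email_file, ldap_emails, sep=";"):
    """Return addresses from the to/from columns that are absent from LDAP."""
    known = {e.strip().lower() for e in ldap_emails}
    unknown = set()
    for row in csv.DictReader(email_file):
        for field in ("to", "from"):
            for addr in row[field].split(sep):
                addr = addr.strip().lower()
                if addr and addr not in known:
                    unknown.add(addr)
    return unknown

# Hypothetical sample rows and LDAP address list.
sample = io.StringIO(
    "id,date,to,from\n"
    "1,01/04/2010 09:00:00,ann@example.com;eve@other.org,bob@example.com\n"
    "2,01/04/2010 09:05:00,bob@example.com,ann@example.com\n"
)
print(addresses_not_in_ldap(sample, ["ann@example.com", "bob@example.com"]))
```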


This file records the URLs visited by each employee. It has five columns: id, date, user, pc, and url.
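
Two of the scenarios involve uploads to specific sites (wikileaks.org and Dropbox), so a simple first pass over this log is to filter visits by domain. The sketch below assumes a header row and URLs that include a scheme such as `http://`; the function name `visits_to_domain` and the sample rows are hypothetical.

```python
import csv
import io
from urllib.parse import urlparse

def visits_to_domain(http_file, domain):
    """Return (user, date) pairs for visits to the domain or its subdomains."""
    hits = []
    for row in csv.DictReader(http_file):
        # hostname is None when the URL has no scheme, hence the fallback.
        host = urlparse(row["url"]).hostname or ""
        if host == domain or host.endswith("." + domain):
            hits.append((row["user"], row["date"]))
    return hits

# Hypothetical sample rows.
sample = io.StringIO(
    "id,date,user,pc,url\n"
    "1,01/04/2010 22:15:00,U1,PC-1,http://wikileaks.org/upload\n"
    "2,01/04/2010 10:00:00,U2,PC-2,http://www.example.com/news\n"
)
print(visits_to_domain(sample, "wikileaks.org"))
```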


This file records the logon and logoff activities of each employee. It has five columns: id, date, user, pc, and activity.
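
The first scenario hinges on after-hours logins, which can be sketched as a filter over this log. Everything beyond the five column names is an assumption here: the header row, the `MM/DD/YYYY HH:MM:SS` date format, the activity value `Logon`, and the 08:00 to 18:00 working window. The function name `after_hours_logons` is hypothetical.

```python
import csv
import io
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y %H:%M:%S"  # assumed format, adjust to the actual data

def after_hours_logons(logon_file, start_hour=8, end_hour=18):
    """Return (user, pc, date) for logons outside [start_hour, end_hour)."""
    hits = []
    for row in csv.DictReader(logon_file):
        if row["activity"] != "Logon":
            continue
        hour = datetime.strptime(row["date"], DATE_FORMAT).hour
        if hour < start_hour or hour >= end_hour:
            hits.append((row["user"], row["pc"], row["date"]))
    return hits

# Hypothetical sample rows.
sample = io.StringIO(
    "id,date,user,pc,activity\n"
    "1,01/04/2010 08:05:00,U1,PC-1,Logon\n"
    "2,01/04/2010 17:30:00,U1,PC-1,Logoff\n"
    "3,01/04/2010 23:10:00,U2,PC-2,Logon\n"
)
print(after_hours_logons(sample))
```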


A session is a vector of features that records the activity of one user on one PC. The same user on a different PC produces a separate session. Each session includes the following features: login-id, user-id, pc-id, logon hour (0-24), duration in minutes (0 to 60*24), number of files created, number of files deleted, number of emails sent, number of URLs visited, and a boolean value indicating whether the user logged on for the session.
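
To make the shape of a session concrete, here is one possible representation as a Python dataclass, with a helper that fills in the logon hour and duration from a logon/logoff pair. The page does not say how sessions are actually built, so the field names, the date format, and the helper `session_from_events` are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y %H:%M:%S"  # assumed format

@dataclass
class Session:
    login_id: str
    user_id: str
    pc_id: str
    logon_hour: int            # 0-24
    duration_minutes: float    # 0 to 60*24
    files_created: int = 0
    files_deleted: int = 0
    emails_sent: int = 0
    urls_visited: int = 0
    has_logon: bool = True     # whether the user logged on for the session

def session_from_events(login_id, user_id, pc_id, logon_date, logoff_date):
    """Build a Session from one logon timestamp and its matching logoff."""
    start = datetime.strptime(logon_date, DATE_FORMAT)
    end = datetime.strptime(logoff_date, DATE_FORMAT)
    minutes = (end - start).total_seconds() / 60
    return Session(login_id, user_id, pc_id, start.hour, minutes)

# Hypothetical identifiers and timestamps.
s = session_from_events("L1", "U1", "PC-1",
                        "01/04/2010 08:15:00", "01/04/2010 17:45:00")
print(s)
```

The activity counts default to zero and would be filled in by counting the file, email, and URL records that fall between the two timestamps for the same user and PC.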


Each numerical feature in a session is normalized by dividing it by the maximum value of that feature. The maximum is computed separately for each user.
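
A minimal sketch of this per-user normalization follows, with sessions represented as plain dictionaries. The key names, the function name `normalize_per_user`, and the choice to map a feature to 0.0 when a user's maximum is zero are assumptions; the page does not specify how that edge case is handled.

```python
from collections import defaultdict

def normalize_per_user(sessions, features):
    """Divide each feature by the maximum that the same user reaches for it."""
    # First pass: find each user's maximum for every feature.
    maxima = defaultdict(lambda: defaultdict(float))
    for s in sessions:
        for f in features:
            maxima[s["user"]][f] = max(maxima[s["user"]][f], s[f])
    # Second pass: scale a copy of each session by those maxima.
    normalized = []
    for s in sessions:
        out = dict(s)
        for f in features:
            m = maxima[s["user"]][f]
            out[f] = s[f] / m if m else 0.0  # assumed handling of a zero maximum
        normalized.append(out)
    return normalized

# Hypothetical sessions.
sessions = [
    {"user": "U1", "urls_visited": 10, "emails_sent": 4},
    {"user": "U1", "urls_visited": 5, "emails_sent": 2},
    {"user": "U2", "urls_visited": 2, "emails_sent": 0},
]
for row in normalize_per_user(sessions, ["urls_visited", "emails_sent"]):
    print(row)
```

Because the maximum is taken per user, a value of 1.0 means "this user's busiest session" rather than the busiest session in the whole dataset.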