Data Mining Tasks Description

Task 1: SADAVS-Sensor Array Data for Autonomous Vehicle Safety

Please note that there are TWO scenarios in this task.

Vehicle-based accident detection systems monitor a network of sensors to determine whether an accident has occurred. Instances of high acceleration or deceleration are due to a large change in velocity over a very short period of time. In the context of autonomous cars, such speeds are hard to attain since the vehicle is not controlled by a human driver. The presented data, originally captured in New Zealand, is a collection of sensor array (160x144) values that monitor the status of a moving vehicle. The objective of this competition task is the early detection of potential road accidents in two different scenarios.

The statistical information of this dataset is summarized as:

Scenario A:

No. of Classes    No. of Training    No. of Testing
2                 973                416

Scenario B:

No. of Classes    No. of Training    No. of Testing
2                 973                850

Note that

The captured sensor array data is a matrix of floating-point numbers with 144 rows and 160 columns;
The data is stored in “.snr” files, which can be read as plain text. The file name represents the time point of data capture;
The data structure of a training file is: training data = header + data
header = "# x,", where x = 1 (safe) or -1 (not safe)

The data structure of a testing file is: testing data = data (testing data has no class label and therefore no header)
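
The file layout described above can be read with a few lines of Python. The following is a minimal sketch, not the organisers' loader: the function name parse_snr is hypothetical, and it assumes that the values are separated by commas and/or whitespace and are stored row by row, neither of which is stated in the task description.

```python
import re

ROWS, COLS = 144, 160  # matrix size stated in the task description


def parse_snr(text, has_header=True, rows=ROWS, cols=COLS):
    """Parse the text of a .snr file into (label, matrix).

    label is 1 (safe), -1 (not safe), or None for testing files.
    Assumption: values are separated by commas and/or whitespace
    and stored in row-major order.
    """
    label = None
    text = text.strip()
    if has_header:
        # Training files start with a header of the form "# x,".
        match = re.match(r"#\s*(-?1)\s*,", text)
        if match is None:
            raise ValueError("expected a '# x,' header with x = 1 or -1")
        label = int(match.group(1))
        text = text[match.end():]
    values = [float(tok) for tok in re.split(r"[,\s]+", text) if tok]
    if len(values) != rows * cols:
        raise ValueError(
            f"expected {rows * cols} values, found {len(values)}"
        )
    # Cut the flat list of values into rows of `cols` values each.
    matrix = [values[r * cols:(r + 1) * cols] for r in range(rows)]
    return label, matrix
```

For a training file, call parse_snr(open(path).read()); for a testing file, pass has_header=False so that no label is expected.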


Reference: Shaoning Pang and Brook Huang, sensor array data for autonomous vehicle incident detection, the 10th International Cybersecurity Data Mining Competition (CDMC2019), June 2019

Task 2: IoT malware classification

The aim of this task is to classify IoT malware. The features provided for the classification are the sequences of system calls captured while the malware runs in a sandbox environment.

The dataset contains two parts:

TRAINING: 4167 formatted sequences of system calls, labeled by the type of the malware. 

TESTING: 4275 files without known class labels.

NOTE the following difference between the training and test sets. For the training set, the label of each sample (see the details of a sample file below) is provided in the label file, whilst the TEST.label file used for competition evaluation is reserved for future use.

This dataset consists of 8442 samples generated following the procedure below. 

First, a collection of potentially malicious Linux programs in ELF format is gathered from various sources.

Then, each of these programs is executed in a sandboxed environment hosted by an emulator that provides the required runtime environment. During runtime, the strace command is used to monitor and record the interactions between the processes started by the program and the Linux kernel. This yields a log file that contains lines of system calls. On each line, strace records the timestamp, the invoked system call, and the parameters and result of the call.

These log files are parsed and reformatted into the simplified format used by the .seq files. The name of a .seq file indicates the index of the sample (i.e., a malicious program) in the dataset.

A .seq file may contain multiple lines, with each line representing the sequence of system calls invoked by a particular process started by the malware. The system calls in each line are presented in ascending order of call time. The processes are presented in ascending order of creation time.
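
The .seq layout described above (one process per line, calls in time order) could be loaded as follows. This is a minimal sketch under stated assumptions: the function names parse_seq and call_counts are hypothetical, and the separator between system calls on a line is assumed to be commas and/or whitespace, which the description does not specify.

```python
import re
from collections import Counter


def parse_seq(text):
    """Return one list of system-call names per process.

    Each non-empty line is treated as one process; the order of
    lines and of calls within a line is preserved.
    Assumption: calls are separated by commas and/or whitespace.
    """
    processes = []
    for line in text.splitlines():
        calls = [tok for tok in re.split(r"[,\s]+", line.strip()) if tok]
        if calls:
            processes.append(calls)
    return processes


def call_counts(processes):
    """Count how often each system call occurs across all processes."""
    return Counter(call for calls in processes for call in calls)
```

The counts from call_counts are one simple order-free feature vector; models that use the ordering can work directly on the per-process lists returned by parse_seq.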


All the .seq files used for training a prediction model can be found in the "TRAIN" folder. 

All the .seq files used for evaluating a prediction model can be found in the "TEST" folder. 

Along with the .seq files, a TRAIN.label file is provided in the following format. 

The TRAIN.label file is a comma-separated values file. The first column is the index of the sample in the training set, and the second column is the class label of the corresponding sample. For instance, "1111,5" indicates that sample number 1111 (i.e., file 1111.seq) in the TRAIN folder belongs to class 5. We withhold the lexical meaning of the labels for reasons of fairness. 
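
The label file can be read with Python's standard csv module. The sketch below is illustrative only: the function names parse_labels and seq_filename are hypothetical, and it assumes the file has no header row. Indices and labels are kept as strings so that no assumption is made about their numeric range.

```python
import csv
import io


def parse_labels(text):
    """Map sample index -> class label from the text of TRAIN.label.

    Assumption: no header row; rows with fewer than two columns
    (such as blank lines) are skipped.
    """
    labels = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue
        labels[row[0].strip()] = row[1].strip()
    return labels


def seq_filename(index):
    """File name of the sample with the given index, e.g. 1111.seq."""
    return f"{index}.seq"
```

With the example from the text, parse_labels("1111,5") gives a dictionary that maps "1111" to "5", and seq_filename("1111") names the matching file in the TRAIN folder.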

Credit: This IoT malware classification 2019 dataset is provided by Taiwan Information Security Center (TWISC).