Task 1: KISTI+IDS2021-CDMC: Network Intrusion Detection

Over the last decade, many research studies have been carried out to construct a robust Intrusion Detection System (IDS). Even though more advanced network attacks have emerged alongside developing communication technologies, old-fashioned datasets (KDD '98, '99) are still used for IDS research. KISTI-IDS-2021 is a collection of real-world network IDS alerts that contains 25 types of attacks together with payload information. The dataset has 153,829 IDS alerts, each labeled as malicious (label: 1) or benign (label: 2). In addition, every word in the payload is embedded as a 100-dimensional word vector. The purpose of the dataset is to encourage IDS research in real-world environments that takes payload contents into account. It is divided into a training set (80%) and a test set (20%), stored in a four-column, TAB-separated structure (idx, category, wordvector(payload), label).
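As a rough illustration of the four-column, TAB-separated layout, the following sketch reads such a file line by line. The names `Alert` and `read_ids_alerts` and the sample rows are hypothetical, and the sketch assumes there is no header row. The internal encoding of the wordvector(payload) column is not specified here, so it is kept as a raw string.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Alert(NamedTuple):
    idx: str
    category: str
    wordvector: str  # raw text; the column's internal encoding is an open assumption
    label: int       # 1 = malicious, 2 = benign, as stated in the task description


def read_ids_alerts(handle: TextIO) -> Iterator[Alert]:
    """Yield one Alert per TAB-separated line (idx, category, wordvector, label)."""
    for line_no, line in enumerate(handle, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns, got {len(fields)}")
        idx, category, wordvector, label = fields
        yield Alert(idx, category, wordvector, int(label))


# Made-up rows that only exercise the parser; they are not taken from the dataset.
sample = io.StringIO("0\tcatA\t0.1 0.2 0.3\t1\n1\tcatB\t0.4 0.5 0.6\t2\n")
alerts = list(read_ids_alerts(sample))
malicious = [a for a in alerts if a.label == 1]
print(len(alerts), len(malicious))
```

Splitting on TAB directly, rather than using a CSV reader, avoids field-size limits on very long payload columns.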

Note: Please log in to download data.

Task 2: Maldataset 2021: Maldataset as PNG images

Maldataset2021 is a malware dataset that consists of 28 classes of malware, in which each class represents a malware family and each sample is an RGB 224x224 PNG file. The PNG files are transformed from the original binary malware files. The motivation for the image transformation is to identify malware from the raw bytes of entire executable files (i.e., as images), so that deep learning techniques such as CNNs, which have demonstrated outstanding capability in image classification, can be applied to malware classification. With this in mind, we provide a new dataset that contains the latest malware samples. The PNG files are split into 70% for training and the remaining 30% for testing.

Malware Class    Number of Samples
                 Training Set    Testing Set
Agent            350             120
Agenttesla       85              35
Androm           350             147
Andromeda        85              35
Autorun          350             147
Autorun.k        80              25
Azorult          35              10
Cerber           70              30
Darkcomet        45              25
Dridex           30              15
Dyre             41              18
Emotet           68              26
Grandcrab        73              21
Hawkeye          70              21
Heyodo           69              30
IceID            69              20
Limerat          10              6
Loki             138             40
Nanocore         157             42
Neshta           350             147
Nymaim           73              30
QuasarRat        92              35
Regrun           350             147
Remcosrat        155             70
Robot!gen        140             18
Sality           350             147
Shifu            31              12
Trickbot         141             69
Total Number     3604            1365

Because each PNG image is three-dimensional (height, width, and RGB channels), the data were saved both as a NumPy .npy file (RGB) and as a .csv file (grayscale). You can use either or both file types for classification.
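The array shapes involved can be sketched as follows, using synthetic data in place of the real files. Three things here are assumptions made only for illustration: the (N, 224, 224, 3) layout of the .npy array, the one-flattened-row-per-image layout of the grayscale data, and the luminance weights in the hypothetical helper `to_grayscale`. The conversion actually used to produce the .csv file is not stated.

```python
import io

import numpy as np

# Synthetic stand-in for the RGB data: 2 images, 224x224 pixels, 3 channels.
rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(2, 224, 224, 3), dtype=np.uint8)

# Round-trip through the .npy format (an in-memory buffer replaces a real file path).
buf = io.BytesIO()
np.save(buf, rgb)
buf.seek(0)
loaded = np.load(buf)


def to_grayscale(images: np.ndarray) -> np.ndarray:
    """Convert (N, H, W, 3) RGB images to (N, H, W) grayscale.

    Uses ITU-R BT.601 luminance weights, which is an assumption here.
    """
    weights = np.array([0.299, 0.587, 0.114])
    return (images.astype(np.float64) @ weights).round().astype(np.uint8)


gray = to_grayscale(loaded)
# One flattened row per image, as a CSV file might store it (assumed layout).
gray_rows = gray.reshape(len(gray), -1)
print(loaded.shape, gray.shape, gray_rows.shape)
```

The RGB array suits convolutional models that expect three channels, while the flattened grayscale rows are convenient for classical classifiers that take one feature vector per sample.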



Chaalan Tarek, Maldataset 2021: Maldataset as PNG images, June 2021

Note: Please log in to download data.

Task 3: CDMC2021 IoT Malware Detection

Based on the control flow graphs (CFGs) generated by a static-analysis tool, Radare2, and labels indicating whether the samples are malware programs, the participants are required to perform an IoT malware detection task, that is, to predict whether the samples in the test set are malware or not. The dataset consists of 54,829 samples, which are generated by the following procedure: (1) a collection of malicious and benign Linux programs in ELF format was gathered from various sources; (2) each of these programs is fed to Radare2 to extract the CFG information; and (3) the JSON output from Radare2, which can be interpreted as a list of directed-graph components, is then reformulated as a single line in a text file. Please see the “File Format” section for more detail.

The labels (1: malware, 0: benignware) of the ELF files are determined by state-of-the-art anti-virus engines.

List of Files

The CDMC2021_IoTMalware_Train.data file contains feature information of 16,521 files in the training set.
The CDMC2021_IoTMalware_Train.label file contains label information of 16,521 files in the training set.
The CDMC2021_IoTMalware_Test.data file contains information of 38,550 files in the testing set.
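
A minimal sketch of pairing the training features with their labels is given below. It assumes that the .label file holds one integer label per line, in the same order as the lines of the .data file. That layout is an assumption, not something stated in this description. The helper `pair_samples` and the in-memory sample contents are hypothetical.

```python
import io
from itertools import zip_longest
from typing import Iterator, TextIO, Tuple


def pair_samples(data: TextIO, labels: TextIO) -> Iterator[Tuple[str, int]]:
    """Yield (feature_line, label) pairs, assuming line i of each file describes sample i."""
    for features, label in zip_longest(data, labels):
        if features is None or label is None:
            raise ValueError("the .data and .label inputs have different line counts")
        yield features.rstrip("\n"), int(label.strip())


# Made-up contents standing in for the real training files.
data_file = io.StringIO('"sym.a" "sym.b";"sym.c"\n"sym.d"\n')
label_file = io.StringIO("1\n0\n")
pairs = list(pair_samples(data_file, label_file))
print(len(pairs), pairs[0][1], pairs[1][1])
```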

File Format

Steps to formulate the features.

  1. Radare2 outputs its analysis result for an ELF sample program as a JSON object that looks like the following.

[{"name": "sym.__uClibc_main", "imports": ["sym.memset", "sym.__GI_memcpy", "sym._dl_aux_init", "sym.__uClibc_init"]}, {"name": "sym._fp_out_narrow", "imports": ["sym.__GI_strlen", "sym._charpad", "sym.__stdio_fwrite"]}, …]

  2. Then, each node in the list is represented as a list of function calls, with the “name” field placed first, followed by the function calls in the “imports” field. The components in the list are separated by white space. The JSON object above is changed to a list of nodes as follows.

Node 1: "sym.__uClibc_main" "sym.memset" "sym.__GI_memcpy" "sym._dl_aux_init" "sym.__uClibc_init"
Node 2: "sym._fp_out_narrow" "sym.__GI_strlen" "sym._charpad" "sym.__stdio_fwrite"
Nodes 3~: …

  3. All nodes in the JSON list are sequentially joined by semicolons to form a single line in a .data file. Each line in the .data file therefore corresponds to a single file in the dataset.

Line 1: "sym.__uClibc_main" "sym.memset" "sym.__GI_memcpy" "sym._dl_aux_init" "sym.__uClibc_init";"sym._fp_out_narrow" "sym.__GI_strlen" "sym._charpad" "sym.__stdio_fwrite";…
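
The steps above can be sketched in a few lines of Python. This is only an illustration of the described format, not the committee's actual processing script. The function names `nodes_to_line` and `line_to_nodes` are hypothetical, and the sketch assumes that function names never contain double quotes or semicolons.

```python
import json
import re
from typing import List


def nodes_to_line(nodes: List[dict]) -> str:
    """Turn a Radare2-style JSON list into one .data line.

    Each node becomes its quoted "name" followed by its quoted "imports",
    separated by spaces; nodes are joined by semicolons.
    """
    parts = []
    for node in nodes:
        calls = [node["name"], *node.get("imports", [])]
        parts.append(" ".join(f'"{call.strip()}"' for call in calls))
    return ";".join(parts)


def line_to_nodes(line: str) -> List[List[str]]:
    """Parse one .data line back into a list of nodes (lists of function names)."""
    chunks = (chunk for chunk in line.strip().split(";") if chunk.strip())
    return [re.findall(r'"([^"]*)"', chunk) for chunk in chunks]


# The two nodes shown in the example above (the elided remainder is left out).
radare2_json = (
    '[{"name": "sym.__uClibc_main", "imports": ["sym.memset", "sym.__GI_memcpy", '
    '"sym._dl_aux_init", "sym.__uClibc_init"]}, '
    '{"name": "sym._fp_out_narrow", "imports": ["sym.__GI_strlen", "sym._charpad", '
    '"sym.__stdio_fwrite"]}]'
)
line = nodes_to_line(json.loads(radare2_json))
nodes = line_to_nodes(line)
print(line)
print(nodes[1][0], len(nodes[0]) - 1)
```

Once parsed, the first entry of each node is the calling function and the remaining entries are its callees, so every node yields the outgoing edges of one vertex in the directed graph.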


Participants are required to predict the labels of the test samples based on the information provided in the task.

Credits and Clarifications

The original analysis results for the IoT malware classification task were kindly contributed by the Taiwan Information Security Center (TWISC). The data were processed by the CDMC 2021 committee, with all sensitive information removed.

Note: Please log in to download data.
