Malicious software datasets

Welcome to the CSDMC2010 API sequence corpus, which is one of the datasets for the data mining competition associated with ICONIP 2010.

This dataset is composed of a selection of Windows API/system-call trace files, intended for testing classifiers that operate on sequences.

Pertinent points

- A subset of the Windows API/system calls considered informative for distinguishing malware from benign software is logged by API monitors while a designated program runs on the system.
- For simplicity, only the names of the APIs are recorded in the log file -- without noting the calling process.
- For completeness, repeated calls to the same API are all recorded, which may result in some redundancy in the log.
- Malware samples are labeled by state-of-the-art anti-virus software. Although there are subcategories such as worm, trojan, and virus, all of these malicious software types are grouped as 'malware' and assigned the label '1'. The remaining benign programs are assigned the label '0'.
- The order of the system calls is preserved as well as possible. I cannot say that this order is perfectly preserved; in particular, calls may be mixed up because of multi-threading. Another reason the order may not be kept is that the data collection and processing were done by multiple parties, with different software tools. In any case, efforts have been made to keep the ordering of the sequences from being misleading: the original order of the records in the log is kept when time information is not available, and the sequence is sorted by timestamp when possible.
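
Because repeated calls are all recorded, some users may want to collapse runs of identical consecutive API names before building features. The following is a minimal sketch of one way to do that; the function names are hypothetical, and the API names in the usage note are illustrative only and are not taken from the corpus.

```python
from itertools import groupby


def collapse_repeats(calls):
    """Collapse each run of identical consecutive API names into one entry."""
    return [name for name, _ in groupby(calls)]


def run_lengths(calls):
    """Return (API name, run length) pairs, keeping the repetition counts."""
    return [(name, sum(1 for _ in run)) for name, run in groupby(calls)]
```

For example, collapse_repeats(["LoadLibraryA", "LoadLibraryA", "GetProcAddress"]) keeps one LoadLibraryA followed by GetProcAddress. Only adjacent duplicates are merged, so the overall order of the trace is left untouched. Whether collapsing helps or hurts a given classifier is an open question; run_lengths keeps the counts in case the repetition itself is informative.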

The corpus file -- CSDMC2010_API.tar.bz2

On Linux platforms, it can be extracted with the command tar -xjf CSDMC2010_API.tar.bz2 -C email/ (the target directory given to -C, here email/, must already exist).

In an MS Windows environment, use the bzip2 software to decompress it.
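
As a platform-independent alternative, the archive can also be unpacked with Python's standard tarfile module. This is a minimal sketch, not part of the original distribution; the function name extract_corpus and the destination directory are assumptions chosen for illustration.

```python
import tarfile
from pathlib import Path


def extract_corpus(archive_path, dest_dir):
    """Extract a .tar.bz2 archive into dest_dir and list what was unpacked."""
    dest = Path(dest_dir)
    # Unlike `tar -C`, create the destination directory if it is missing.
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:bz2") as archive:
        archive.extractall(dest)
    # Return the names of the top-level entries now present in dest_dir.
    return sorted(p.name for p in dest.iterdir())
```

Calling extract_corpus("CSDMC2010_API.tar.bz2", "csdmc2010_api") would unpack the corpus into a csdmc2010_api directory (a hypothetical name) and return the names of the extracted top-level entries.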

The corpus description

The dataset contains two parts:

- TRAINING: 388 traces, of which 320 are malware traces labeled '1' and 68 are benign software traces labeled '0'.
  Each line in the data file corresponds to the trace of one designated program. The label is given at the beginning of the line, with a comma separating the label from the trace.
- TESTING: 378 traces with unknown labels; all of them are labeled '0' in the file.
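
The line format described above (label, comma, trace) can be read with a few lines of Python. This is a sketch under stated assumptions: the description does not say how API names are delimited within a trace, so whitespace separation is assumed here, and the function names are hypothetical.

```python
from collections import Counter


def parse_line(line):
    """Split one data line into (label, list of API names).

    Assumes the API names after the first comma are whitespace-separated.
    """
    label_text, _, trace_text = line.strip().partition(",")
    return int(label_text), trace_text.split()


def load_traces(path):
    """Read every non-empty line of a data file as a (label, calls) pair."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [parse_line(line) for line in handle if line.strip()]


def label_counts(records):
    """Count how many traces carry each label."""
    return Counter(label for label, _ in records)
```

If the assumptions hold, label_counts(load_traces(path)) on the training file should report 320 traces with label 1 and 68 with label 0, which gives a quick sanity check on the parsing. On the testing file every label will read 0, so those values should not be treated as ground truth.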

Please direct any questions regarding this dataset to