

Data Mining Tasks Description

Task 1: Android Malware Classification based on API information

This dataset is created from a set of APK (application package) files collected from the Opera Mobile Store over the period of January to September of 2014. Just like Windows (PC) systems using an .exe file for installing software, Android uses APK files for installing software on the Android operating system.

The permission system is applied as a measure to restrict access to privileged system resources and is considered as the first barrier to malware. Application developers have to explicitly declare the permissions in the AndroidManifest.xml file contained in the APK.

All official Android permissions are categorized into four types: Normal, Dangerous, Signature and SignatureOrSystem. Because dangerous permissions grant access to restricted resources and can have a negative impact if used incorrectly, they require the user's approval at installation.

A set of APIs is invoked during the runtime of the application. Each API is associated with a particular permission. When an API call is made, the system checks whether its associated permission has been approved, and the call succeeds only if the user has granted that permission. In this way, permissions help protect the user's private information from unauthorized access. The API calls of an Android application appear in its smali files, which can be obtained with reverse engineering tools such as apktool. The number of critical APIs ranges from a few hundred to thousands.

To serve as input to a machine-learning algorithm, permissions/APIs are commonly coded as binary variables, i.e., each element of the feature vector takes one of two values: 1 if the permission is requested (or the API is used) and 0 otherwise. The number of all possible Android permissions/APIs varies with the version of the OS. In this task, for each APK file under consideration, we provide a list of APIs obtained by reverse engineering the APK file. The class label of the APK file (+1 if it is regarded as malicious and -1 otherwise) is determined by the detection results of security appliances hosted by VirusTotal. Note that adware was not counted as malware in our setting. The participants of the CDMC 2017 competition are invited to design a classifier that best matches this result.
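The binary coding described above can be sketched as follows. This is only an illustration: the function name encode_binary is hypothetical, and the sketch assumes that API numbers start at 1, which the description does not state.

```python
def encode_binary(api_ids, num_features):
    """Turn a list of API numbers into a 0/1 feature vector.

    Assumption: API numbers are 1-based, so API number k maps to
    position k - 1 of the vector.
    """
    vector = [0] * num_features
    for api_id in api_ids:
        if not 1 <= api_id <= num_features:
            raise ValueError(f"API number {api_id} is out of range")
        vector[api_id - 1] = 1
    return vector


# Toy example with 8 features instead of the full API set.
print(encode_binary([2, 5, 6], 8))  # [0, 1, 0, 0, 1, 1, 0, 0]
```

With tens of thousands of features and only a small number of APIs per file, a sparse representation would normally be preferable to a dense list like this one.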

The dataset consists of API information for 61,730 APK files. The first half (30,897 files) is provided with class labels as training data, and the remaining 30,833 files are used for testing. The feature set comprises 37,107 unique APIs, including both official Android APIs and third-party ones.

Detailed information of the files is listed below:

1) Data file for training: CDMC2017AndroidAPITrainData.csv, 3.4M

  • Each line corresponds to an APK file in the training set.
  • Each line lists, in ascending order, the numbers of the unique APIs parsed from the APK file.

2) Label file for training: CDMC2017AndroidAPITrainLabel.csv, 84K

  • Each line contains a number giving the class label of the corresponding line of the data file.
  • +1 stands for a malware and -1 for a benign file.

3) Data file for testing: CDMC2017AndroidAPITestData.csv, 3.4M

  • Each line corresponds to an APK file in the test set. Class label information is not provided.

4) List of all 37,107 API names: CDMC2017AndroidAPIList.csv

  • The number before each API name corresponds to the feature number used in data files 1) and 3).

NOTE: To keep the API set complete, the list includes some APIs that may not appear in any of the APK files.
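A minimal sketch of reading the data and label files described above is given below. The helper names read_api_rows and read_labels are hypothetical, and the exact delimiter is an assumption: the sketch accepts commas or whitespace between API numbers, since the description only says that the files are CSV.

```python
import io
import re


def read_api_rows(handle):
    """Read one list of API numbers per line (one line per APK file).

    Assumption: numbers are separated by commas and/or whitespace.
    A blank line is kept as an empty row so that rows stay aligned
    with the label file.
    """
    rows = []
    for line in handle:
        tokens = [t for t in re.split(r"[,\s]+", line.strip()) if t]
        rows.append([int(t) for t in tokens])
    return rows


def read_labels(handle):
    """Read one class label (+1 or -1) per non-blank line."""
    return [int(line) for line in handle if line.strip()]


# Toy stand-ins for the real files; real use would pass open(...) handles.
toy_data = io.StringIO("1,4,9\n2,3\n")
toy_labels = io.StringIO("+1\n-1\n")

rows = read_api_rows(toy_data)
labels = read_labels(toy_labels)
assert len(rows) == len(labels)  # one label per APK file
print(rows, labels)
```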

MD5 (CDMC2017AndroidAPITrainData.csv) = 05d39653c618a82fac38c43eb0429b63
MD5 (CDMC2017AndroidAPITrainLabel.csv) = 4f9fd94f4bf5abc44716bbc258c78876
MD5 (CDMC2017AndroidAPITestData.csv) = 3f737a7c2c2e443639efb2e2a9e93aca
MD5 (CDMC2017AndroidAPIList.csv) = 0f9b728afbaf16b11a9f0907420a6b0d
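
The checksums above can be verified after download with a short script such as the following sketch. The function names md5_of_file and check_downloads are hypothetical, and the script assumes the four files are stored under their original names in a single directory.

```python
import hashlib
import os

# Checksums copied from the list above.
EXPECTED_MD5 = {
    "CDMC2017AndroidAPITrainData.csv": "05d39653c618a82fac38c43eb0429b63",
    "CDMC2017AndroidAPITrainLabel.csv": "4f9fd94f4bf5abc44716bbc258c78876",
    "CDMC2017AndroidAPITestData.csv": "3f737a7c2c2e443639efb2e2a9e93aca",
    "CDMC2017AndroidAPIList.csv": "0f9b728afbaf16b11a9f0907420a6b0d",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Map each expected file name to 'ok', 'mismatch' or 'missing'."""
    report = {}
    for name, expected in EXPECTED_MD5.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            report[name] = "missing"
        elif md5_of_file(path) == expected:
            report[name] = "ok"
        else:
            report[name] = "mismatch"
    return report
```

Calling check_downloads on the download directory returns one status per file; any "mismatch" entry suggests a corrupted or incomplete download.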

Reference: Tao Ban, Takeshi Takahashi, Shanqing Guo, Daisuke Inoue, Koji Nakao, Integration of Multi-modal Features for Android Malware Detection Using Linear SVM, The 11th Asia Joint Conference on Information Security (ASIAJCIS 2016), Fukuoka, Japan, Aug. 2016.

Task 2: Incident Detection over Unified Threat Management (UTM) operation on UniteCloud

UniteCloud is a resilient private cloud infrastructure built at Unitec Institute of Technology, New Zealand, using OpenStack for cloud orchestration and KVM for virtualization. UniteCloud specializes in supplying eLearning and eResearch services to tertiary students and staff.

This dataset is an operational log captured in real time from the Unified Threat Management (UTM) system running at the edge of the UniteCloud servers. The UTM accelerates and simplifies rule-based threat detection and prevention, incident response, and compliance management for our teams with limited resources.

Each sample has nine features, which correspond to operational measurements from nine selected sensors on the UTM platform. Samples are labeled according to the incident status determined from the collected log data. The training dataset contains 70,000 samples, and its class labels are supplied in a separate CSV file.

The goal of this task is to identify various incidents accurately from a range of sensor logs without high computational cost.

The statistical information of this dataset is summarized as:

No. of Samples   No. of Features   No. of Classes   No. of Training   No. of Testing
100,000          9                 2                70,000            30,000

Reference: Shaoning Pang, Tony Shi, Ruibin Zhang and Denis Lavrov, 2017 CDMC Task 2: Incident Detection over Unified Threat Management (UTM) operation on UniteCloud, Unitec Institute of Technology, Auckland, New Zealand, 2017.

Task 3: Fraud Detection in Financial Transactions

Financial fraud is a long-standing issue with far-reaching consequences, not only for the finance industry but also for other industry sectors and especially for ordinary consumers.

The original anonymised data comes from UAT or near-production systems. It was provided by a financial institution and is owned by the Internet Commerce Security Laboratory (ICSL), an Australia-based research laboratory that focuses on cyber-security and analytics.

The released dataset was synthesized from the original using the highly correlated rule-based uniformly distributed synthetic data (HCRUD) technique, so that its attributes follow a similar distribution. The class distribution differs from the original by less than 0.1%, and the distributions of main attributes such as Transaction Type and Account Type differ by less than 0.2%. The differences in the joint distributions of attribute combinations are also very minor. We provide a total of 100,000 sample transactions covering various account and transaction types, with 12 features for each transaction; the strings in the last column indicate the three classes.
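
The distribution differences quoted above could be measured along the lines of the following sketch, which compares the class proportions of two label lists. This is an assumption about how such a gap might be computed, not the procedure used by the data providers; the function names and the class names "A", "B" and "C" are hypothetical placeholders.

```python
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples that fall in each class."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {cls: n / len(labels) for cls, n in counts.items()}


def max_proportion_gap(original, synthetic):
    """Largest absolute difference in class proportion between two sets."""
    p = class_proportions(original)
    q = class_proportions(synthetic)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))


# Toy label lists with placeholder class names.
original = ["A"] * 7000 + ["B"] * 2000 + ["C"] * 1000
synthetic = ["A"] * 7005 + ["B"] * 1996 + ["C"] * 999

gap = max_proportion_gap(original, synthetic)
print(gap < 0.001)  # True: the largest gap is below 0.1%
```

The same comparison can be applied to any categorical attribute, such as Transaction Type or Account Type, by passing the attribute values instead of the class labels.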

The statistical information of the dataset is summarized as:

No. of Samples   No. of Features   No. of Classes   No. of Training   No. of Testing
100,000          12                3                70,000            30,000


Reference: Internet Commerce Security Laboratory (ICSL), 2017 CDMC Task 3: Fraud Detection in Financial Transactions, Federation University Australia, Ballarat, VIC, Australia, 2017.