DPATeX2025: DPA Defence via AI Perturbation Core Understanding
Task Overview
In the evolving landscape of cybersecurity, Data Poisoning Attacks (DPAs) have become a massive and continuously growing threat, leveraging sophisticated techniques to deceive users and bypass traditional defences. Defending against DPAs is particularly challenging due to their sheer volume and the difficulty of distinguishing them from legitimate content, as attackers frequently alter surface features (e.g., URLs, text) while retaining the malicious core. Effective detection therefore requires identifying the perturbation core: the underlying mathematical function or algorithmic logic that generates deceptive variations. In this context, AI-driven defence offers a promising solution, but its success hinges on training models to deeply understand and recognize the perturbation core, enabling robust detection even as attack variants evolve. This competition task, DPA Defence via AI Perturbation Core Understanding, challenges participants to develop innovative AI approaches that uncover and neutralize the foundational mechanisms of DPAs.
Dataset Overview
To address this challenge, we construct the DPATeX dataset. We begin by manually collecting 257 perturbation cores (PCs) from the literature, each expressing a core adversarial mechanism, and storing them in LaTeX format. These curated PCs serve as seeds to prompt a large language model (LLM) to generate 256 additional perturbation cores. The generation process adheres to strict control criteria: each generated core must belong to the same mathematical family as its seed, remain distinct, and preserve its adversarial effectiveness. The resulting DPATeX dataset comprises a total of 513 perturbation cores, categorized into seven mathematical classes: Gradient, Norm, Hybrid, Machine Learning (ML), Geometric, Bayesian Optimization, and Gradient-Free. Each sample is labeled and formatted in LaTeX, providing a structured resource for advancing AI-based defence research.
Each sample in the dataset is either a mathematical formula or an algorithm, stored in a LaTeX-based file organised into four main sections:
Input: Defines the mathematical input variables required for the perturbation function.
Output: Specifies the perturbed mathematical output after applying the perturbation function.
Formula: Contains the detailed mathematical formulation written in LaTeX. This section defines the precise operations, transformations, and perturbation functions used to generate the output.
Explanation: Provides a descriptive clarification of the formula, offering additional insights into the perturbation function and its role in data poisoning.
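The four-section layout above can be read programmatically. The sketch below is a minimal, hypothetical parser that assumes each section is introduced by a `\section{...}` command; the actual markup used in the dataset files may differ, so treat the pattern as an assumption to adjust.

```python
import re

# Hypothetical sketch: split a DPATeX sample into its four sections.
# ASSUMPTION: each section is introduced by \section{Name}; verify
# against the real files before relying on this.
SECTION_NAMES = ["Input", "Output", "Formula", "Explanation"]

def parse_sample(tex: str) -> dict:
    """Return a mapping from section name to its LaTeX body."""
    sections = {}
    for name in SECTION_NAMES:
        # Capture everything up to the next \section command or end of file.
        pattern = rf"\\section\{{{name}\}}(.*?)(?=\\section\{{|\Z)"
        m = re.search(pattern, tex, flags=re.DOTALL)
        if m:
            sections[name] = m.group(1).strip()
    return sections

sample = r"""\section{Input} x, \epsilon
\section{Output} x' = x + \delta
\section{Formula} \delta = \epsilon \cdot \mathrm{sign}(\nabla_x L)
\section{Explanation} FGSM-style perturbation."""
parsed = parse_sample(sample)
```

With a parser like this, the Formula section can be fed to a model separately from the Explanation text.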
Task Description
This competition task is to distinguish machine-generated PCs from the seed PCs, which were developed by humans. The full dataset consists of the seed PC dataset DPATeX, the machine-generated DPAGenTeX, and their merged combination. The statistics of the dataset used for this competition are:
| Dataset | Samples/Files (class 0: seed, class 1: machine-generated) |
|---|---|
| Train | 410 |
| Test | 103 |
The file name identifies the label: a human-compiled (seed) PC is named "file name_0.tex" and a machine-generated PC is named "file name_1.tex".
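Since the label is encoded in the trailing `_0`/`_1` suffix of the file name, it can be recovered directly. A minimal sketch (the example file names are placeholders, not real dataset files):

```python
from pathlib import Path

# Recover the class label from a DPATeX file name, where "..._0.tex"
# marks a human-compiled seed PC and "..._1.tex" a machine-generated one.
def label_from_name(path: str) -> int:
    stem = Path(path).stem               # e.g. "file name_1"
    return int(stem.rsplit("_", 1)[1])   # text after the last underscore

label_from_name("pc_gradient_12_0.tex")  # → 0 (seed)
label_from_name("pc_gradient_12_1.tex")  # → 1 (machine-generated)
```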
Results Submission
You need to submit a CSV file containing the test sample predictions. The expected submission is a CSV file with two columns: the first column is the file name and the second column is the class label.
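A minimal sketch of writing such a two-column CSV with the standard library; the file names and labels below are placeholders, and whether a header row is expected is not specified here, so none is written.

```python
import csv

# Hypothetical predictions: file name -> predicted class label.
predictions = {"sample_007_1.tex": 1, "sample_012_0.tex": 0}

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for name, label in sorted(predictions.items()):
        writer.writerow([name, label])  # first column: file name, second: label
```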
Citation of this Dataset
Chaalan et al., DPATeX2025: a dataset for DPA Mathematical Perturbation Core, for the 16th International Cybersecurity and generative competition (CyberAI Cup 2025) hosted by International Cyber-security Data-mining Society (ICSDS).
IoT Malware Classification via Function Call Graph Analysis
Task Overview
In the rapidly expanding Internet of Things (IoT) landscape, securing connected devices has become a critical challenge. IoT malware poses a significant threat by exploiting vulnerabilities in resource-constrained devices at scale. Automated and accurate analysis of IoT malware is essential for effective cybersecurity defences. This competition task, IoT Malware Classification via Function Call Graph Analysis, challenges participants to develop robust classifiers capable of detecting and categorizing IoT malware samples based on their structural behavior patterns.
Dataset Overview
To facilitate this task, we provide a dataset comprising Function Call Graphs (FCGs) extracted from real-world IoT malware samples targeting ARM CPU architectures. Each sample captures the dynamic interaction between malware functions, represented as:
- .data file: Contains FCG representations where:
- Each line corresponds to one malware sample
- Edges are defined as 2-tuples of function names (caller#callee), separated by spaces
- Example: funcA#funcB funcC#funcD represents a graph with two edges: funcA → funcB and funcC → funcD
- .label file: Provides supervised labels for training, with one class label per line (aligned with the .data file entries).
The dataset is partitioned into:
| Dataset | Samples |
|---|---|
| Train | 1,397 |
| Test | 1,397 |
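The `.data` format described above is simple to parse: each line is one sample, and each space-separated token encodes one edge as caller#callee. A minimal sketch:

```python
# Turn one .data line into a list of (caller, callee) edges.
def parse_fcg_line(line: str):
    return [tuple(tok.split("#", 1)) for tok in line.split()]

edges = parse_fcg_line("funcA#funcB funcC#funcD")
# edges == [("funcA", "funcB"), ("funcC", "funcD")]
```

Reading `train.data` line by line with this function, paired line-for-line with `train.label`, yields the supervised training set.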
Task Description
Participants must:
- Train a classifier using train.data and train.label to learn malware categories from FCGs.
- Predict labels for unseen samples in test.data, generating one label per line.
- Submit results in a text file mirroring train.label format.
Evaluation Focus:
- Accuracy in classifying malware families
- Scalability to large FCGs
- Robustness to obfuscation techniques
Submission Guidelines
Submit a .CSV file with one predicted label per line for test.data, e.g.:
mirai
gafgyt
gafgyt
rootnik
mobidash
tsunami
ddostf
Technical Hints
- Basic Approach: Preprocess edges (replace # with spaces) to extract caller-callee pairs.
- Advanced Methods: Leverage graph topology (e.g., node degrees, connectivity patterns) or graph neural networks (GNNs) for higher accuracy.
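As an illustration of the basic approach, the sketch below turns a parsed edge list into a few simple topology features (edge count, node count, maximum out-degree). The feature choice is ours, not prescribed by the task; a real solution would feed such features, or the graph itself, into a classifier.

```python
from collections import Counter

# Compute simple topology features from a list of (caller, callee) edges.
# ASSUMPTION: these particular features are illustrative, not required.
def fcg_features(edges):
    out_deg = Counter(caller for caller, _ in edges)   # out-degree per caller
    nodes = {n for e in edges for n in e}              # all distinct functions
    return {
        "num_edges": len(edges),
        "num_nodes": len(nodes),
        "max_out_degree": max(out_deg.values(), default=0),
    }

feats = fcg_features([("funcA", "funcB"), ("funcA", "funcC"), ("funcB", "funcC")])
```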
Citation
Shin-Ming Cheng and Tao Ban, Dataset for IoT Malware Classification via Function Call Graph Analysis, for the 16th International Cybersecurity and generative competition (CyberAI Cup 2025) hosted by International Cyber-security Data-mining Society (ICSDS).
PCTR: Photographed Chinese Table Reasoning (Optional)
Task Overview
Tables captured in real-world applications are often photographed under suboptimal conditions, such as blur, shadows, tilted angles, and lighting variations. This poses a great challenge to contemporary computer vision techniques. While recent multimodal language models (MLLMs) have demonstrated impressive reasoning capabilities on high-quality table images, their performance still degrades significantly when confronted with noisy, real-world photographs. To address this challenge, we present PCTR, a large-scale dataset specifically designed for multimodal table reasoning over photographed Chinese tables. The goal of this dataset is to evaluate and enhance the robustness and reasoning abilities of MLLMs under realistic conditions.
Task Description
The objective of this competition task is to develop a robust multimodal system that can accurately predict answers by processing both textual questions and their corresponding photographed table images. The challenge spans multiple STEM disciplines, including mathematics, physics, chemistry, biology, among others, requiring models to demonstrate effective cross-modal reasoning under real-world conditions. While the dataset uses Chinese tables, the competition task itself is language-agnostic as it evaluates the model's reasoning capabilities rather than language proficiency.
Dataset Overview
The PCTR dataset consists of 13,298 training samples and 1,000 test samples. The training data contains real-world noise and some annotation errors, while the test set has been meticulously verified by experts to ensure accuracy.
All data is organized in JSON format with accompanying image files. The training set includes train/train.json and a corresponding train/images directory, while the test set follows the same structure with test/test.json and test/images. Each JSON entry contains: a unique question ID, the file path to the associated photographed table image, a question grounded in the image content, a step-by-step annotated solution, and the final answer represented as a string. An example JSON entry for training is given below.
{"id": "4", "image": "images/train/1622044524477847261674147110912_0.jpg", "question": "初一和初二在90≤x≤100分数段的总人数是多少?", "solution": "12+15", "answer": "27"}
For testing, test.json gives the question and the file path to the corresponding image file.
{"id": "1", "image": "images/test/dfdb38e3500edd96d45b3398ae8b0e65.jpg", "question": "如果从周六之后开始之后到下周六利润呈等差数列,那下周三的利润是多少元?"}
Results Submission
For performance evaluation, participants should submit their model's predictions in a CSV file, including two columns:
- id: The unique ID of the question in the test set.
- model_answer: The model's predicted answer formatted as a string without units (e.g., "1538" instead of "1538 人"). For yes/no questions, the answer should be either "是" or "否" in Chinese.
Your submitted CSV answer should be formatted as the example given below:
id, model_answer
"1", "**"
"2", "**"
Citation of this Dataset
Xiaoqiang Kang, Qiufeng Wang and Kaizhu Huang, PCTR: Photographed Chinese Table Reasoning, for the 16th International Cybersecurity and generative competition (CyberAI Cup 2025) hosted by International Cyber-security Data-mining Society (ICSDS).