Task 1: TTS Distinction in Human Voices (TTSD)

Human-machine coexistence is becoming the norm, but machines carry potential risks and uncertainties when they malfunction. The TTS Distinction in Human Voices (TTSD) competition addresses this challenge by pushing the boundaries of how we differentiate between human speech and sophisticated text-to-speech (TTS) simulations.

Data

This competition dataset includes the following:

  • LJ Speech Dataset: A public domain speech dataset consisting of 13,100 short audio clips from a single speaker reading passages from 7 non-fiction books. Clips vary in length from 1 to 10 seconds and total about 24 hours of audio.
  • ASVspoof Dataset: Generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques.

The dataset comprises a total of 42,888 audio files:

  • Human voice: Labeled '1'.
  • TTS: Labeled '0'.

The label is recorded as the last character of each audio file's name, immediately before the .wav extension. For example, in2yr016_1.wav is human voice and qb13uc57_0.wav is TTS.
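The naming rule above can be read programmatically. The following is a minimal sketch, not an official loader: the function name label_from_filename is hypothetical, and it assumes every file name ends in an underscore-separated 0 or 1 before the extension, as in the two examples given.

```python
from pathlib import Path


def label_from_filename(name: str) -> int:
    """Return 1 (human voice) or 0 (TTS) from a name such as 'in2yr016_1.wav'."""
    stem = Path(name).stem  # drops any directory part and the '.wav' extension
    if not stem or stem[-1] not in ("0", "1"):
        raise ValueError(f"no 0/1 label found at the end of {name!r}")
    return int(stem[-1])    # last character before the extension


print(label_from_filename("in2yr016_1.wav"))  # 1 (human voice)
print(label_from_filename("qb13uc57_0.wav"))  # 0 (TTS)
```

Raising an error on unexpected names, rather than guessing, makes mislabeled or stray files visible early.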

For this task, 30,021 audio files are provided for training and 12,867 for testing. The dataset statistics are as follows:

Dataset | File | Samples (Rows) | Dimension (Columns)
TTSD_Train | TTSD_Train.zip | 30,021 | Last character of the file name is the label (1 human, 0 TTS)
TTSD_Test | TTSD_Test.zip | 12,867 | 2 (the last column is the class label)

Results Submission

You need to submit a CSV file that contains the test sample predictions. The CSV file should have two columns:

  1. File Name: The name of each audio file in the test set.
  2. Class Label: Your predicted class label for each audio file (‘1’ for human voice, ‘0’ for TTS).
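As an illustration of the two-column layout described above, here is a minimal sketch using Python's standard csv module. The function name write_submission, the example file names, and the presence of a header row are assumptions; the task description does not say whether a header line is expected.

```python
import csv
import io


def write_submission(predictions: dict, out) -> None:
    """Write 'file name, class label' rows to an open text file object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["File Name", "Class Label"])  # header row is an assumption
    for file_name, label in predictions.items():
        if label not in (0, 1):
            raise ValueError(f"label for {file_name!r} must be 0 or 1, got {label!r}")
        writer.writerow([file_name, label])


# Hypothetical predictions, written to memory so the sketch runs without the dataset.
buffer = io.StringIO()
write_submission({"a_example.wav": 1, "b_example.wav": 0}, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open(path, "w", newline="") instead of the in-memory buffer.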

Citation of this Dataset

Yujie Chen and Guoqiang Zhong, TTSD2024: a dataset for TTS Distinction in Human Voices, for the 15th International Cybersecurity and Generative competition (CyberAI Cup 2024) hosted by International Cyber-security Data-mining Society (ICSDS).

Note: Please log in to download data.

Task 2: Realtime TTS Detection in Human Voices (rTTSD)

Artificial Intelligence (AI) has become deeply integrated into daily life, making real-time detection of AI-driven entities essential for personal and information security. The need for real-time text-to-speech (TTS) detection in human voices is evident across diverse environments, from home virtual assistants to public service bots. Accurately identifying whether an interaction is human or AI-mediated is crucial for preserving the authenticity of communication and protecting individuals from potential AI misuse. Awareness of AI interaction fosters transparency, consent, and trust, which are foundational to the secure and ethical integration of AI technologies into society. This competition task therefore seeks precise, real-time detection methods for machine-synthesized speech.

Data

The voice dataset files come partly from LJ Speech, partly from ASVspoof, and partly from speech synthesis technology.

  • LJ Speech Dataset: A public domain speech dataset consisting of 13,100 short audio clips from a single speaker reading passages from 7 non-fiction books. Clips vary in length from 1 to 10 seconds and total about 24 hours of audio. The samples are single-channel audio with a sampling rate of 22.05 kHz.
  • ASVspoof Dataset: Generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. The audio is sampled at either 16 kHz or 22.05 kHz.
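Because the two sources use different sampling rates, it may help to inspect each file before feature extraction. The sketch below uses Python's standard wave module; the helper name wav_format is hypothetical, and it assumes the files are uncompressed PCM WAV, which the description does not state.

```python
import io
import wave


def wav_format(source):
    """Return (sample_rate_hz, channel_count) for a WAV path string or file object."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


# Build a tiny in-memory mono clip at 22,050 Hz so the sketch runs without the dataset.
demo = io.BytesIO()
with wave.open(demo, "wb") as wav:
    wav.setnchannels(1)       # single channel
    wav.setsampwidth(2)       # 16-bit PCM
    wav.setframerate(22050)   # 22.05 kHz
    wav.writeframes(b"\x00\x00" * 100)
demo.seek(0)

rate, channels = wav_format(demo)
print(rate, channels)  # 22050 1
```

Files that report 16,000 Hz and files that report 22,050 Hz would typically be resampled to one common rate before training; the choice of target rate is left to the participant.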

To prepare the rTTSD dataset, we chose a random synthetic voice (TTS) sample and reduced its volume to 60% of the original. We then mixed the modified TTS sample into a live human voice recording, inserting it at a random point in the stream, to create a new, blended voice. As a result, every audio file produced this way contains a portion of TTS audio. We then merged this TTS-injected human voice dataset with the pure human voices. In the labels, files containing both human and TTS voice data are marked with a '1' in their file name, and files with pure human voice data are marked with a '0'. Note that this is the reverse of the Task 1 convention, where '1' denotes human voice. We randomly selected 30% of the data as the Test dataset; the remaining 70% is the Train dataset for developing and training the competition models.
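The injection procedure can be illustrated with a short sketch. This is not the organizers' code: it assumes both clips share one sampling rate, that samples are floats normalised to [-1.0, 1.0], that the TTS clip is no longer than the human clip, and that "mixed" means an additive overlay rather than a splice. The function name inject_tts is hypothetical.

```python
import random
from typing import List, Optional, Tuple


def inject_tts(human: List[float], tts: List[float], gain: float = 0.6,
               rng: Optional[random.Random] = None) -> Tuple[List[float], int]:
    """Overlay a gain-scaled TTS clip on a human clip at a random offset.

    Returns the mixed samples and the index where the TTS segment starts.
    """
    rng = rng or random.Random()
    if len(tts) > len(human):
        raise ValueError("TTS clip must not be longer than the human clip")
    offset = rng.randint(0, len(human) - len(tts))  # inclusive on both ends
    mixed = list(human)                             # leave the input untouched
    for i, sample in enumerate(tts):
        # Additive mix at 60% TTS volume, clipped to the normalised range.
        mixed[offset + i] = max(-1.0, min(1.0, mixed[offset + i] + gain * sample))
    return mixed, offset


# Toy clips: ten samples of "human" audio and four samples of "TTS" audio.
human = [0.1] * 10
tts = [0.5] * 4
mixed, offset = inject_tts(human, tts, rng=random.Random(0))
print(len(mixed), offset)
```

The output has the same length as the human clip, and only the samples between offset and offset + len(tts) differ from it. A detector therefore has to localise a TTS segment inside otherwise genuine speech.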

Dataset | File | Samples (Rows) | Dimension
Training | rTTSD_Train | 33,358 | 2 (file name and label; there are two labels in total, one with 16,632 samples and the other with 16,726)
Testing | rTTSD_Test | 14,296 | 1 (with no label)

Results Submission

You are required to submit one CSV file containing the predictions for the test samples. The file should have 1 column (the predicted label, using the same definition as the training data class label) and 14,296 rows, with no header line.
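A minimal sketch of writing and sanity-checking such a file follows. The function name write_rttsd_submission is hypothetical, and the sketch assumes rows must follow the order of the test files, which the description does not specify.

```python
import csv
import io

EXPECTED_ROWS = 14296  # number of test samples stated for this task


def write_rttsd_submission(labels, out, expected_rows: int = EXPECTED_ROWS) -> None:
    """Write one predicted label per row, with no header line."""
    labels = list(labels)
    if len(labels) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(labels)}")
    writer = csv.writer(out, lineterminator="\n")
    for label in labels:
        if label not in (0, 1):
            raise ValueError(f"labels must be 0 or 1, got {label!r}")
        writer.writerow([label])  # single column


# Placeholder predictions (all zeros), written to memory for illustration.
buffer = io.StringIO()
write_rttsd_submission([0] * EXPECTED_ROWS, buffer)
print(buffer.getvalue().count("\n"))  # 14296
```

Checking the row count before writing catches truncated prediction lists, which would otherwise shift every later row against its test file.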

Citation of this Dataset

Li Cong and Guoqiang Zhong, rTTSD2024: A dataset for real-time TTS Detection in Human Voices, for the 15th International Cybersecurity and Generative competition (CyberAI Cup 2024) hosted by International Cyber-security Data-mining Society (ICSDS).


Task 3: KISTI-IDS2024: Network Intrusion Detection

With the evolution of networks and technology, cyber-attacks are becoming more sophisticated. To effectively detect them, research must incorporate the latest attack characteristics. KISTI-IDS2024 collects real-world network IDS alerts encompassing 26 types of attacks and payload information. The dataset aims to stimulate IDS research that reflects real-world environments and considers payload content.

Dataset Details

The dataset comprises 179,500 IDS alerts labelled as malicious or benign. Additionally, all payload words are embedded as 100-dimensional word vectors. There are 26 attack types and the malicious/benign ratio is 50/50.

The dataset files follow a six-column, tab-separated structure:

  1. Category: The type of attack or benign event.
  2. Source Port: The originating port of the network traffic.
  3. Destination Port: The receiving port of the network traffic.
  4. Packet Size: The size of the network packet.
  5. Word Vector (Payload): The embedded word vector representation of the payload content.
  6. Label: Indicates whether the alert is malicious (1) or benign (2).
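The six-column layout above can be parsed one line at a time. The sketch below is illustrative only: the Alert class and parse_alert function are hypothetical names, the sample row is made up, and ports, packet size, and label are assumed to be integers. Because the serialisation of the word-vector field is not specified, it is kept as raw text rather than guessed at.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    category: str
    source_port: int
    destination_port: int
    packet_size: int
    payload_vector: str  # raw text; the field's internal format is not specified
    label: int


def parse_alert(line: str) -> Alert:
    """Parse one tab-separated training row into an Alert."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    category, src, dst, size, payload, label = fields
    return Alert(category, int(src), int(dst), int(size), payload, int(label))


row = "example_category\t51234\t80\t512\t0.12 -0.03 0.40\t1\n"  # made-up row
alert = parse_alert(row)
print(alert.destination_port, alert.label)  # 80 1
```

Test rows have five columns (no label), so a test-file parser would need the same logic with the last field dropped.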

Train and Test Files

We divided the dataset into training and test files in an approximately 80/20 split (train.tsv: 143,578 alerts; test.tsv: 35,922 alerts).

Dataset | File | Samples (Rows) | Dimension (Columns)
Training | Train.tsv | 143,578 | 6 (the last column is the class label)
Testing | Test.tsv | 35,922 | 5

Results Submission

You are required to submit one TSV file containing the predictions for the test samples. The file should have 1 column (the predicted label, using the same definition as the training data class label) and 35,922 rows.

Citation of this Dataset

Kyu-il Kim and Jungsuk Song, KISTI-IDS2024 dataset, for the 15th Cybersecurity and Generative AI Competition (CyberAI Cup 2024), Defensible Science and Technology Security R&D Center, Korea Institute of Science and Technology Information (KISTI), June 2024.
