
NIDS Datasets

The nids-datasets package provides functionality to download and utilize specially curated and extracted datasets derived from the original UNSW-NB15 and CIC-IDS2017 datasets. The originals are flow-level datasets; the curated versions have been enhanced with packet-level information extracted from the raw PCAP files. Together they contain both packet-level and flow-level data for over 230 million packets: 179 million packets from UNSW-NB15 and 54 million packets from CIC-IDS2017.

Installation

Install the nids-datasets package using pip:

pip install nids-datasets

Import the package in your Python script:

from nids_datasets import Dataset, DatasetInfo

Dataset Information

The nids-datasets package currently supports two datasets: UNSW-NB15 and CIC-IDS2017. Each of these datasets contains a mix of normal traffic and different types of attack traffic, which are identified by their respective labels. The UNSW-NB15 dataset has 10 unique class labels, and the CIC-IDS2017 dataset has 24 unique class labels.

  • UNSW-NB15 Labels: 'normal', 'exploits', 'dos', 'fuzzers', 'generic', 'reconnaissance', 'worms', 'shellcode', 'backdoor', 'analysis'
  • CIC-IDS2017 Labels: 'BENIGN', 'FTP-Patator', 'SSH-Patator', 'DoS slowloris', 'DoS Slowhttptest', 'DoS Hulk', 'Heartbleed', 'Web Attack – Brute Force', 'Web Attack – XSS', 'Web Attack – SQL Injection', 'Infiltration', 'Bot', 'PortScan', 'DDoS', 'normal', 'exploits', 'dos', 'fuzzers', 'generic', 'reconnaissance', 'worms', 'shellcode', 'backdoor', 'analysis', 'DoS GoldenEye'

Subsets of the Dataset

Each dataset consists of four subsets:

  1. Network-Flows - Contains flow-level data.
  2. Packet-Fields - Contains packet header information.
  3. Packet-Bytes - Contains packet byte information in the range (0-255).
  4. Payload-Bytes - Contains payload byte information in the range (0-255).

Each subset contains 18 files (except Network-Flows, which has one file), where the data is stored in parquet format. In total, this package provides access to 110 files. You can choose to download all subsets or select specific subsets or specific files depending on your analysis requirements.

Getting Information on the Datasets

The DatasetInfo function returns a summary of the dataset as a pandas dataframe. It shows the number of packets for each class label across all 18 files in the dataset, which can help you decide which files to download and analyze.

df = DatasetInfo(dataset='UNSW-NB15') # or dataset='CIC-IDS2017'
df
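
Continuing from the summary dataframe returned above, you can inspect its layout and, once the row and column names are confirmed, use ordinary pandas filtering to pick files. The filter on a 'worms' column below is only an assumed illustration of the idea, not a documented layout.

# Inspect the layout before relying on specific row or column names.
print(df.head())

# Assumed layout (one row per file, one column per class label):
# files_with_worms = df.index[df['worms'] > 0].tolist()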

Downloading the Datasets

The Dataset class lets you specify the dataset, the subsets, and the files you are interested in; the download() method then fetches the specified data.

dataset = 'UNSW-NB15' # or 'CIC-IDS2017'
subset = ['Network-Flows', 'Packet-Fields', 'Payload-Bytes'] # or 'all' for all subsets
files = [3, 5, 10] # or 'all' for all files

data = Dataset(dataset=dataset, subset=subset, files=files)
data.download()

The directory structure after downloading files:

UNSW-NB15
│
├───Network-Flows
│   └───UNSW_Flow.parquet
│
├───Packet-Fields
│   ├───Packet_Fields_File_3.parquet
│   ├───Packet_Fields_File_5.parquet
│   └───Packet_Fields_File_10.parquet
│
└───Payload-Bytes
    ├───Payload_Bytes_File_3.parquet
    ├───Payload_Bytes_File_5.parquet
    └───Payload_Bytes_File_10.parquet

You can then load the parquet files using pandas:

import pandas as pd
df = pd.read_parquet('UNSW-NB15/Packet-Fields/Packet_Fields_File_10.parquet')
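
Continuing from the dataframe loaded above, a few quick pandas calls help you confirm the schema before filtering. The label-based filter below is only a sketch, since the exact column names in the parquet files should be checked via df.columns first.

print(df.shape)             # rows are packets, columns are packet fields
print(df.columns.tolist())  # confirm the actual column names
# df[df['label'] == 'normal']  # filter by class label, assuming a column named 'label'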

Merging Subsets

The merge() method combines the data for each packet across the selected subsets, providing both flow-level and packet-level information in a single file.

data.merge()

By default, the merge() method uses the details specified when the Dataset class was instantiated. You can also pass a list of subsets via the subset parameter and a list of files via the files parameter to merge only the data you need.
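
For example, merging the same subsets and files that were downloaded above can be written as the sketch below, which simply makes the defaults explicit:

data.merge(subset=['Network-Flows', 'Packet-Fields', 'Payload-Bytes'], files=[3, 5, 10])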

The directory structure after merging files:

UNSW-NB15
│
├───Network-Flows
│   └───UNSW_Flow.parquet
│
├───Packet-Fields
│   ├───Packet_Fields_File_3.parquet
│   ├───Packet_Fields_File_5.parquet
│   └───Packet_Fields_File_10.parquet
│
├───Payload-Bytes
│   ├───Payload_Bytes_File_3.parquet
│   ├───Payload_Bytes_File_5.parquet
│   └───Payload_Bytes_File_10.parquet
│
└───Network-Flows+Packet-Fields+Payload-Bytes
    ├───Network_Flows+Packet_Fields+Payload_Bytes_File_3.parquet
    ├───Network_Flows+Packet_Fields+Payload_Bytes_File_5.parquet
    └───Network_Flows+Packet_Fields+Payload_Bytes_File_10.parquet
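
A merged file can be loaded with pandas just like the individual subsets, giving flow-level and packet-level columns side by side in a single dataframe:

import pandas as pd
df = pd.read_parquet('UNSW-NB15/Network-Flows+Packet-Fields+Payload-Bytes/Network_Flows+Packet_Fields+Payload_Bytes_File_10.parquet')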

Extracting Bytes

The Packet-Bytes and Payload-Bytes subsets contain only the first 1500-1600 bytes of each packet. To retrieve all bytes (up to 65535 bytes), use the bytes() method. This method requires the corresponding files from the Packet-Fields subset to operate. You can specify how many bytes to extract by passing the max_bytes parameter.

data.bytes(payload=True, max_bytes=2500)

Use packet=True to extract packet bytes instead, and pass a list of files via the files parameter to extract bytes only for those files.
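
For instance, extracting packet bytes for only two of the downloaded files might look like the sketch below (this particular call is illustrative and is not reflected in the directory tree that follows):

data.bytes(packet=True, max_bytes=2500, files=[3, 5])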

The directory structure after extracting bytes:

UNSW-NB15
│
├───Network-Flows
│   └───UNSW_Flow.parquet
│
├───Packet-Fields
│   ├───Packet_Fields_File_3.parquet
│   ├───Packet_Fields_File_5.parquet
│   └───Packet_Fields_File_10.parquet
│
├───Payload-Bytes
│   ├───Payload_Bytes_File_3.parquet
│   ├───Payload_Bytes_File_5.parquet
│   └───Payload_Bytes_File_10.parquet
│
├───Network-Flows+Packet-Fields+Payload-Bytes
│   ├───Network_Flows+Packet_Fields+Payload_Bytes_File_3.parquet
│   ├───Network_Flows+Packet_Fields+Payload_Bytes_File_5.parquet
│   └───Network_Flows+Packet_Fields+Payload_Bytes_File_10.parquet
│
└───Payload-Bytes-2500
    ├───Payload_Bytes_File_3.parquet
    ├───Payload_Bytes_File_5.parquet
    └───Payload_Bytes_File_10.parquet

Reading the Datasets

The read() method lets you read files using Hugging Face's load_dataset method, one subset at a time. The dataset and files parameters are optional if the same details were used when instantiating the Dataset class.

dataset = data.read(dataset='UNSW-NB15', subset='Packet-Fields', files=[1,2])

The read() method returns a dataset that you can convert to a pandas dataframe or save to a CSV, parquet, or any other desired file format:

df = dataset.to_pandas()
dataset.to_csv('file_path_to_save.csv')
dataset.to_parquet('file_path_to_save.parquet')

For scenarios where you want to process one packet at a time, you can use the stream=True parameter:

dataset = data.read(dataset='UNSW-NB15', subset='Packet-Fields', files=[1,2], stream=True)
print(next(iter(dataset)))
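
Because the streamed dataset is iterable, you can also loop over it and stop once you have processed enough packets. A minimal sketch:

# Process the first five packets from the stream, one at a time.
for i, packet in enumerate(dataset):
    print(packet)   # each item is a dict of field names to values
    if i == 4:
        break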

Notes

The size of these datasets is large, and depending on the subset(s) selected and the number of bytes extracted, the operations can be resource-intensive. Therefore, it's recommended to ensure you have sufficient disk space and RAM when using this package.
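
If you are unsure whether a full download will fit, a quick free-space check before calling download() can save time. A minimal sketch using the standard library (the 20 GB threshold is an arbitrary example, not an official requirement):

import shutil

free_gb = shutil.disk_usage('.').free / 1e9  # free space in the current directory, in GB
if free_gb < 20:  # arbitrary example threshold
    print(f"Only {free_gb:.1f} GB free - consider selecting fewer subsets or files.")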
