Download and utilize specially curated and extracted datasets from the original UNSW-NB15 and CIC-IDS2017 datasets
NIDS Datasets
The nids-datasets package lets you download and use curated datasets extracted from the original UNSW-NB15 and CIC-IDS2017 datasets. The originals were flow-only datasets; these versions add packet-level information extracted from the raw PCAP files. Together they contain both packet-level and flow-level data for over 230 million packets, with 179 million packets from UNSW-NB15 and 54 million packets from CIC-IDS2017.
Installation
Install the nids-datasets package using pip:
pip install nids-datasets
Import the package in your Python script:
from nids_datasets import Dataset, DatasetInfo
Dataset Information
The nids-datasets package currently supports two datasets: UNSW-NB15 and CIC-IDS2017. Each contains a mix of normal traffic and several types of attack traffic, identified by class labels. The UNSW-NB15 dataset has 10 unique class labels, and the CIC-IDS2017 dataset has 15 unique class labels.
- UNSW-NB15 Labels: 'normal', 'exploits', 'dos', 'fuzzers', 'generic', 'reconnaissance', 'worms', 'shellcode', 'backdoor', 'analysis'
- CIC-IDS2017 Labels: 'BENIGN', 'FTP-Patator', 'SSH-Patator', 'DoS slowloris', 'DoS Slowhttptest', 'DoS Hulk', 'DoS GoldenEye', 'Heartbleed', 'Web Attack – Brute Force', 'Web Attack – XSS', 'Web Attack – SQL Injection', 'Infiltration', 'Bot', 'PortScan', 'DDoS'
Subsets of the Dataset
Each dataset consists of four subsets:
- Network-Flows - Contains flow-level data.
- Packet-Fields - Contains packet header information.
- Packet-Bytes - Contains packet byte information in the range (0-255).
- Payload-Bytes - Contains payload byte information in the range (0-255).
Each subset contains 18 files (except Network-Flows, which has one file), stored in Parquet format. That is 55 files per dataset and 110 files across both datasets. You can download all subsets, or only specific subsets or files, depending on your analysis requirements.
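The total of 110 files follows from the layout described above. A quick check of the arithmetic, where the names FILES_PER_SUBSET and DATASETS are chosen only for this illustration and are not part of the package:

```python
# Files per subset for one dataset, as described above:
# Network-Flows is a single file; the other three subsets have 18 files each.
FILES_PER_SUBSET = {
    "Network-Flows": 1,
    "Packet-Fields": 18,
    "Packet-Bytes": 18,
    "Payload-Bytes": 18,
}
DATASETS = ["UNSW-NB15", "CIC-IDS2017"]

files_per_dataset = sum(FILES_PER_SUBSET.values())  # 1 + 3 * 18
total_files = files_per_dataset * len(DATASETS)     # both datasets together

print(files_per_dataset, total_files)
```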
Getting Information on the Datasets
The DatasetInfo function returns a summary of the dataset as a pandas DataFrame. It shows the number of packets for each class label across all 18 files in the dataset. This overview can guide you in selecting specific files for download and analysis.
df = DatasetInfo(dataset='UNSW-NB15') # or dataset='CIC-IDS2017'
df
Downloading the Datasets
The Dataset class lets you specify the dataset, subsets, and files you are interested in. Calling download() then downloads the specified data.
dataset = 'UNSW-NB15' # or 'CIC-IDS2017'
subset = ['Network-Flows', 'Packet-Fields', 'Payload-Bytes'] # or 'all' for all subsets
files = [3, 5, 10] # or 'all' for all files
data = Dataset(dataset=dataset, subset=subset, files=files)
data.download()
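Because downloads can be large, it may help to validate a selection before starting one. The following is a minimal sketch, not part of the package: check_selection is a hypothetical helper, the subset names come from the list above, and the assumption that files are numbered 1 to 18 is inferred from the file count and the examples rather than confirmed by the package.

```python
VALID_DATASETS = {"UNSW-NB15", "CIC-IDS2017"}
VALID_SUBSETS = {"Network-Flows", "Packet-Fields", "Packet-Bytes", "Payload-Bytes"}
NUM_FILES = 18  # assumption: files are numbered 1..18


def check_selection(dataset, subset="all", files="all"):
    """Raise ValueError if the selection does not match the documented options."""
    if dataset not in VALID_DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if subset != "all":
        # Accept either a single subset name or a list of names.
        names = [subset] if isinstance(subset, str) else list(subset)
        unknown = sorted(set(names) - VALID_SUBSETS)
        if unknown:
            raise ValueError(f"unknown subsets: {unknown}")
    if files != "all":
        bad = [f for f in files if not (isinstance(f, int) and 1 <= f <= NUM_FILES)]
        if bad:
            raise ValueError(f"file numbers must be in 1..{NUM_FILES}: {bad}")


# The selection used in the example above passes the check.
check_selection("UNSW-NB15", ["Network-Flows", "Packet-Fields", "Payload-Bytes"], [3, 5, 10])
```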
The directory structure after downloading files:
UNSW-NB15
│
├───Network-Flows
│ └───UNSW_Flow.parquet
│
├───Packet-Fields
│ ├───Packet_Fields_File_3.parquet
│ ├───Packet_Fields_File_5.parquet
│ └───Packet_Fields_File_10.parquet
│
└───Payload-Bytes
  ├───Payload_Bytes_File_3.parquet
  ├───Payload_Bytes_File_5.parquet
  └───Payload_Bytes_File_10.parquet
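The naming pattern in the tree above can be used to check which files are already on disk. This is a sketch under the assumption that the pattern shown (subset name with hyphens replaced by underscores, followed by _File_N.parquet) holds for every per-file subset; expected_files is a hypothetical helper, and Network-Flows is left out because its single file follows a different naming scheme.

```python
from pathlib import Path


def expected_files(dataset, subset, files, root="."):
    """Build the paths the tree above suggests for a per-file subset."""
    stem = subset.replace("-", "_")  # 'Packet-Fields' -> 'Packet_Fields'
    folder = Path(root) / dataset / subset
    return [folder / f"{stem}_File_{n}.parquet" for n in files]


paths = expected_files("UNSW-NB15", "Packet-Fields", [3, 5, 10])
missing = [p.as_posix() for p in paths if not p.exists()]
print("missing:", missing)
```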
You can then load the parquet files using pandas:
import pandas as pd
df = pd.read_parquet('UNSW-NB15/Packet-Fields/Packet_Fields_File_10.parquet')
Merging Subsets
The merge() method merges the data of each packet across the selected subsets, so that each merged file provides both flow-level and packet-level information.
data.merge()
By default, merge() uses the details specified when the Dataset class was instantiated. You can also pass subset (a list of subsets) and files (a list of file numbers) to choose what to merge.
The directory structure after merging files:
UNSW-NB15
│
├───Network-Flows
│ └───UNSW_Flow.parquet
│
├───Packet-Fields
│ ├───Packet_Fields_File_3.parquet
│ ├───Packet_Fields_File_5.parquet
│ └───Packet_Fields_File_10.parquet
│
├───Payload-Bytes
│ ├───Payload_Bytes_File_3.parquet
│ ├───Payload_Bytes_File_5.parquet
│ └───Payload_Bytes_File_10.parquet
│
└───Network-Flows+Packet-Fields+Payload-Bytes
  ├───Network_Flows+Packet_Fields+Payload_Bytes_File_3.parquet
  ├───Network_Flows+Packet_Fields+Payload_Bytes_File_5.parquet
  └───Network_Flows+Packet_Fields+Payload_Bytes_File_10.parquet
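The merged folder and file names in the tree above appear to be built by joining the subset names with "+". The sketch below reproduces that pattern so a script can locate merged output; merged_names is a hypothetical helper, and the assumption that the order matches the order in which subsets were passed is inferred from this one example.

```python
def merged_names(subsets, file_number):
    """Return (folder name, file name) following the pattern in the tree above."""
    folder = "+".join(subsets)
    stem = "+".join(s.replace("-", "_") for s in subsets)
    return folder, f"{stem}_File_{file_number}.parquet"


folder, filename = merged_names(["Network-Flows", "Packet-Fields", "Payload-Bytes"], 3)
print(f"UNSW-NB15/{folder}/{filename}")
```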
Extracting Bytes
The Packet-Bytes and Payload-Bytes subsets contain only the first 1500-1600 bytes. To retrieve more bytes (up to 65535) for these subsets, use the bytes() method. This method requires the files of the Packet-Fields subset to operate. Use the max_bytes parameter to set how many bytes to extract.
data.bytes(payload=True, max_bytes=2500)
Use packet=True to extract packet bytes instead. You can also pass files (a list of file numbers) to select the files for which bytes are retrieved.
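To make the effect of max_bytes concrete, the sketch below shows how a raw payload maps to byte values in the range 0-255, capped at a chosen length. It illustrates the representation only and is not the package's implementation; payload_to_ints is a hypothetical helper, and how the package stores shorter payloads (for example, whether it pads them) is not covered here.

```python
from typing import List


def payload_to_ints(payload: bytes, max_bytes: int = 2500) -> List[int]:
    """Return at most max_bytes byte values (each 0-255) from a raw payload."""
    if not 0 < max_bytes <= 65535:
        raise ValueError("max_bytes must be between 1 and 65535")
    # Iterating over a bytes object yields integers in the range 0-255.
    return list(payload[:max_bytes])


print(payload_to_ints(b"\x00\xffGET", max_bytes=4))
```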
The directory structure after extracting bytes:
UNSW-NB15
│
├───Network-Flows
│ └───UNSW_Flow.parquet
│
├───Packet-Fields
│ ├───Packet_Fields_File_3.parquet
│ ├───Packet_Fields_File_5.parquet
│ └───Packet_Fields_File_10.parquet
│
├───Payload-Bytes
│ ├───Payload_Bytes_File_3.parquet
│ ├───Payload_Bytes_File_5.parquet
│ └───Payload_Bytes_File_10.parquet
│
├───Network-Flows+Packet-Fields+Payload-Bytes
│ ├───Network_Flows+Packet_Fields+Payload_Bytes_File_3.parquet
│ ├───Network_Flows+Packet_Fields+Payload_Bytes_File_5.parquet
│ └───Network_Flows+Packet_Fields+Payload_Bytes_File_10.parquet
│
└───Payload-Bytes-2500
  ├───Payload_Bytes_File_3.parquet
  ├───Payload_Bytes_File_5.parquet
  └───Payload_Bytes_File_10.parquet
Reading the Datasets
The read() method reads files using Hugging Face's load_dataset function, one subset at a time. The dataset and files parameters are optional if they match the details used to instantiate the Dataset class.
dataset = data.read(dataset='UNSW-NB15', subset='Packet-Fields', files=[1,2])
The read() method returns a dataset that you can convert to a pandas DataFrame or save as CSV, Parquet, or any other desired file format:
df = dataset.to_pandas()
dataset.to_csv('file_path_to_save.csv')
dataset.to_parquet('file_path_to_save.parquet')
To process one packet at a time, pass the stream=True parameter:
dataset = data.read(dataset='UNSW-NB15', subset='Packet-Fields', files=[1,2], stream=True)
print(next(iter(dataset)))
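A streamed dataset yields one record (a dict) per packet, so it can be consumed with ordinary iteration without loading everything into memory. The sketch below counts class labels over the first records of any such iterable. The helper count_labels and the column name "label" are hypothetical; check the actual column names in the data before using it. A small stand-in list replaces the streamed dataset so the example runs without a download.

```python
from collections import Counter
from itertools import islice


def count_labels(records, label_key, limit=None):
    """Count label values over an iterable of dict records.

    If limit is given, only the first `limit` records are read.
    """
    counts = Counter()
    for record in islice(records, limit):
        counts[record[label_key]] += 1
    return counts


# Stand-in records; with the package you would pass the streamed dataset instead.
sample = [{"label": "normal"}, {"label": "dos"}, {"label": "normal"}]
print(count_labels(sample, label_key="label"))
```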
Notes
These datasets are large, and depending on the subsets selected and the number of bytes extracted, operations can be resource-intensive. Make sure you have sufficient disk space and RAM before using this package.
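One way to guard against running out of disk space is to check free space before downloading or extracting. This is a minimal sketch using the standard library; has_free_space is a hypothetical helper, and the 50 GB default is an arbitrary placeholder, not a size published for these datasets.

```python
import shutil


def has_free_space(path=".", required_gb=50.0):
    """Return True if the filesystem holding `path` has at least required_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if not has_free_space(".", required_gb=50.0):
    print("Warning: less free disk space than the chosen threshold.")
```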