A toolkit for large network traffic datasets

The goal of this project is to provide tools for working with large network traffic datasets and to facilitate research in the traffic classification area. The core functions of the cesnet-datazoo package are:

  • A common API for downloading, configuring, and loading four public datasets of encrypted network traffic.
  • Extensive configuration options for:
    • Selection of train, validation, and test periods.
    • Selection of application classes and splitting classes between known and unknown.
    • Data transformations, such as feature scaling.
  • Built on suitable data structures for experiments with large datasets. There are several caching mechanisms to make repeated runs faster, for example, when searching for the best model configuration.
  • Datasets are offered in multiple sizes so that users can start experimenting at a smaller scale, with a faster download and lower disk usage. The default is the S size, which contains 25 million samples.
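
To make the class-selection option above more concrete, here is a minimal sketch of splitting application classes into known and unknown and relabeling samples accordingly. It is plain Python written for illustration and is not the package's implementation; the names `split_classes`, `relabel`, and `UNKNOWN`, as well as the toy application names, are hypothetical.

```python
import random

UNKNOWN = "unknown"  # hypothetical label assigned to samples of unknown classes


def split_classes(classes, n_known, seed=0):
    """Randomly pick n_known classes as known; the remaining ones are unknown."""
    ordered = sorted(set(classes))  # deterministic starting order
    if not 0 <= n_known <= len(ordered):
        raise ValueError("n_known must be between 0 and the number of classes")
    rng = random.Random(seed)  # local generator, reproducible for a given seed
    known = set(rng.sample(ordered, n_known))
    unknown = set(ordered) - known
    return known, unknown


def relabel(labels, known):
    """Keep labels of known classes and map every other label to UNKNOWN."""
    return [label if label in known else UNKNOWN for label in labels]


# Toy example with made-up application names
apps = ["app-a", "app-b", "app-c", "app-d"]
known_apps, unknown_apps = split_classes(apps, n_known=3)
print("known:", sorted(known_apps))
print("unknown:", sorted(unknown_apps))
print(relabel(["app-a", "app-d", "app-b"], known_apps))
```

A fixed seed keeps the split reproducible across runs, which matters when results from different model configurations are compared.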

See the related project CESNET Models, which provides pretrained neural networks for traffic classification.

Example Jupyter notebooks are included in a separate Traffic Classification Examples repository.

The Transfer Learning Codebase reproduces the experiments from our paper, covering ten downstream traffic classification tasks with three transfer approaches (k-NN, linear probing, and full model fine-tuning).

Datasets

The cesnet-datazoo package currently provides four datasets, detailed in the following table.

| Name | CESNET-TLS22 | CESNET-QUIC22 | CESNET-TLS-Year22 | CESNET-QUICEXT-25 |
|---|---|---|---|---|
| Protocol | TLS | QUIC | TLS | QUIC |
| Published in | 2022 | 2023 | 2024 | 2025 |
| Collection duration | 2 weeks | 4 weeks | 1 year | 1 year |
| Collection period | 4.10.2021 - 17.10.2021 | 31.10.2022 - 27.11.2022 | 1.1.2022 - 31.12.2022 | 1.6.2024 - 31.5.2025 |
| Application count | 191 | 102 | 180 | 50 |
| Available samples | 141,392,195 | 153,226,273 | 507,739,073 | 194,296,462 |
| Available dataset sizes | XS, S, M, L | XS, S, M, L | XS, S, M, L | XS, S, M, L |
| Cite | https://doi.org/10.1016/j.comnet.2022.109467 | https://doi.org/10.1016/j.dib.2023.108888 | https://doi.org/10.1038/s41597-024-03927-4 | |
| Zenodo URL | https://zenodo.org/record/7965515 | https://zenodo.org/record/7963302 | https://zenodo.org/records/10608607 | https://zenodo.org/records/17249078 |

Related papers: https://doi.org/10.23919/TMA58422.2023.10199052 https://doi.org/10.1145/3768988
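
The collection periods in the table are written as day.month.year. As a quick sanity check, the following sketch parses them and counts the days in each period. It assumes that both the start and the end date are included in the period; the helper names `parse_dmy` and `collection_days` are made up for this example.

```python
from datetime import date


def parse_dmy(text):
    """Parse a 'day.month.year' string such as '4.10.2021' into a date."""
    day, month, year = (int(part) for part in text.strip().split("."))
    return date(year, month, day)


def collection_days(period):
    """Return the number of days in a 'start - end' period, ends included."""
    start_text, end_text = period.split(" - ")
    start, end = parse_dmy(start_text), parse_dmy(end_text)
    return (end - start).days + 1  # +1 because both end points are counted


# Collection periods copied from the table
PERIODS = {
    "CESNET-TLS22": "4.10.2021 - 17.10.2021",
    "CESNET-QUIC22": "31.10.2022 - 27.11.2022",
    "CESNET-TLS-Year22": "1.1.2022 - 31.12.2022",
    "CESNET-QUICEXT-25": "1.6.2024 - 31.5.2025",
}

for name, period in PERIODS.items():
    print(f"{name}: {collection_days(period)} days")
```

Under the inclusive-dates assumption, the results are 14, 28, 365, and 365 days, which agrees with the listed durations of 2 weeks, 4 weeks, and 1 year.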

Installation

Install the package from PyPI with pip:

pip install cesnet-datazoo

or, for an editable install from the GitHub repository:

pip install -e "git+https://github.com/CESNET/cesnet-datazoo#egg=cesnet-datazoo"
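
To confirm that the installation worked, one option is to query the installed distribution's version from Python. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version("cesnet-datazoo")
if found is None:
    print("cesnet-datazoo is not installed in this environment")
else:
    print(f"cesnet-datazoo {found} is installed")
```

Querying the distribution metadata avoids importing the package itself, so the check stays fast and has no side effects.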

Examples

Initialize dataset to create train, validation, and test dataframes

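from cesnet_datazoo.datasets import CESNET_QUIC22
from cesnet_datazoo.config import DatasetConfig, AppSelection

# Use the XS size of CESNET-QUIC22, stored under the given data directory
dataset = CESNET_QUIC22("/datasets/CESNET-QUIC22/", size="XS")
# Treat all application classes as known; train on one period, test on the next
dataset_config = DatasetConfig(
    dataset=dataset,
    apps_selection=AppSelection.ALL_KNOWN,
    train_period_name="W-2022-44",
    test_period_name="W-2022-45",
)
# Apply the configuration and initialize the train, validation, and test sets
dataset.set_dataset_config_and_initialize(dataset_config)
# Read each set into a pandas DataFrame
train_dataframe = dataset.get_train_df()
val_dataframe = dataset.get_val_df()
test_dataframe = dataset.get_test_df()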

The DatasetConfig class handles the configuration of datasets, and calling set_dataset_config_and_initialize initializes the train, validation, and test sets with the desired configuration. Data can be read into pandas DataFrames, as shown here, or via PyTorch DataLoaders. See the CesnetDataset reference for details.
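
The period names in the example, such as "W-2022-44", look like week identifiers. The following sketch maps such a name to calendar dates, assuming the name encodes an ISO year and ISO week number; this is an assumption made here and is not confirmed by the text above. The helper `week_period_to_dates` is hypothetical and not part of cesnet-datazoo.

```python
from datetime import date, timedelta


def week_period_to_dates(period_name):
    """Map a name like 'W-2022-44' to the (Monday, Sunday) of that ISO week.

    Assumption: the name has the form 'W-<ISO year>-<ISO week number>'.
    """
    prefix, year, week = period_name.split("-")
    if prefix != "W":
        raise ValueError(f"not a week period name: {period_name!r}")
    monday = date.fromisocalendar(int(year), int(week), 1)  # day 1 is Monday
    return monday, monday + timedelta(days=6)


for name in ("W-2022-44", "W-2022-45"):
    start, end = week_period_to_dates(name)
    print(f"{name}: {start.isoformat()} to {end.isoformat()}")
```

Under this assumption, W-2022-44 starts on 31.10.2022, which matches the first day of the CESNET-QUIC22 collection period, and W-2022-45 is the week that follows.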

See more examples in the documentation.

Papers

Acknowledgments

This project was supported by the Ministry of the Interior of the Czech Republic, grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis.
