A toolkit for large network traffic datasets
The goal of this project is to provide tools for working with large network traffic datasets and to facilitate research in the traffic classification area. The core functions of the cesnet-datazoo package are:
- A common API for downloading, configuring, and loading of four public datasets of encrypted network traffic.
- Extensive configuration options for:
  - Selection of train, validation, and test periods.
  - Selection of application classes and splitting classes between known and unknown.
  - Data transformations, such as feature scaling.
- Built on suitable data structures for experiments with large datasets. There are several caching mechanisms to make repeated runs faster, for example, when searching for the best model configuration.
- Datasets are offered in multiple sizes so that users can start experimenting at a smaller scale, with a faster download and lower disk usage. The default is the S size, containing 25 million samples.
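To illustrate what splitting application classes between known and unknown means, the sketch below holds out a random subset of classes as unknown. It is a standalone example: the helper `split_known_unknown`, its parameters, and the application labels are hypothetical and do not mirror the cesnet-datazoo implementation, which is configured through `DatasetConfig`.

```python
import random


def split_known_unknown(apps, unknown_fraction=0.2, seed=0):
    """Split application classes into known and unknown groups.

    Hypothetical helper for illustration only, not part of cesnet-datazoo.
    """
    if not 0.0 <= unknown_fraction <= 1.0:
        raise ValueError("unknown_fraction must be between 0 and 1")
    # Sort first so the result depends only on the set of classes and the seed.
    shuffled = sorted(apps)
    random.Random(seed).shuffle(shuffled)
    n_unknown = round(len(shuffled) * unknown_fraction)
    unknown = sorted(shuffled[:n_unknown])
    known = sorted(shuffled[n_unknown:])
    return known, unknown


# Made-up application labels.
apps = ["dns", "email", "maps", "music", "video"]
known, unknown = split_known_unknown(apps, unknown_fraction=0.4, seed=1)
print("known:", known)
print("unknown:", unknown)
```

Fixing the seed keeps the split reproducible across runs, which matters when results from several model configurations are compared against the same set of unknown classes.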
:brain: :brain: See a related project CESNET Models providing pretrained neural networks for traffic classification. :brain: :brain:
:notebook: :notebook: Example Jupyter notebooks are included in a separate Traffic Classification Examples repository. :notebook: :notebook:
:rocket: :rocket: Transfer Learning Codebase for reproducing experiments from our paper — covering ten downstream traffic classification tasks with three transfer approaches (k-NN, linear probing, and full model fine-tuning). :rocket: :rocket:
Datasets
The cesnet-datazoo package currently provides four datasets with details in the following table (you might need to scroll the table horizontally to see all datasets).
| Name | CESNET-TLS22 | CESNET-QUIC22 | CESNET-TLS-Year22 | CESNET-QUICEXT-25 |
|---|---|---|---|---|
| Protocol | TLS | QUIC | TLS | QUIC |
| Published in | 2022 | 2023 | 2024 | 2025 |
| Collection duration | 2 weeks | 4 weeks | 1 year | 1 year |
| Collection period | 4.10.2021 - 17.10.2021 | 31.10.2022 - 27.11.2022 | 1.1.2022 - 31.12.2022 | 1.6.2024 - 31.5.2025 |
| Application count | 191 | 102 | 180 | 50 |
| Available samples | 141392195 | 153226273 | 507739073 | 194296462 |
| Available dataset sizes | XS, S, M, L | XS, S, M, L | XS, S, M, L | XS, S, M, L |
| Cite | https://doi.org/10.1016/j.comnet.2022.109467 | https://doi.org/10.1016/j.dib.2023.108888 | https://doi.org/10.1038/s41597-024-03927-4 | |
| Zenodo URL | https://zenodo.org/record/7965515 | https://zenodo.org/record/7963302 | https://zenodo.org/records/10608607 | https://zenodo.org/records/17249078 |
| Related papers | https://doi.org/10.23919/TMA58422.2023.10199052 | https://doi.org/10.1145/3768988 |
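The collection durations in the table can be checked against the listed collection periods with a few lines of date arithmetic. This sketch assumes the dates are written as day.month.year and that both endpoints are included; `duration_days` is a hypothetical helper, not part of the package.

```python
from datetime import date

# Collection periods copied from the table above as (start, end), both inclusive.
COLLECTION_PERIODS = {
    "CESNET-TLS22": (date(2021, 10, 4), date(2021, 10, 17)),
    "CESNET-QUIC22": (date(2022, 10, 31), date(2022, 11, 27)),
    "CESNET-TLS-Year22": (date(2022, 1, 1), date(2022, 12, 31)),
    "CESNET-QUICEXT-25": (date(2024, 6, 1), date(2025, 5, 31)),
}


def duration_days(start, end):
    """Number of calendar days in the period, counting both endpoints."""
    return (end - start).days + 1


for name, (start, end) in COLLECTION_PERIODS.items():
    print(f"{name}: {duration_days(start, end)} days")
```

Under these assumptions the periods come to 14, 28, 365, and 365 days, which agrees with the stated durations of 2 weeks, 4 weeks, and 1 year.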
Installation
Install the package from PyPI using pip:
pip install cesnet-datazoo
or, for an editable install from the repository:
pip install -e git+https://github.com/CESNET/cesnet-datazoo#egg=cesnet-datazoo
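To confirm that the installation succeeded, the installed version can be queried with the standard library. This is a minimal sketch; `installed_version` is a hypothetical helper, and the distribution name is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version("cesnet-datazoo")
print(f"cesnet-datazoo: {found or 'not installed'}")
```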
Examples
Initialize dataset to create train, validation, and test dataframes
from cesnet_datazoo.datasets import CESNET_QUIC22
from cesnet_datazoo.config import DatasetConfig, AppSelection
dataset = CESNET_QUIC22("/datasets/CESNET-QUIC22/", size="XS")
dataset_config = DatasetConfig(
    dataset=dataset,
    apps_selection=AppSelection.ALL_KNOWN,
    train_period_name="W-2022-44",
    test_period_name="W-2022-45",
)
dataset.set_dataset_config_and_initialize(dataset_config)
train_dataframe = dataset.get_train_df()
val_dataframe = dataset.get_val_df()
test_dataframe = dataset.get_test_df()
The DatasetConfig class handles the configuration of datasets, and calling set_dataset_config_and_initialize initializes train, validation, and test sets with the desired configuration.
Data can be read into Pandas DataFrames as shown here or via PyTorch DataLoaders. See CesnetDataset reference.
See more examples in the documentation.
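The period names in the example above, such as `W-2022-44`, appear to identify a calendar week. The sketch below maps such a name to its date range, assuming a `W-<year>-<ISO week>` convention. That convention is an assumption made here for illustration, and `week_period_bounds` is a hypothetical helper, not part of cesnet-datazoo.

```python
from datetime import date, timedelta


def week_period_bounds(period_name):
    """Return (Monday, Sunday) for a name such as "W-2022-44".

    Assumes the "W-<year>-<ISO week>" convention; hypothetical helper,
    not part of cesnet-datazoo.
    """
    prefix, year, week = period_name.split("-")
    if prefix != "W":
        raise ValueError(f"not a week period name: {period_name!r}")
    monday = date.fromisocalendar(int(year), int(week), 1)
    return monday, monday + timedelta(days=6)


for name in ("W-2022-44", "W-2022-45"):
    start, end = week_period_bounds(name)
    print(f"{name}: {start} to {end}")
```

Under this assumption, `W-2022-44` starts on 31 October 2022, which coincides with the first day of the CESNET-QUIC22 collection period, so the example trains on the first collected week and tests on the second.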
Papers
- DataZoo: Streamlining Traffic Classification Experiments
  Jan Luxemburk and Karel Hynek
  CoNEXT Workshop on Explainable and Safety Bounded, Fidelitous, Machine Learning for Networking (SAFE), 2023
- CESNET-TLS-Year22: A year-spanning TLS network traffic dataset from backbone lines
  Karel Hynek, Jan Luxemburk, Jaroslav Pešek, Tomáš Čejka, and Pavel Šiška
  Scientific Data (Nature Portfolio), 2024
- CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines
  Jan Luxemburk, Karel Hynek, Tomáš Čejka, Andrej Lukačovič, and Pavel Šiška
  Data in Brief, 2023
Acknowledgments
This project was supported by the Ministry of the Interior of the Czech Republic, grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis.