UNLEASH: Semantic-based Log Parser with Pre-trained Language Models

UNLEASH is a semantic-based log parsing framework. This repository includes artifacts for reuse and reproduction of experimental results presented in our ICSE'25 paper titled "Unleashing the True Potential of Semantic-based Log Parsing with Pre-trained Language Models".

Purpose

The artifacts in this repository provide the UNLEASH tool along with the necessary benchmarks and scripts, facilitating its reuse and enabling replication of the associated study.

Provenance

Our artifacts are available via public archival repositories, including:

Data

The datasets used in the study are publicly available at: https://zenodo.org/record/8275861. The storage requirements for the datasets are approximately 966 MB (compressed) and 13 GB (uncompressed) for 14 datasets.

When UNLEASH runs, it automatically downloads the datasets and extracts them to the datasets folder by default. You can also download the datasets manually and extract them into the datasets folder. The datasets should be organized as follows:

📦 UNLEASH
├─ datasets
│  └─ loghub-2.0
│     ├─ Apache
│     │  ├─ Apache_full.log
│     │  ├─ Apache_full.log_structured.csv
│     │  ├─ Apache_full.log_structured_corrected.csv
│     │  ├─ Apache_full.log_templates.csv
│     │  └─ Apache_full.log_templates_corrected.csv
│     ├─ ...
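
If you extract the datasets manually, a quick check of the layout can catch a misplaced folder before a run. The following is a minimal sketch, not part of UNLEASH: the helper name `missing_dataset_files` is hypothetical, and it only checks the three files that the automatic download is shown to extract (the `_corrected.csv` files are not checked).

```python
from pathlib import Path

# Suffixes of the files extracted from a dataset archive; the helper below is
# a hypothetical illustration and not part of the UNLEASH package.
EXPECTED_SUFFIXES = (
    "_full.log",
    "_full.log_structured.csv",
    "_full.log_templates.csv",
)


def missing_dataset_files(root, dataset):
    """Return the expected files that are absent under root/loghub-2.0/<dataset>."""
    folder = Path(root) / "loghub-2.0" / dataset
    expected = [folder / f"{dataset}{suffix}" for suffix in EXPECTED_SUFFIXES]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Assumes the script is started from the repository root.
    missing = missing_dataset_files("datasets", "Apache")
    if missing:
        print("Missing files:")
        for path in missing:
            print(f"  {path}")
    else:
        print("Apache dataset looks complete.")
```

An empty result means the three files are in place for that dataset; it says nothing about whether their contents are intact.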

Setup

The code is implemented in Python 3.9. To stably reproduce the experimental results in our paper, we recommend a machine with at least a 4-core CPU, an 8 GB GPU, 16 GB RAM, and ~50 GB of available disk space, running Ubuntu 20.04 or Ubuntu 22.04. The full requirements to run the code are listed in REQUIREMENTS.md.

Install Python 3.9

We recommend using Python 3.9+ to run the code.

sudo apt update
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.9 python3.9-venv python3.9-dev

Clone UNLEASH from GitHub

git clone https://github.com/LogIntelligence/UNLEASH.git && cd UNLEASH

Create and activate a virtual environment

We recommend creating a virtual environment to run the code.

python3.9 -m venv env
source env/bin/activate

Install UNLEASH from PyPI or Build from source

You can install UNLEASH from PyPI or build from source.

# Install from PyPI
pip install icse-unleash

# Build from source
pip install -e .
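
To confirm that the package is visible to the active virtual environment, you can query the installed distribution metadata. This is a minimal sketch using only the standard library; it assumes the distribution is registered as `icse-unleash` (the name used in the PyPI install command), which may differ for a source build.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="icse-unleash"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found:
        print(f"icse-unleash version: {found}")
    else:
        print("icse-unleash is not installed in this environment")
```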

Usage

Test the installation

pytest tests/test.py
Expected output
============================== test session starts ===============================
platform linux -- Python 3.9.21, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/ubuntu/Documents/UNLEASH
collected 3 items                                                                

tests/test.py ...                                                          [100%]

=============================== 3 passed in 3.93s ================================

Basic usage

To perform log parsing on a specific dataset, set the dataset environment variable and change the working directory to the examples folder.

export dataset=Apache
cd examples

1. Run sampling for a specific dataset

python 01_sampling.py --dataset $dataset --sampling_method unleash
Expected output
Apache
Loading Apache/Apache_full.log...
https://zenodo.org/records/8275861/files/Apache.zip
--2025-01-15 10:06:19--  https://zenodo.org/records/8275861/files/Apache.zip
Resolving zenodo.org (zenodo.org)... 188.185.45.92, 188.185.48.194, 188.185.43.25, ...
Connecting to zenodo.org (zenodo.org)|188.185.45.92|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 578629 (565K) [application/octet-stream]
Saving to: ‘../datasets/loghub-2.0/Apache.zip’

../datasets/loghub-2.0/Apache 100%[==============================================>] 565.07K   276KB/s    in 2.0s    

2025-01-15 10:06:22 (276 KB/s) - ‘../datasets/loghub-2.0/Apache.zip’ saved [578629/578629]

Archive:  ../datasets/loghub-2.0/Apache.zip
  inflating: ../datasets/loghub-2.0/Apache/Apache_full.log  
  inflating: ../datasets/loghub-2.0/Apache/Apache_full.log_structured.csv  
  inflating: ../datasets/loghub-2.0/Apache/Apache_full.log_templates.csv  
Loaded 51978 logs.
Build vocab with examples:  4125
Number of coarse-grained clusters:  25
Number of fine-grained clusters:  31
hierarchical clustering time:  0.018030643463134766
Shot:  8 Coarse size:  25
8-shot sampling time:  0.03555607795715332
Shot:  16 Coarse size:  25
16-shot sampling time:  0.027220964431762695
Shot:  32 Coarse size:  25
32-shot sampling time:  0.053362369537353516
Shot:  64 Coarse size:  25
64-shot sampling time:  0.13954639434814453
Shot:  128 Coarse size:  25
128-shot sampling time:  0.2863941192626953
Shot:  256 Coarse size:  25
256-shot sampling time:  0.6433525085449219
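
If you redirect the sampling output to a file, the per-shot timings can be collected for comparison across datasets. The sketch below is an illustration rather than part of UNLEASH: the function name is hypothetical, the pattern is derived only from the output lines shown above, and the values are assumed to be seconds.

```python
import re

# Matches lines such as "32-shot sampling time:  0.053362369537353516".
TIMING_PATTERN = re.compile(r"^(\d+)-shot sampling time:\s+([0-9.eE+-]+)$")


def parse_sampling_times(output):
    """Map each shot count to its reported sampling time."""
    times = {}
    for line in output.splitlines():
        match = TIMING_PATTERN.match(line.strip())
        if match:
            times[int(match.group(1))] = float(match.group(2))
    return times


if __name__ == "__main__":
    sample = (
        "Shot:  8 Coarse size:  25\n"
        "8-shot sampling time:  0.03555607795715332\n"
        "Shot:  16 Coarse size:  25\n"
        "16-shot sampling time:  0.027220964431762695\n"
    )
    for shot, seconds in parse_sampling_times(sample).items():
        # Assumes the reported values are seconds.
        print(f"{shot:>3}-shot: {seconds * 1000:.1f} ms")
```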

2. Run UNLEASH on a specific dataset

python 02_run_unleash.py --log_file ../datasets/loghub-2.0/$dataset/${dataset}_full.log_structured.csv --model_name_or_path roberta-base --train_file ../datasets/loghub-2.0/$dataset/samples/unleash_32.json --validation_file ../datasets/loghub-2.0/$dataset/validation.json --dataset_name $dataset --parsing_num_processes 1 --output_dir ../results --max_train_steps 1000
Expected output
Generating train split: 32 examples [00:00, 28220.72 examples/s]
Generating validation split: 10395 examples [00:00, 4274908.33 examples/s]
2025-01-15 10:07:14,564 | unleash | DEBUG | Apache loaded with 32 train samples
2025-01-15 10:07:14,564 | unleash | DEBUG | Text column name: log - Label column name: template
Running tokenizer on train dataset: 100%|███████████████████████████████████| 32/32 [00:00<00:00, 2985.34 examples/s]
Running tokenizer on test dataset (num_proc=4): 100%|████████████████| 10395/10395 [00:00<00:00, 20829.57 examples/s]
2025-01-15 10:07:15,135 | unleash | DEBUG | {'train': Dataset({
    features: ['input_ids', 'labels', 'ori_labels', 'attention_mask'],
    num_rows: 32
}), 'validation': Dataset({
    features: ['input_ids', 'labels', 'ori_labels', 'attention_mask'],
    num_rows: 10395
})}
2025-01-15 10:07:15,135 | unleash | DEBUG | Train dataloader: <torch.utils.data.dataloader.DataLoader object at 0x7907fc1e2790>
2025-01-15 10:07:15,135 | unleash | DEBUG | Validation dataloader: <torch.utils.data.dataloader.DataLoader object at 0x7907fc1e2550>
2025-01-15 10:07:15,136 | unleash | INFO | Initialized Trainer
2025-01-15 10:07:15,136 | unleash | INFO | ***** Running training *****
2025-01-15 10:07:15,136 | unleash | INFO |   Num examples = 32
2025-01-15 10:07:15,136 | unleash | INFO |   Num Epochs = 500
2025-01-15 10:07:15,136 | unleash | INFO |   Instantaneous batch size per device = 16
2025-01-15 10:07:15,136 | unleash | INFO |   Total train batch size (w. parallel, distributed & accumulation) = 16
2025-01-15 10:07:15,136 | unleash | INFO |   Gradient Accumulation steps = 1
2025-01-15 10:07:15,136 | unleash | INFO |   Total optimization steps = 1000
Loss: 0.004792781546711922: 100%|████████████████████████████████████████████████| 1000/1000 [01:05<00:00, 15.16it/s]
2025-01-15 10:08:21,103 | unleash | INFO | Starting template extraction
Parsing: 100%|██████████████████████████████████████████████████████████████| 51978/51978 [00:00<00:00, 62204.15it/s]
2025-01-15 10:08:21,939 | unleash | INFO | Total time taken: 0.20595479011535645
2025-01-15 10:08:21,939 | unleash | INFO | No of model invocations: 29
2025-01-15 10:08:21,939 | unleash | INFO | Total time taken by model: 0.11258220672607422

3. Evaluate UNLEASH on a specific dataset

python 03_evaluation.py --output_dir ../results --dataset $dataset
Expected output
=== Evaluation on Apache ===
../results/logs/Apache_full.log_structured.csv
Start to align with null values
100%|████████████████████████████████████████████████████| 51978/51978 [00:00<00:00, 220944.35it/s]
100%|████████████████████████████████████████████████████| 51978/51978 [00:00<00:00, 220116.95it/s]
Start compute grouping accuracy
100%|████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 1057.17it/s]
Grouping_Accuracy (GA): 1.0000, FGA: 1.0000,
Grouping Accuracy calculation done. [Time taken: 0.039]
Parsing_Accuracy (PA): 0.9953
Parsing Accuracy calculation done. [Time taken: 0.002]
100%|███████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 14847.09it/s]
PTA: 0.8000, RTA: 0.8000 FTA: 0.8000
Identify : 30, Groundtruth : 30
Template-level accuracy calculation done. [Time taken: 0.010]
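
The template-level numbers above can be read as precision, recall, and their harmonic mean over templates. The sketch below assumes the evaluation script follows the usual definitions (PTA: correctly identified templates over identified templates; RTA: correctly identified templates over ground-truth templates; FTA: harmonic mean of the two). The function is a hypothetical illustration, and the count of 24 correct templates is inferred from the reported 0.8000 values, not printed by the script.

```python
def template_accuracy(correct, identified, ground_truth):
    """Return (PTA, RTA, FTA) computed from template counts."""
    pta = correct / identified if identified else 0.0
    rta = correct / ground_truth if ground_truth else 0.0
    fta = 2 * pta * rta / (pta + rta) if (pta + rta) else 0.0
    return pta, rta, fta


if __name__ == "__main__":
    # 30 identified and 30 ground-truth templates come from the output above;
    # 24 correct templates is an inferred value that reproduces PTA = RTA = 0.8.
    pta, rta, fta = template_accuracy(correct=24, identified=30, ground_truth=30)
    print(f"PTA: {pta:.4f}, RTA: {rta:.4f} FTA: {fta:.4f}")
```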

Reproducibility

Parsing Performance

To reproduce the parsing performance, you can run the following command:

cd examples
bash benchmark.sh

The parsing accuracy (parsing_accuracy.csv) and parsing time (time_cost.json) are saved in the corresponding folders under the ../results directory (e.g., ../results/iteration_01/logs).

Scalability and Generalization

  • Scalability: The scalability of UNLEASH is reflected in the parsing time and accuracy with different numbers of parsing processes. To run UNLEASH with different numbers of parsing processes, you can set the parsing_num_processes parameter in the 02_run_unleash.py script and run Step 2 again:
export num_processes=4

python 02_run_unleash.py --log_file ../datasets/loghub-2.0/$dataset/${dataset}_full.log_structured.csv --model_name_or_path roberta-base --train_file ../datasets/loghub-2.0/$dataset/samples/unleash_32.json --validation_file ../datasets/loghub-2.0/$dataset/validation.json --dataset_name $dataset --parsing_num_processes $num_processes --output_dir ../results --max_train_steps 1000
  • Generalization: The generalization of UNLEASH is reflected in the parsing accuracy on different pre-trained language models and numbers of training examples.

    • To run UNLEASH with different pre-trained language models, you can set the model_name_or_path parameter in the 02_run_unleash.py script and run Step 2 again:
    export model_name="roberta-base" # currently, we support roberta-base, microsoft/deberta-base, microsoft/codebert-base, and huggingface/CodeBERTa-small-v1
    python 02_run_unleash.py --log_file ../datasets/loghub-2.0/$dataset/${dataset}_full.log_structured.csv --model_name_or_path $model_name --train_file ../datasets/loghub-2.0/$dataset/samples/unleash_32.json --validation_file ../datasets/loghub-2.0/$dataset/validation.json --dataset_name $dataset --parsing_num_processes 1 --output_dir ../results --max_train_steps 1000
    
    • To run UNLEASH with different numbers of training examples, you can set the train_file parameter in the 02_run_unleash.py script and run Step 2 again:
    export shot=64 # can be [32, 64, 128, 256]
    python 02_run_unleash.py --log_file ../datasets/loghub-2.0/$dataset/${dataset}_full.log_structured.csv --model_name_or_path roberta-base --train_file ../datasets/loghub-2.0/$dataset/samples/unleash_$shot.json --validation_file ../datasets/loghub-2.0/$dataset/validation.json --dataset_name $dataset --parsing_num_processes 1 --output_dir ../results --max_train_steps 1000
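
The scalability and generalization runs above differ only in a few arguments, so the Step 2 command can be generated for each configuration. The following is a minimal sketch, assuming it is used from the examples folder; the `build_command` helper is hypothetical, and the flags and paths are copied from the commands above. It prints the commands instead of executing them.

```python
import itertools
import shlex


def build_command(dataset, model="roberta-base", shot=32, num_processes=1,
                  data_root="../datasets/loghub-2.0", output_dir="../results",
                  max_train_steps=1000):
    """Return the Step 2 command as an argument list."""
    base = f"{data_root}/{dataset}"
    return [
        "python", "02_run_unleash.py",
        "--log_file", f"{base}/{dataset}_full.log_structured.csv",
        "--model_name_or_path", model,
        "--train_file", f"{base}/samples/unleash_{shot}.json",
        "--validation_file", f"{base}/validation.json",
        "--dataset_name", dataset,
        "--parsing_num_processes", str(num_processes),
        "--output_dir", output_dir,
        "--max_train_steps", str(max_train_steps),
    ]


if __name__ == "__main__":
    models = ["roberta-base", "microsoft/codebert-base"]
    shots = [32, 64]
    for model, shot in itertools.product(models, shots):
        # Printed rather than executed; pass the list to
        # subprocess.run(..., check=True) to launch a run.
        print(shlex.join(build_command("Apache", model=model, shot=shot)))
```

Whether successive runs overwrite each other's files in the same output directory is not stated here, so passing a distinct output_dir per configuration may be the safer choice.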
    

Other Settings

UNLEASH provides various settings to customize the parsing process. You can set the following main parameters:

  • For sampling (Step 1 - 01_sampling.py):
    • sampling_method: The sampling method to use for selecting training examples. Currently, we support unleash, lilac, and logppt. To sample using all methods, set sampling_method to all.
  • For parsing (Step 2 - 02_run_unleash.py):
    • model_name_or_path: The pre-trained language model to use for parsing. Currently, we support roberta-base, microsoft/deberta-base, microsoft/codebert-base, and huggingface/CodeBERTa-small-v1.
    • train_file: The path to the training examples.
    • max_train_steps: The maximum number of training steps.
    • save_model: Whether to save the trained model.
    • parsing_num_processes: The number of processes to use for parsing.
  • To view all available parameters, you can run:
python 02_run_unleash.py --help

Download Paper

The paper is available at ICSE_25___Unleash.pdf.

Citation

@inproceedings{le2025unleash,
  title={Unleashing the True Potential of Semantic-based Log Parsing with Pre-trained Language Models},
  author={Le, Van-Hoang and Xiao, Yi and Zhang, Hongyu},
  booktitle={Proceedings of the 47th International Conference on Software Engineering},
  year={2025}
}

Contact

For any questions, please contact Van-Hoang Le.
