A tool for batch parsing of log files
LogBatcher
LogBatcher is a cost-effective LLM-based log parser that requires no training process or labeled data. This repository includes artifacts for reuse and reproduction of experimental results presented in our ASE'24 paper titled "Demonstration-Free: Towards More Practical Log Parsing with Large Language Models".
Work Flow
LogBatcher contains three main components: Partitioning, Caching, and Batching-Querying.
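The README names the three components but does not describe their internals, so the following is only a toy sketch of how such stages could fit together. Everything in it is an assumption made for illustration: grouping by token count stands in for the real partitioning step, `fake_llm` stands in for the LLM query, and the `<*>` wildcard follows the common Loghub template convention. None of these function names come from LogBatcher.

```python
import re
from collections import defaultdict


def template_to_regex(template):
    """Compile a template with <*> wildcards into a full-line regex."""
    parts = [re.escape(part) for part in template.split("<*>")]
    return re.compile("^" + "(.*?)".join(parts) + "$")


def partition(logs):
    """Toy partitioning: group logs that have the same number of tokens."""
    groups = defaultdict(list)
    for log in logs:
        groups[len(log.split())].append(log)
    return list(groups.values())


def fake_llm(batch):
    """Stand-in for the LLM: token positions that vary in the batch become <*>."""
    rows = [log.split() for log in batch]
    return " ".join(
        column[0] if len(set(column)) == 1 else "<*>" for column in zip(*rows)
    )


def lookup(cache, log):
    """Return the first cached template whose regex matches the log, else None."""
    return next((tpl for tpl, rx in cache if rx.match(log)), None)


def parse(logs, batch_size=10, query=fake_llm):
    """Partition logs, reuse cached templates, and query only for cache misses."""
    cache = []   # list of (template, compiled regex)
    parsed = {}  # log line -> template
    for group in partition(logs):
        pending = group
        while pending:
            misses = []
            for log in pending:
                template = lookup(cache, log)
                if template is None:
                    misses.append(log)
                else:
                    parsed[log] = template
            if not misses:
                break
            # Query once for a batch of unmatched logs, then cache the answer.
            template = query(misses[:batch_size])
            cache.append((template, template_to_regex(template)))
            # Assign the first miss directly so every iteration makes progress.
            parsed[misses[0]] = template
            pending = misses[1:]
    return parsed


if __name__ == "__main__":
    demo = ["Connected to 10.0.0.1", "Connected to 10.0.0.2", "Disk full"]
    for line, template in parse(demo).items():
        print(f"{line!r} -> {template!r}")
```

The point of the cache in this sketch is that the second `Connected to ...` line is resolved by a regex match and never reaches the query function, which is where the cost saving of such a design would come from.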
Setup
Get started
To run in a local environment:
Clone LogBatcher from GitHub:
git clone https://github.com/LogIntelligence/LogBatcher.git && cd LogBatcher
The code is implemented in Python 3.8. To install the required packages, run the following command (conda is optional):
conda create -n logbatcher python==3.8
conda activate logbatcher
pip install -r requirements.txt
Set your OpenAI API key in config.json:
{
"api_key_from_openai": "Your API Key from OpenAI"
}
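A quick check that the key has actually been filled in can save a failed run later. The sketch below assumes only the `api_key_from_openai` field shown above; the helper names `validate_config` and `load_api_key` are hypothetical and are not part of LogBatcher.

```python
import json

PLACEHOLDER = "Your API Key from OpenAI"


def validate_config(config):
    """Return the API key, or raise if it is missing or still the placeholder."""
    key = config.get("api_key_from_openai", "").strip()
    if not key or key == PLACEHOLDER:
        raise ValueError("Set api_key_from_openai in config.json first")
    return key


def load_api_key(path="config.json"):
    """Read config.json from disk and return the validated API key."""
    with open(path, encoding="utf-8") as f:
        return validate_config(json.load(f))
```

Calling `load_api_key()` from the repository root before a long run would surface a missing key immediately.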
To run with Docker:
Download the Docker image from Zenodo.
Then run the following commands:
docker build -t logbatcher .
docker run -it logbatcher
Project Tree
📦LogBatcher
 ┣ 📂datasets
 ┃ ┣ 📂loghub-2k
 ┃ ┃ ┣ 📂Android
 ┃ ┃ ┃ ┣ 📜Android_2k.log
 ┃ ┃ ┃ ┣ 📜Android_2k.log_structured.csv
 ┃ ┃ ┃ ┣ 📜Android_2k.log_templates.csv
 ┃ ┃ ┃ ┣ 📜Android_2k.log_structured_corrected.csv
 ┃ ┃ ┃ ┗ 📜Android_2k.log_templates_corrected.csv
 ┃ ┃ ┣ ...
 ┃ ┗ 📂loghub-2.0
 ┣ 📂evaluation
 ┃ ┣ 📂utils
 ┃ ┣ 📜logbatcher_eval.py
 ┃ ┗ 📜settings.py
 ┣ 📂logbatcher
 ┃ ┣ 📜additional_cluster.py
 ┃ ┣ 📜cluster.py
 ┃ ┣ 📜parser.py
 ┃ ┣ 📜matching.py
 ┃ ┣ 📜parsing_base.py
 ┃ ┣ 📜postprocess.py
 ┃ ┣ 📜sample.py
 ┃ ┗ 📜util.py
 ┣ 📂outputs
 ┃ ┣ 📂figures
 ┃ ┗ 📂parser
 ┣ 📜README.md
 ┣ 📜benchmark.py
 ┣ 📜config.json
 ┣ 📜requirements.txt
 ┗ 📜demo.py
Usage
Data format
LogBatcher mainly takes a raw log file (in plain text format) as input and outputs the parsed log file (in CSV format). A raw log file is a log file with each line representing a complete log.
Following the data format from LOGPAI, the data can also be a structured log file. A structured log file is a CSV file that includes at least the LineID and Content columns for parsing, with optional EventID and EventTemplate columns for evaluation.
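To make the two input formats concrete, here is a minimal sketch that reads each of them with the standard library. It is not LogBatcher's `data_loader`: the helper names `read_raw` and `read_structured` are hypothetical, and the raw reader returns whole lines without applying any dataset-specific header format.

```python
import csv
import io


def read_raw(stream):
    """One log per line; blank lines are skipped."""
    return [line.rstrip("\r\n") for line in stream if line.strip()]


def read_structured(stream):
    """CSV with at least LineID and Content columns; returns the Content values."""
    reader = csv.DictReader(stream)
    missing = {"LineID", "Content"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return [row["Content"] for row in reader]


if __name__ == "__main__":
    # In-memory stand-ins for a raw log file and a structured CSV file.
    raw = io.StringIO("Job 12 finished\nJob 13 failed\n")
    structured = io.StringIO('LineID,Content\n1,Job 12 finished\n2,"Error, retrying"\n')
    print(read_raw(raw))
    print(read_structured(structured))
```

With real files, the same functions can be given handles from `open(path, encoding="utf-8", newline="")`.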
Usage example
We provide a usage example for easier reuse. It parses the Apache dataset from LOGPAI. To run LogBatcher on your own data, replace the file_name and dataset_format arguments with the path to your raw log file and the matching dataset format used to extract the log contents. The results are written to the outputs/parser/test folder.
import json
from logbatcher.parsing_base import single_dataset_paring
from logbatcher.parser import Parser
from logbatcher.util import data_loader

# load api key, dataset format and parser
model, dataset, folder_name = 'gpt-3.5-turbo-0125', 'Apache', 'test'
with open('config.json', 'r') as f:
    config = json.load(f)
parser = Parser(model, folder_name, config)

# load contents from raw log file, structured log file or content list
contents = data_loader(
    file_name=f"datasets/loghub-2k/{dataset}/{dataset}_2k.log",
    dataset_format=config['datasets_format'][dataset],
    file_format='raw'
)

# parse logs
single_dataset_paring(
    dataset=dataset,
    contents=contents,
    output_dir=f'outputs/parser/{folder_name}/',
    parser=parser,
    debug=False
)
Expected output
python demo.py
Parsing 2000 logs in dataset Apache...
100%|██████████████████████████████████| 2000/2000 [00:04<00:00, 420.55log/s]
parsing time: 4.756490230560303
idetified templates: 6
Example Evaluation
To evaluate the output of the usage example, run the following command:
cd evaluation && python logbatcher_eval.py --config test --dataset Apache
Expected output
Calculating Edit Distance....
100%|███████████████████████████████████████████████████████████| 2000/2000 [00:00<00:00, 4029110.47it/s]
Normalized_Edit_distance (NED): 1.0000, ED: 0.0000,
Grouping Accuracy calculation done. [Time taken: 0.002]
Start compute grouping accuracy
100%|███████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 2084.64it/s]
Grouping_Accuracy (GA): 1.0000, FGA: 1.0000,
Grouping Accuracy calculation done. [Time taken: 0.006]
Parsing_Accuracy (PA): 1.0000
Parsing Accuracy calculation done. [Time taken: 0.001]
100%|███████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 10677.06it/s]
PTA: 1.0000, RTA: 1.0000 FTA: 1.0000
Identify : 6, Groundtruth : 6
Template-level accuracy calculation done. [Time taken: 0.003]
The evaluation metric results can be found in the outputs/parser/test folder.
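For readers unfamiliar with the GA and PA figures in the output above, the sketch below computes both from two parallel lists of templates. It follows the definitions commonly used in log parsing benchmarks (an assumption here, since the README does not define them) and is a simplification, not the code in evaluation/utils. The function names are hypothetical.

```python
from collections import defaultdict


def parsing_accuracy(predicted, truth):
    """Fraction of logs whose predicted template string equals the ground truth."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def grouping_accuracy(predicted, truth):
    """Fraction of logs whose predicted group holds exactly the same log lines
    as the ground-truth group of those lines."""
    pred_groups = defaultdict(set)
    true_groups = defaultdict(set)
    for i, (p, t) in enumerate(zip(predicted, truth)):
        pred_groups[p].add(i)
        true_groups[t].add(i)
    correct = 0
    for members in pred_groups.values():
        # Compare against the true group of any one member of this predicted group.
        true_template = truth[next(iter(members))]
        if true_groups[true_template] == members:
            correct += len(members)
    return correct / len(truth)


if __name__ == "__main__":
    truth = ["A <*>", "A <*>", "B"]
    predicted = ["A <*>", "A <*>", "B done"]
    print("GA:", grouping_accuracy(predicted, truth))
    print("PA:", parsing_accuracy(predicted, truth))
```

The example input shows why the two metrics can diverge: the grouping is perfect even though one template string is wrong, so GA is higher than PA.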
Benchmark
Prepare datasets
We have already provided the loghub-2k datasets in the datasets/loghub-2k folder.
If you want to benchmark on the Loghub-2.0 datasets, please run datasets/loghub-2.0/download.sh or download the datasets manually:
- Datasets DOI:
- Datasets Homepage: Loghub-2.0
Reproduce
To benchmark on all datasets in loghub-2k or loghub-2.0, you can run the following commands:
python benchmark.py --data_type [DATATYPE] --model [MODEL] --batch_size [BATCHSIZE] --chunk_size [CHUNKSIZE] --sampling_method [SAMPLINGMETHOD]
The description of the arguments can be found in benchmark.py or below:
--data_type
Datasets type, Options: ['2k', 'full'], default: '2k'.
--model
The large language model used in LogBatcher, default: 'gpt-3.5-turbo-0125'.
--batch_size
size of a batch query, default: 10.
--chunk_size
size of a log chunk, default: 2000.
--clustering_method
clustering method used in the partitioning stage, Options: ['dbscan', 'meanshift', 'hierarchical'], default: 'dbscan'.
--sampling_method
sampling method used in the batching stage, Options: ['dpp', 'similar', 'random'], default: 'dpp'.
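The documented flags, options, and defaults can be summarized as an `argparse` definition. This is a sketch reconstructed from the descriptions above, not a copy of benchmark.py, whose actual argument handling may differ; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Argument parser mirroring the options and defaults listed in the README."""
    parser = argparse.ArgumentParser(description="Benchmark sketch for LogBatcher")
    parser.add_argument("--data_type", choices=["2k", "full"], default="2k",
                        help="dataset type")
    parser.add_argument("--model", default="gpt-3.5-turbo-0125",
                        help="large language model to query")
    parser.add_argument("--batch_size", type=int, default=10,
                        help="size of a batch query")
    parser.add_argument("--chunk_size", type=int, default=2000,
                        help="size of a log chunk")
    parser.add_argument("--clustering_method",
                        choices=["dbscan", "meanshift", "hierarchical"],
                        default="dbscan",
                        help="clustering method used in the partitioning stage")
    parser.add_argument("--sampling_method",
                        choices=["dpp", "similar", "random"], default="dpp",
                        help="sampling method used in the batching stage")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Running the sketch with no arguments prints the defaults, which corresponds to benchmarking the 2k datasets with the settings listed above.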
Benchmark Evaluation
To evaluate the benchmark output, run the following command:
cd evaluation && python logbatcher_eval.py --config logbatcher_2k
The results should be similar to those presented in the paper; see also experimental_results.
The description of the arguments:
--config
The folder name of the outputs, Options: ['test', 'logbatcher_2k', 'logbatcher_full']
--data_type
Datasets type, Options: ['2k', 'full'], default: '2k'
--dataset
Name of a single dataset to evaluate, default: 'null'.