A tool for batch parsing of log files

Project description

LogBatcher

LogBatcher is a cost-effective LLM-based log parser that requires no training process or labeled data. This repository includes artifacts for reuse and reproduction of experimental results presented in our ASE'24 paper titled "Demonstration-Free: Towards More Practical Log Parsing with Large Language Models".

Work Flow

workflow Log Batcher contians three main components: Partitioning, Caching and Batching - Querying

Table of Contents

Setup
- Get start
- Project Tree
Usage
Benchmark

Setup

Get start

To run at the local environment:

Git Clone LogBatcher from github

git clone https://github.com/LogIntelligence/LogBatcher.git && cd LogBatcher

The code is implemented in Python 3.8. To install the required packages, run the following command (conda is optional):

conda create -n logbatcher python==3.8
conda activate logbatcher
pip install -r requirements.txt

Upload your OpenAI API Key in config.json:

{
    "api_key_from_openai": "Your API Key from OpenAI"
}

To run with docker:

Download docker image from zenodo

Docker image DOI:

Running the following command

docker build -t logbatcher .
docker run -it logbatcher

Project Tree

📦LogBatcher
 ┣ 📂datasets
 ┃ ┣ 📂loghub-2k
 ┃ ┃ ┣ 📂Android
 ┃ ┃ ┃ ┣ 📜Android_2k.log
 ┃ ┃ ┃ ┣ 📜Android_2k.log_structured.csv
 ┃ ┃ ┃ ┣ 📜Android_2k.log_templates.csv
 ┃ ┃ ┃ ┣ 📜Android_2k.log_structured_corrected.csv
 ┃ ┃ ┃ ┗ 📜Android_2k.log_templates_corrected.csv
 ┃ ┃ ┣ ...
 ┃ ┗ 📂loghub-2.0
 ┣ 📂evaluation
 ┃ ┣ 📂utils
 ┃ ┣ 📜logbatcher_eval.py
 ┃ ┗ 📜settings.py
 ┣ 📂logbatcher
 ┃ ┣ 📜additional_cluster.py
 ┃ ┣ 📜cluster.py
 ┃ ┣ 📜parser.py
 ┃ ┣ 📜matching.py
 ┃ ┣ 📜parsing_base.py
 ┃ ┣ 📜postprocess.py
 ┃ ┣ 📜sample.py
 ┃ ┗ 📜util.py
 ┣ 📂outputs
 ┃ ┣ 📂figures
 ┃ ┗ 📂parser
 ┣ 📜README.md
 ┣ 📜benchmark.py
 ┣ 📜config.json
 ┣ 📜requirements.txt
 ┗ 📜demo.py

Usage

Data format

LogBatcher mainly takes a raw log file (in plain text format) as input and outputs the parsed log file (in CSV format). A raw log file is a log file with each line representing a complete log.

Following the data format from LOGPAI, the data can also be a structured log file. A structured log file is a CSV file that includes at least the LineID and Content columns for parsing, with optional EventID and EventTemplate columns for evaluation.

Usage example

We provide a usage example for more convenient reuse, which is presented as follows. The example provides a test on a specific dataset Apache from LOGPAI. If you want to evaluate LogBatcher on your own dataset, please replace the arguments file_name and dataset_format with your own raw log file path to load log data and the corresponding dataset format to extract the contents. The results can be found in outputs/parser/test folder.

import json
from logbatcher.parsing_base import single_dataset_paring
from logbatcher.parser import Parser
from logbatcher.util import data_loader

# load api key, dataset format and parser
model, dataset, folder_name ='gpt-3.5-turbo-0125', 'Apache', 'test'
config = json.load(open('config.json', 'r'))
parser = Parser(model, folder_name, config)

# load contents from raw log file, structured log file or content list
contents = data_loader(
    file_name=f"datasets/loghub-2k/{dataset}/{dataset}_2k.log",
    dataset_format= config['datasets_format'][dataset],
    file_format ='raw'
)

# parse logs
single_dataset_paring(
    dataset=dataset,
    contents=contents,
    output_dir= f'outputs/parser/{folder_name}/',
    parser=parser,
    debug=False
)

Expected output

python demo.py
Parsing 2000 logs in dataset Apache...
100%|██████████████████████████████████| 2000/2000 [00:04<00:00, 420.55log/s]
parsing time: 4.756490230560303
idetified templates: 6

Example Evaluation

To evaluate the output of the usage example, run the following command

cd evaluation && python logbatcher_eval.py --config test --dataset Apache

Expected output

Calculating Edit Distance....
100%|███████████████████████████████████████████████████████████| 2000/2000 [00:00<00:00, 4029110.47it/s]
Normalized_Edit_distance (NED): 1.0000, ED: 0.0000,
Grouping Accuracy calculation done. [Time taken: 0.002]
Start compute grouping accuracy
100%|███████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 2084.64it/s]
Grouping_Accuracy (GA): 1.0000, FGA: 1.0000,
Grouping Accuracy calculation done. [Time taken: 0.006]
Parsing_Accuracy (PA): 1.0000
Parsing Accuracy calculation done. [Time taken: 0.001]
100%|███████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 10677.06it/s]
PTA: 1.0000, RTA: 1.0000 FTA: 1.0000
Identify : 6, Groundtruth : 6
Template-level accuracy calculation done. [Time taken: 0.003]

The results of evaluation metrics can be found in outputs/parser/test folder

Benchmark

Prepare datasets

We have already provided loghub-2k datasets in datasets/loghub-2.0 folder.

if you want to benchmark on Loghub-2.0 datasets, please Run datasets/loghub-2.0/download.sh or download the datasets:

Datasets DOI:
Datasets Homepage: Loghub-2.0

Reproduce

To benchmark on all datasets in loghub-2k or loghub-2.0, you can run the following commands:

python benchmark.py --data_type [DATATYPE] --model [MODEL] --batch_size [BATCHSIZE] --chunk_size [CHUNKSIZE] --sampling_method [SAMPLINGMETHOD]

The description of the arguments can be found in benchmark.py or below:

--data_type
  Datasets type, Options: ['2k', 'full'], default: '2k'.
--model
  the Large Lauguage model used in LogBatcher, default: 'gpt-3.5-turbo-0125'.
--batch_size
  size of a batch query, default: 10.
--chunk_size
  size of a log chunk, default: 2000.
--clustering_method
  clustering method used in the partitioning stage, Options: ['dbscan', 'meanshift', 'hierarchical'], default: 'dbscan'.
--sampling_method
  sampling method used in the batching stage, Options: ['dpp', 'similar', 'random'], default: 'dpp'.

Benchmark Evaluation

To evaluate the output of benchmark, run the following command

cd evaluation && python logbatcher_eval.py --config logbatcher_2k

The expected results will be similar with that presented in the paper, also see experimental_results.

The description of the arguments:

--config
  The folder name of the outputs, Options: ['test', 'logbatcher_2k', 'logbatcher_full']
--data_type
  Datasets type, Options: ['2k', 'full'], default: '2k'
--dataset
  To evaluate on a single dataset, default: 'null'.

Project details

Release history Release notifications | RSS feed

This version

0.1.1

Aug 29, 2024

0.1.0

Aug 29, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

logbatcher-0.1.1-py3-none-any.whl (19.1 kB view details)

Uploaded Aug 29, 2024 Python 3

File details

Details for the file logbatcher-0.1.1-py3-none-any.whl.

File metadata

Download URL: logbatcher-0.1.1-py3-none-any.whl
Upload date: Aug 29, 2024
Size: 19.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for logbatcher-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7a937ee1d2e3dbd2203fcc16e85d4f18a0b31be9576e48944241c4edc4fb317a`
MD5	`aaa8ea80e8573b59cd4fc898250e5614`
BLAKE2b-256	`dc95cfc7c288c8b880174ac92de5bf33d43a802a55da2b27954df1aa10b0fec4`

See more details on using hashes here.

logbatcher 0.1.1

Navigation

Verified details

Maintainers

Unverified details