A tool for batch parsing of log files

LogBatcher

LogBatcher is a cost-effective LLM-based log parser that requires no training process or labeled data. This repository includes artifacts for reuse and reproduction of experimental results presented in our ASE'24 paper titled "Demonstration-Free: Towards More Practical Log Parsing with Large Language Models".

Work Flow

LogBatcher contains three main components: Partitioning, Caching, and Batching-Querying.
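As a rough illustration of how these three stages could fit together, here is a toy sketch. It is not LogBatcher's implementation: grouping by token count stands in for the real clustering step, `fake_llm` is a hypothetical placeholder for the LLM query, and all function names are made up for this example.

```python
import re
from collections import defaultdict


def partition(logs):
    """Partitioning: group logs by token count (a crude stand-in for clustering)."""
    groups = defaultdict(list)
    for log in logs:
        groups[len(log.split())].append(log)
    return list(groups.values())


def template_to_regex(template):
    """Compile a template with <*> wildcards into a regex used for cache lookups."""
    literal_parts = (re.escape(part) for part in template.split("<*>"))
    return re.compile("(.+?)".join(literal_parts))


def fake_llm(batch):
    """Hypothetical stand-in for the LLM query: mask tokens that differ across the batch."""
    columns = zip(*(log.split() for log in batch))
    return " ".join(col[0] if len(set(col)) == 1 else "<*>" for col in columns)


def parse(logs, batch_size=10):
    cache = []   # (compiled regex, template) pairs found so far
    result = {}  # log message -> template
    queries = 0
    for group in partition(logs):
        # Caching: logs that match a known template need no query.
        pending = []
        for log in group:
            hit = next((t for rx, t in cache if rx.fullmatch(log)), None)
            if hit is None:
                pending.append(log)
            else:
                result[log] = hit
        # Batching and querying: send one batch, then reuse its template.
        while pending:
            batch, rest = pending[:batch_size], pending[batch_size:]
            template = fake_llm(batch)
            queries += 1
            rx = template_to_regex(template)
            cache.append((rx, template))
            for log in batch:
                result[log] = template
            pending = []
            for log in rest:
                if rx.fullmatch(log):
                    result[log] = template
                else:
                    pending.append(log)
    return result, queries


logs = [
    "user alice logged in",
    "user bob logged in",
    "disk full",
    "user carol logged in",
]
templates, n_queries = parse(logs, batch_size=2)
```

In this toy run the third "user ... logged in" message is resolved from the cache, so only two queries are issued for four logs.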

Setup

Get started

To run at the local environment:

Clone LogBatcher from GitHub:

git clone https://github.com/LogIntelligence/LogBatcher.git && cd LogBatcher

The code is implemented in Python 3.8. To install the required packages, run the following command (conda is optional):

conda create -n logbatcher python==3.8
conda activate logbatcher
pip install -r requirements.txt

Set your OpenAI API key in config.json:

{
    "api_key_from_openai": "Your API Key from OpenAI"
}
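Before running anything that calls the API, it can help to confirm that the placeholder has actually been replaced. The following is a minimal sketch, assuming the key name shown above; `read_api_key` is a hypothetical helper and is not part of LogBatcher.

```python
import json

PLACEHOLDER = "Your API Key from OpenAI"


def read_api_key(config):
    """Return the API key from a loaded config dict, or raise if it is unset."""
    key = config.get("api_key_from_openai", "").strip()
    if not key or key == PLACEHOLDER:
        raise ValueError("Set 'api_key_from_openai' in config.json first.")
    return key


def load_api_key(path="config.json"):
    """Read config.json from disk and return the validated key."""
    with open(path, "r", encoding="utf-8") as f:
        return read_api_key(json.load(f))
```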

To run with docker:

Download the Docker image from Zenodo:

Docker image DOI: DOI

Then run the following commands:

docker build -t logbatcher .
docker run -it logbatcher

Project Tree

📦LogBatcher
 ┣ 📂datasets
 ┃ ┣ 📂loghub-2k
 ┃ ┃ ┣ 📂Android
 ┃ ┃ ┃ ┣ 📜Android_2k.log
 ┃ ┃ ┃ ┣ 📜Android_2k.log_structured.csv
 ┃ ┃ ┃ ┣ 📜Android_2k.log_templates.csv
 ┃ ┃ ┃ ┣ 📜Android_2k.log_structured_corrected.csv
 ┃ ┃ ┃ ┗ 📜Android_2k.log_templates_corrected.csv
 ┃ ┃ ┣ ...
 ┃ ┗ 📂loghub-2.0
 ┣ 📂evaluation
 ┃ ┣ 📂utils
 ┃ ┣ 📜logbatcher_eval.py
 ┃ ┗ 📜settings.py
 ┣ 📂logbatcher
 ┃ ┣ 📜additional_cluster.py
 ┃ ┣ 📜cluster.py
 ┃ ┣ 📜parser.py
 ┃ ┣ 📜matching.py
 ┃ ┣ 📜parsing_base.py
 ┃ ┣ 📜postprocess.py
 ┃ ┣ 📜sample.py
 ┃ ┗ 📜util.py
 ┣ 📂outputs
 ┃ ┣ 📂figures
 ┃ ┗ 📂parser
 ┣ 📜README.md
 ┣ 📜benchmark.py
 ┣ 📜config.json
 ┣ 📜requirements.txt
 ┗ 📜demo.py

Usage

Data format

LogBatcher mainly takes a raw log file (in plain text format) as input and outputs the parsed log file (in CSV format). A raw log file is a log file with each line representing a complete log.

Following the data format from LOGPAI, the data can also be a structured log file. A structured log file is a CSV file that includes at least the LineID and Content columns for parsing, with optional EventID and EventTemplate columns for evaluation.
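To make the two input formats concrete, here is a minimal sketch of reading each one with the standard library. The helper names `read_raw_contents` and `read_structured_contents` are hypothetical and are not part of LogBatcher, which ships its own `data_loader`. The sketch only checks for the Content column; verify the exact header spelling in your own file.

```python
import csv
import io


def read_raw_contents(stream):
    """Raw format: one complete log message per line; blank lines are skipped."""
    return [line.rstrip("\r\n") for line in stream if line.strip()]


def read_structured_contents(stream, content_column="Content"):
    """Structured format: a CSV whose content column holds the log messages."""
    reader = csv.DictReader(stream)
    if content_column not in (reader.fieldnames or []):
        raise ValueError(f"missing required column: {content_column}")
    return [row[content_column] for row in reader]


# In-memory examples; for real files use open(path, newline="") instead.
raw = io.StringIO("jk2_init() Found child 1 in scoreboard slot 2\n\nworkerEnv.init() ok\n")
structured = io.StringIO('LineID,Content\n1,hello world\n2,"a, b"\n')

raw_contents = read_raw_contents(raw)
structured_contents = read_structured_contents(structured)
```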

Usage example

We provide a usage example for convenient reuse, shown below. It runs LogBatcher on the Apache dataset from LOGPAI. To parse your own dataset, replace the file_name and dataset_format arguments with the path to your raw log file and the dataset format used to extract the contents. The results are written to the outputs/parser/test folder.

import json

from logbatcher.parser import Parser
from logbatcher.parsing_base import single_dataset_paring
from logbatcher.util import data_loader

# choose the model, the dataset and the output folder name
model, dataset, folder_name = 'gpt-3.5-turbo-0125', 'Apache', 'test'

# load the API key and dataset formats, then build the parser
with open('config.json', 'r') as f:
    config = json.load(f)
parser = Parser(model, folder_name, config)

# load contents from a raw log file, structured log file or content list
contents = data_loader(
    file_name=f"datasets/loghub-2k/{dataset}/{dataset}_2k.log",
    dataset_format=config['datasets_format'][dataset],
    file_format='raw'
)

# parse logs and write the results to the output directory
single_dataset_paring(
    dataset=dataset,
    contents=contents,
    output_dir=f'outputs/parser/{folder_name}/',
    parser=parser,
    debug=False
)
Expected output:

python demo.py
Parsing 2000 logs in dataset Apache...
100%|██████████████████████████████████| 2000/2000 [00:04<00:00, 420.55log/s]
parsing time: 4.756490230560303
idetified templates: 6

Example Evaluation

To evaluate the output of the usage example, run the following command:

cd evaluation && python logbatcher_eval.py --config test --dataset Apache
Expected output:

Calculating Edit Distance....
100%|███████████████████████████████████████████████████████████| 2000/2000 [00:00<00:00, 4029110.47it/s]
Normalized_Edit_distance (NED): 1.0000, ED: 0.0000,
Grouping Accuracy calculation done. [Time taken: 0.002]
Start compute grouping accuracy
100%|███████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 2084.64it/s]
Grouping_Accuracy (GA): 1.0000, FGA: 1.0000,
Grouping Accuracy calculation done. [Time taken: 0.006]
Parsing_Accuracy (PA): 1.0000
Parsing Accuracy calculation done. [Time taken: 0.001]
100%|███████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 10677.06it/s]
PTA: 1.0000, RTA: 1.0000 FTA: 1.0000
Identify : 6, Groundtruth : 6
Template-level accuracy calculation done. [Time taken: 0.003]

The evaluation metrics are written to the outputs/parser/test folder.
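For readers unfamiliar with the metrics in the output above, here is a small sketch of Parsing Accuracy (PA) and Grouping Accuracy (GA) as they are commonly defined in log-parsing benchmarks. This is an assumption about the definitions, not a copy of logbatcher_eval.py, which may compute them differently; the function names are hypothetical.

```python
from collections import defaultdict


def parsing_accuracy(predicted, truth):
    """PA: fraction of logs whose predicted template equals the ground truth."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def grouping_accuracy(predicted, truth):
    """GA: fraction of logs in predicted groups that exactly match a true group."""
    pred_groups, true_groups = defaultdict(set), defaultdict(set)
    for index, (p, t) in enumerate(zip(predicted, truth)):
        pred_groups[p].add(index)
        true_groups[t].add(index)
    true_sets = {frozenset(members) for members in true_groups.values()}
    correct = sum(
        len(members)
        for members in pred_groups.values()
        if frozenset(members) in true_sets
    )
    return correct / len(truth)


# Four logs: the first two are grouped correctly but with a different
# template string, and the last two are wrongly merged into one group.
truth = ["A <*>", "A <*>", "B", "C"]
predicted = ["A <*> x", "A <*> x", "B", "B"]

pa = parsing_accuracy(predicted, truth)
ga = grouping_accuracy(predicted, truth)
```

The example shows why the two metrics can diverge: a group can be formed correctly even when its template text is wrong, and a correct template on one log does not rescue a merged group.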

Benchmark

Prepare datasets

We have already provided the loghub-2k datasets in the datasets/loghub-2k folder.

If you want to benchmark on the Loghub-2.0 datasets, run datasets/loghub-2.0/download.sh or download the datasets manually:

  1. Datasets DOI: DOI
  2. Datasets Homepage: Loghub-2.0

Reproduce

To benchmark on all datasets in loghub-2k or loghub-2.0, you can run the following commands:

python benchmark.py --data_type [DATATYPE] --model [MODEL] --batch_size [BATCHSIZE] --chunk_size [CHUNKSIZE] --sampling_method [SAMPLINGMETHOD]

The description of the arguments can be found in benchmark.py or below:

--data_type
  Dataset type, options: ['2k', 'full'], default: '2k'.
--model
  The large language model used in LogBatcher, default: 'gpt-3.5-turbo-0125'.
--batch_size
  Size of a batch query, default: 10.
--chunk_size
  Size of a log chunk, default: 2000.
--clustering_method
  Clustering method used in the partitioning stage, options: ['dbscan', 'meanshift', 'hierarchical'], default: 'dbscan'.
--sampling_method
  Sampling method used in the batching stage, options: ['dpp', 'similar', 'random'], default: 'dpp'.
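The options above could be declared with argparse roughly as follows. This is a sketch that mirrors the documented flags, choices and defaults; the actual parser in benchmark.py may be written differently, and `build_arg_parser` is a hypothetical name.

```python
import argparse


def build_arg_parser():
    """Build a parser mirroring the documented benchmark options."""
    parser = argparse.ArgumentParser(description="Benchmark options (illustrative)")
    parser.add_argument("--data_type", choices=["2k", "full"], default="2k")
    parser.add_argument("--model", default="gpt-3.5-turbo-0125")
    parser.add_argument("--batch_size", type=int, default=10)
    parser.add_argument("--chunk_size", type=int, default=2000)
    parser.add_argument(
        "--clustering_method",
        choices=["dbscan", "meanshift", "hierarchical"],
        default="dbscan",
    )
    parser.add_argument(
        "--sampling_method",
        choices=["dpp", "similar", "random"],
        default="dpp",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
defaults = build_arg_parser().parse_args([])
custom = build_arg_parser().parse_args(["--data_type", "full", "--batch_size", "5"])
```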

Benchmark Evaluation

To evaluate the output of the benchmark, run the following command:

cd evaluation && python logbatcher_eval.py --config logbatcher_2k

The results should be similar to those presented in the paper; see also experimental_results.

The description of the arguments:

--config
  Folder name of the outputs, options: ['test', 'logbatcher_2k', 'logbatcher_full'].
--data_type
  Dataset type, options: ['2k', 'full'], default: '2k'.
--dataset
  Name of a single dataset to evaluate, default: 'null'.
