
AnomaLog


An orchestration-driven research framework for reproducible log anomaly detection pipelines. Converts raw logs into deterministic, template-mapped sequences ready for controlled detector experiments.

Built on Prefect, AnomaLog emphasises end-to-end reproducibility from raw log ingestion to model-ready sequences.

Motivation

Many log anomaly detection implementations focus primarily on modelling techniques while omitting the full preprocessing pipeline. Parsing details are often described but not fully reproducible from code, and experiments frequently rely on preprocessed datasets without documenting raw log handling.

“The same dataset” is not always the same once parsing choices, windowing rules, entity grouping, and leakage controls are considered.

AnomaLog provides a cache-aware, pipeline-first framework that treats log preprocessing as a first-class research artifact. Every stage in the chain (raw ingestion → parsing → template mining → sequencing) is modular and reproducible, rather than a one-off script with hidden assumptions.

This enables controlled ablation studies, fair model comparisons, and fully repeatable experiments from raw logs. Researchers can focus on modelling choices rather than reverse-engineering preprocessing and experiment glue.

Key Features

  • Deterministic pipeline execution. Workflow stages are fingerprinted and cached so only modified components are recomputed.

  • Protocol-driven modularity. All preprocessing stages implement explicit protocol interfaces, enabling parsers, template miners (e.g. Drain3), and sequencing strategies to be swapped without altering downstream logic.

  • Explicit sequencing strategies. Entity-based, fixed-length, and time-windowed sequences are built with deterministic split controls.

  • Dataset-first workflows. Built-in benchmark presets and custom datasets share the same public interface.

  • Scalable, artifact-first storage. Structured events are persisted in Parquet by default so expensive parsing can be reused.
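The protocol-driven and fingerprinting ideas above can be sketched in plain Python. The names here (`TemplateMiner`, `stage_fingerprint`, the two toy miners) are illustrative stand-ins, not AnomaLog's actual interfaces: any object satisfying the structural protocol can be swapped in without touching downstream logic, and a hash over the stage's configuration gives a deterministic cache key.

```python
import hashlib
import json
import re
from typing import Protocol


class TemplateMiner(Protocol):
    """Illustrative stage interface; AnomaLog's real protocols may differ."""

    def mine(self, message: str) -> str: ...


class IdentityMiner:
    """Trivial miner: every raw message is its own template."""

    def mine(self, message: str) -> str:
        return message


class DigitMaskingMiner:
    """Toy miner: mask digit runs so 'block 42' and 'block 7' share a template."""

    def mine(self, message: str) -> str:
        return re.sub(r"\d+", "<*>", message)


def stage_fingerprint(stage_name: str, params: dict) -> str:
    """Deterministic cache key: hash of the stage name plus its sorted config."""
    payload = json.dumps({"stage": stage_name, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


# Either miner satisfies the protocol, so downstream stages stay unchanged.
miner: TemplateMiner = DigitMaskingMiner()
assert miner.mine("freeing block 42") == "freeing block <*>"
```

Because the fingerprint depends only on the stage's declared configuration, an unchanged stage hashes to the same key on every run, which is exactly what lets a cache skip recomputation.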

Research Usage

Unlike model-centric repositories that assume preprocessed inputs, AnomaLog makes preprocessing part of the research surface. A typical workflow is:

  1. Materialise a templated dataset (raw → structured → templates).
  2. Generate deterministic sequences under an explicit split protocol.
  3. Plug in any detector that consumes TemplateSequence.

Determinism is a property of the pipeline, not the random number generator. Event ordering is defined by the default dataset backend and preserved through sequencing. This allows for reproducible train/test splits across runs without requiring random seeds.

from anomalog import SplitLabel
from anomalog.presets import bgl

dataset = bgl.build()
sequence_view = dataset.group_by_entity().with_train_fraction(0.2)

for seq in sequence_view:
    if seq.split_label == SplitLabel.TRAIN:
        ...
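The seed-free determinism described above can be illustrated outside AnomaLog. This is a sketch, not AnomaLog's actual split mechanism: deriving the split label from a stable hash of the entity ID gives every run the identical train/test assignment with no random number generator involved.

```python
import hashlib


def split_label(entity_id: str, train_fraction: float = 0.2) -> str:
    """Map an entity ID to TRAIN/TEST deterministically via a stable hash."""
    digest = hashlib.sha256(entity_id.encode()).digest()
    # First 8 bytes as a fraction in [0, 1): same input, same bucket, every run.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "TRAIN" if bucket < train_fraction else "TEST"


labels = {eid: split_label(eid) for eid in ("blk_1", "blk_2", "blk_3")}
# Re-running reproduces identical labels: no seed to record or replay.
assert labels == {eid: split_label(eid) for eid in ("blk_1", "blk_2", "blk_3")}
```

Note the contrast with seeded shuffling: a seed only makes randomness replayable, while a content-derived assignment is stable even if entities are added, reordered, or processed on a different machine.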

Custom Dataset Definition

To add a dataset, define a DatasetSpec by specifying the source, structured parser, optional label alignment, and template parser. This makes dataset provenance and preprocessing assumptions explicit and versionable.

from pathlib import Path

from anomalog import DatasetSpec
from anomalog.labels import CSVReader
from anomalog.parsers import HDFSV1Parser
from anomalog.sources import LocalZipSource

dataset = (
    DatasetSpec("my-hdfs")
    .from_source(LocalZipSource(Path("HDFS_v1.zip"), raw_logs_relpath=Path("HDFS.log")))
    .parse_with(HDFSV1Parser())
    .label_with(
        CSVReader(
            relative_path=Path("preprocessed/anomaly_label.csv"),
            entity_column="BlockId",
            label_column="Label",
        ),
    )
    .build()
)

Built-in presets

from anomalog.presets import bgl, hdfs_v1

bgl_dataset = bgl.build()
hdfs_dataset = hdfs_v1.build()

Preprocessing Ablation Studies

Preprocessing decisions such as the template miner, label alignment, and grouping strategy can be treated as experimental variables rather than hidden implementation details.

from anomalog.parsers import Drain3Parser, IdentityTemplateParser
from anomalog.presets import bgl

drain_dataset = bgl.template_with(Drain3Parser).build()
identity_dataset = bgl.template_with(IdentityTemplateParser).build()
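The effect of the template-miner choice is visible even with toy miners (stdlib only; these are illustrative stand-ins, not AnomaLog's parsers): an identity miner keeps every distinct message as its own template, while a masking miner collapses variable fields, shrinking the template vocabulary a downstream detector must model.

```python
import re

logs = [
    "receiving block blk_1 src 10.0.0.5",
    "receiving block blk_2 src 10.0.0.9",
    "deleting block blk_1",
]

# Identity mining: one template per distinct raw message.
identity_templates = set(logs)

# Masking mining: replace any token containing a digit with a wildcard.
masked_templates = {re.sub(r"\S*\d\S*", "<*>", line) for line in logs}

assert len(identity_templates) == 3
assert masked_templates == {"receiving block <*> src <*>", "deleting block <*>"}
```

Holding everything else in the pipeline fixed and varying only this stage is exactly the kind of controlled ablation the preset `template_with` hook enables.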
