Reproducible log anomaly detection pipelines, from raw logs to deterministic, template-mapped sequences
AnomaLog
An orchestration-driven research framework for reproducible log anomaly detection pipelines. Converts raw logs into deterministic, template-mapped sequences ready for controlled detector experiments.
Built on Prefect, AnomaLog emphasises end-to-end reproducibility from raw log ingestion to model-ready sequences.
Motivation
Many log anomaly detection implementations focus primarily on modelling techniques while omitting the full preprocessing pipeline. Parsing details are often described but not fully reproducible from code, and experiments frequently rely on preprocessed datasets without documenting raw log handling.
“The same dataset” is not always the same once parsing choices, windowing rules, entity grouping, and leakage controls are considered.
AnomaLog provides a cache-aware, pipeline-first framework that treats log preprocessing as a first-class research artifact. Each stage (raw ingestion → parsing → template mining → sequencing) is a modular, reproducible component rather than a one-off script with hidden assumptions.
This enables controlled ablation studies, fair model comparisons, and fully repeatable experiments from raw logs. Researchers can focus on modelling choices rather than reverse-engineering preprocessing and experiment glue.
Key Features
- Deterministic pipeline execution. Workflow stages are fingerprinted and cached so only modified components are recomputed.
- Protocol-driven modularity. All preprocessing stages implement explicit protocol interfaces, enabling parsers, template miners (e.g. Drain3), and sequencing strategies to be swapped without altering downstream logic.
- Explicit sequencing strategies. Entity-based, fixed-length, and time-windowed sequences are built with deterministic split controls.
- Dataset-first workflows. Built-in benchmark presets and custom datasets share the same public interface.
- Scalable, artifact-first storage. Structured events are persisted in Parquet by default so expensive parsing can be reused.
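The fingerprint-and-cache idea behind deterministic execution can be illustrated with a small standard-library sketch. This is not AnomaLog's implementation (which builds on Prefect); `stage_fingerprint` and `StageCache` are hypothetical names, and the fingerprint is assumed to be a SHA-256 over the stage name, its parameters, and the upstream fingerprint.

```python
import hashlib
import json
from typing import Any, Callable


def stage_fingerprint(name: str, params: dict[str, Any], upstream: str) -> str:
    """Hash a stage's identity: name, parameters, and upstream fingerprint."""
    payload = json.dumps(
        {"name": name, "params": params, "upstream": upstream},
        sort_keys=True,  # key order must not change the fingerprint
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class StageCache:
    """In-memory cache keyed by fingerprint (hypothetical; a real pipeline would persist results)."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}
        self.computed: list[str] = []  # names of stages that actually ran

    def run(
        self,
        name: str,
        params: dict[str, Any],
        upstream: str,
        fn: Callable[[], Any],
    ) -> tuple[str, Any]:
        key = stage_fingerprint(name, params, upstream)
        if key not in self._results:  # recompute only on a cache miss
            self._results[key] = fn()
            self.computed.append(name)
        return key, self._results[key]


def window(rows: list[list[str]], size: int) -> list[list[list[str]]]:
    """Split parsed rows into consecutive fixed-length windows."""
    return [rows[i : i + size] for i in range(0, len(rows), size)]


cache = StageCache()
raw_lines = ["INFO block 1 ok", "WARN block 2 slow"]

# First run: both stages execute.
parse_key, parsed = cache.run(
    "parse", {"sep": " "}, "raw-v1", lambda: [line.split(" ") for line in raw_lines]
)
seq_key, _ = cache.run("sequence", {"window": 2}, parse_key, lambda: window(parsed, 2))

# Second run with only the sequencing parameter changed: parsing is reused.
parse_key2, parsed2 = cache.run(
    "parse", {"sep": " "}, "raw-v1", lambda: [line.split(" ") for line in raw_lines]
)
seq_key2, _ = cache.run("sequence", {"window": 1}, parse_key2, lambda: window(parsed2, 1))
```

Because each fingerprint includes its upstream fingerprint, changing an early stage invalidates everything downstream, while changing a late stage leaves earlier results untouched.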
Research Usage
Unlike model-centric repositories that assume preprocessed inputs, AnomaLog makes preprocessing part of the research surface. A typical workflow is:
- Materialise a templated dataset (raw → structured → templates).
- Generate deterministic sequences under an explicit split protocol.
- Plug in any detector that consumes TemplateSequence.
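The last step of the workflow above, plugging in a detector, might look roughly like the following sketch. `ToySequence` is a hypothetical stand-in for `TemplateSequence`, whose actual fields are not documented here, and the `Detector` protocol with `fit` and `score` is an assumed interface rather than one AnomaLog defines.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class ToySequence:
    """Hypothetical stand-in for TemplateSequence; real fields may differ."""

    entity_id: str
    template_ids: tuple[int, ...]


class Detector(Protocol):
    """Assumed detector interface: fit on training sequences, score any sequence."""

    def fit(self, train: Iterable[ToySequence]) -> None: ...

    def score(self, seq: ToySequence) -> float: ...


class UnseenTemplateDetector:
    """Scores a sequence by the fraction of its templates never seen in training."""

    def __init__(self) -> None:
        self._seen: set[int] = set()

    def fit(self, train: Iterable[ToySequence]) -> None:
        for seq in train:
            self._seen.update(seq.template_ids)

    def score(self, seq: ToySequence) -> float:
        if not seq.template_ids:
            return 0.0
        unseen = sum(1 for t in seq.template_ids if t not in self._seen)
        return unseen / len(seq.template_ids)


detector: Detector = UnseenTemplateDetector()
detector.fit([ToySequence("blk_1", (1, 2, 3)), ToySequence("blk_2", (2, 3))])

normal_score = detector.score(ToySequence("blk_3", (1, 2)))      # all templates seen
odd_score = detector.score(ToySequence("blk_4", (1, 9, 9, 2)))   # two of four unseen
```

Any model that satisfies the same two-method shape could be swapped in without touching the preprocessing stages.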
Determinism is a property of the pipeline, not the random number generator. Event ordering is defined by the default dataset backend and preserved through sequencing, so train/test splits are reproducible across runs without random seeds.
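from anomalog import SplitLabel
from anomalog.presets import bgl

# Materialise the BGL preset (raw -> structured -> templates).
dataset = bgl.build()
# Group events into per-entity sequences with a train fraction of 0.2.
sequence_view = dataset.group_by_entity().with_train_fraction(0.2)

for seq in sequence_view:
    if seq.split_label == SplitLabel.TRAIN:
        ...  # e.g. feed training sequences to a detector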
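To make the "no random seeds" point concrete, here is one way a split can be derived purely from event order. This is an illustrative sketch, not a description of how `with_train_fraction` is implemented; `ToySplit` and `split_by_order` are hypothetical names, and the rule assumed here is "the first fraction of entities, by first appearance, go to train".

```python
from enum import Enum


class ToySplit(Enum):
    """Hypothetical split label used only in this sketch."""

    TRAIN = "train"
    TEST = "test"


def split_by_order(
    events: list[tuple[str, int]], train_fraction: float
) -> dict[str, ToySplit]:
    """Assign entities to TRAIN/TEST by order of first appearance; no RNG involved."""
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    # dict.fromkeys de-duplicates while preserving first-seen order.
    ordered = list(dict.fromkeys(entity for entity, _ in events))
    n_train = int(len(ordered) * train_fraction)
    return {
        entity: ToySplit.TRAIN if i < n_train else ToySplit.TEST
        for i, entity in enumerate(ordered)
    }


# (entity_id, template_id) pairs in the order the backend yields them.
events = [("blk_a", 1), ("blk_b", 4), ("blk_a", 2), ("blk_c", 1), ("blk_d", 3)]

split_labels = split_by_order(events, 0.5)
repeat_labels = split_by_order(events, 0.5)  # identical on every call
```

The result depends only on the input order and the fraction, so two runs over the same ordered events always agree.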
Custom Dataset Definition
To add a dataset, define a DatasetSpec by specifying the source, structured parser, optional label alignment, and template parser. This makes dataset provenance and preprocessing assumptions explicit and versionable.
from pathlib import Path
from anomalog import DatasetSpec
from anomalog.labels import CSVReader
from anomalog.parsers import HDFSV1Parser
from anomalog.sources import LocalZipSource
dataset = (
    DatasetSpec("my-hdfs")
    .from_source(LocalZipSource(Path("HDFS_v1.zip"), raw_logs_relpath=Path("HDFS.log")))
    .parse_with(HDFSV1Parser())
    .label_with(
        CSVReader(
            relative_path=Path("preprocessed/anomaly_label.csv"),
            entity_column="BlockId",
            label_column="Label",
        ),
    )
    .build()
)
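Conceptually, label alignment of this kind maps each entity identifier to a label read from a CSV file. The sketch below shows that idea with the standard library only; it is not how `CSVReader` is implemented, `load_entity_labels` is a hypothetical helper, and the sample rows and the `"Anomaly"` marker value are illustrative assumptions.

```python
import csv
import io
from typing import TextIO


def load_entity_labels(
    handle: TextIO,
    entity_column: str,
    label_column: str,
    anomalous_value: str = "Anomaly",  # assumed marker for anomalous entities
) -> dict[str, bool]:
    """Return {entity_id: is_anomalous} from a CSV with a header row."""
    reader = csv.DictReader(handle)
    missing = {entity_column, label_column} - set(reader.fieldnames or [])
    if missing:
        raise KeyError(f"missing CSV columns: {sorted(missing)}")
    return {row[entity_column]: row[label_column] == anomalous_value for row in reader}


# Illustrative label file contents (not real dataset rows).
sample = io.StringIO("BlockId,Label\nblk_1,Normal\nblk_2,Anomaly\n")
entity_labels = load_entity_labels(sample, "BlockId", "Label")

# Align entity-level labels with individual structured events.
events = [
    ("blk_1", "Receiving block"),
    ("blk_2", "Exception in receiveBlock"),
    ("blk_1", "PacketResponder terminating"),
]
labelled = [(entity, message, entity_labels[entity]) for entity, message in events]
```

Keeping the column names as explicit parameters is what makes the labelling assumption visible and versionable alongside the rest of the dataset definition.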
Built-in Presets
from anomalog.presets import bgl, hdfs_v1
bgl_dataset = bgl.build()
hdfs_dataset = hdfs_v1.build()
Preprocessing Ablation Studies
Preprocessing decisions such as the template miner, label alignment, and grouping strategy can be treated as experimental variables rather than hidden implementation details.
from anomalog.parsers import Drain3Parser, IdentityTemplateParser
from anomalog.presets import bgl
drain_dataset = bgl.template_with(Drain3Parser).build()
identity_dataset = bgl.template_with(IdentityTemplateParser).build()
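Why the template miner matters as an experimental variable can be seen even on toy data. The sketch below compares an identity templater with a simple regex number masker; the masker is a deliberately crude stand-in and not Drain3, and the log messages are made up for illustration.

```python
import re
from typing import Callable

_NUMBER = re.compile(r"\d+")


def identity_template(message: str) -> str:
    """Treat every distinct message as its own template."""
    return message


def masked_template(message: str) -> str:
    """Replace runs of digits with a wildcard (a crude stand-in for a template miner)."""
    return _NUMBER.sub("<*>", message)


def count_templates(templater: Callable[[str], str], messages: list[str]) -> int:
    """Number of distinct templates a templater produces over the messages."""
    return len({templater(m) for m in messages})


# Illustrative messages, not taken from a real dataset.
messages = [
    "Received block 101 of size 67108864",
    "Received block 102 of size 67108864",
    "Deleting block 101",
    "Deleting block 103",
]

variants = {"identity": identity_template, "masked": masked_template}
template_counts = {
    name: count_templates(fn, messages) for name, fn in variants.items()
}
```

The same four messages yield four templates under the identity mapping and two under masking, so any downstream detector sees a different vocabulary depending on this single preprocessing choice.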
Download files
File details
Details for the file anomalog-0.2.0.tar.gz.
File metadata
- Download URL: anomalog-0.2.0.tar.gz
- Upload date:
- Size: 28.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f11a79cacf1c1df47a9670bd66fdc0fbad30d127d6fe68bb4648d91def5415e9 |
| MD5 | 0ee1a11fdb0f2a60ce47db7f9009caf8 |
| BLAKE2b-256 | 8daf0b66eb6259b7ff4b0da95a5a065c4de40291cfd907da1816b5f0d16120ad |
File details
Details for the file anomalog-0.2.0-py3-none-any.whl.
File metadata
- Download URL: anomalog-0.2.0-py3-none-any.whl
- Upload date:
- Size: 40.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2891054999001cf9c0f2a61e6c1dfb0ae14d90c6581c5febff151526d76a0080 |
| MD5 | 6342211a2bdf7760aba8a54ec0837155 |
| BLAKE2b-256 | ad3ab0102af32580949d204de9a97f3e599b6675b9fa5f35dd8dec39a65f6a26 |