Annotate generated data artifacts

data-annotations

A Python package for attaching provenance and structured descriptions to the files and directories your workflows produce.

It is designed for lightweight research and reproducibility pipelines where you want generated datasets, tables, plots, or reports to carry enough context to explain where they came from and what they contain.

The package captures common provenance automatically and writes plain JSON and Markdown artifacts that are easy to inspect or archive. The canonical on-disk format uses one JSON annotation document per artifact:

Files use artifact.ext.annotation.json
Directories carry data-annotations.json at their root

Each annotation document stores four top-level sections:

annotation_version
subject
provenance
description

Here's the mental model: files get a visible sibling annotation, and directories carry one visible annotation at their root. Treat the annotation as part of the research output bundle.
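
As a minimal sketch of the naming convention above, the helper below maps an artifact path to the location where its annotation document would live. The function name annotation_path_for and its is_directory flag are hypothetical and only illustrate the convention; they are not part of the package API.

```python
from pathlib import Path


def annotation_path_for(artifact: Path, is_directory: bool = False) -> Path:
    """Return where the annotation document for `artifact` would live.

    Hypothetical helper for illustration; not part of data-annotations.
    """
    if is_directory:
        # Directories carry one annotation document at their root.
        return artifact / "data-annotations.json"
    # Files get a visible sibling named artifact.ext.annotation.json.
    return artifact.with_name(artifact.name + ".annotation.json")


print(annotation_path_for(Path("outputs/participants.csv")))
print(annotation_path_for(Path("outputs/run-001"), is_directory=True))
```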

See the changelog for release history and upgrade-oriented notes.

Installation

Install the core library from PyPI with pip:

pip install data-annotations

Or add it to a project with uv:

uv add data-annotations

The command-line interface uses optional dependencies. Install the package with CLI support when you want to run data-annotations commands:

pip install "data-annotations[cli]"
uv add "data-annotations[cli]"

For development or unreleased source installs, install directly from GitLab:

uv add "data-annotations @ git+https://gitlab.com/ceda-unibas/tools/data-annotations.git"
pip install "data-annotations @ git+https://gitlab.com/ceda-unibas/tools/data-annotations.git"

Pin a source install to a particular release tag x.y.z with:

uv add "data-annotations @ git+https://gitlab.com/ceda-unibas/tools/data-annotations.git@x.y.z"

What gets captured automatically

Every annotation document includes provenance with:

A UTC creation timestamp
Hostname and username
The script path and command-line arguments
The script path relative to the Git repo root when it can be determined
Git commit, branch, dirty state, canonical repository remote, exact tags, and git describe output when available
The current SLURM_JOB_ID when available
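
To make the list above concrete, here is a rough standard-library sketch of how such runtime context could be gathered. It is not the package's implementation: the function collect_provenance and most key names are assumptions chosen for illustration (only git_sha and git_describe are names the README itself uses), and every Git value falls back to None when Git or a repository is unavailable.

```python
import getpass
import os
import socket
import subprocess
import sys
from datetime import datetime, timezone
from typing import Optional


def _git(*args: str) -> Optional[str]:
    """Run a Git command and return its trimmed output, or None on failure."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def _username() -> Optional[str]:
    """Return the current username, or None when it cannot be determined."""
    try:
        return getpass.getuser()
    except Exception:
        return None


def collect_provenance() -> dict:
    """Collect a provenance-like record (illustrative key names)."""
    status = _git("status", "--porcelain")
    script = sys.argv[0] if sys.argv and sys.argv[0] else None
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "hostname": socket.gethostname(),
        "username": _username(),
        "script_path": os.path.abspath(script) if script else None,
        "argv": list(sys.argv[1:]),
        "git_sha": _git("rev-parse", "HEAD"),
        "git_branch": _git("rev-parse", "--abbrev-ref", "HEAD"),
        # Any porcelain output means uncommitted changes are present.
        "git_dirty": bool(status) if status is not None else None,
        "git_describe": _git("describe", "--always"),
        "slurm_job_id": os.environ.get("SLURM_JOB_ID"),
    }
```

Calling collect_provenance() outside a Git checkout or a SLURM job still returns a complete record, with the unavailable entries set to None.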

You can also attach your own parameters, input file paths, and function names. Local filesystem paths in provenance are stored as absolute paths. URI-style inputs such as s3://... or https://... are preserved as provided. Git tags and git_describe are human-friendly hints only; git_sha remains the source of truth for reproducibility, matching, and source checkout.
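
The path rule above can be sketched as follows. The helper normalize_input is hypothetical, and the package's own scheme detection may differ; the sketch treats anything starting with scheme:// as a URI and resolves everything else to an absolute local path.

```python
import re
from pathlib import Path

# Matches a leading URI scheme such as "s3://" or "https://".
_URI_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def normalize_input(value: str) -> str:
    """Keep URI-style inputs as provided; make local paths absolute.

    Hypothetical helper for illustration. Pass URIs as strings: wrapping
    them in pathlib.Path would collapse the double slash.
    """
    if _URI_SCHEME.match(value):
        return value
    return str(Path(value).resolve())


print(normalize_input("s3://bucket/raw/participants.csv"))
print(normalize_input("data/raw/participants.csv"))
```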

Quick Start

The recommended way to annotate your data artifacts is to decorate pipeline functions that consume some inputs and parameters, then write those artifacts. This keeps the artifact-writing logic explicit while letting data-annotations capture provenance and emit sidecars automatically.

For example, here is a complete file-level annotation workflow using the record_file_annotation(...) decorator. Once write_participants is called, it automatically generates sidecars participants.csv.annotation.json and participants.csv.README.md. The JSON sidecar will contain provenance and description metadata, and the Markdown sidecar will have a human-friendly rendering of the description provided in the decorator.

from pathlib import Path

from data_annotations.annotations import record_file_annotation
from data_annotations.description import AllowedValue, FieldDefinition

@record_file_annotation(
    title="Participant Cohort",
    summary="Participant-level cohort assignments for the validation split.",
    fields=[
        FieldDefinition(
            name="participant_id",
            data_type="string",
            summary="Stable participant identifier.",
            required=True,
            nullable=False,
        ),
        FieldDefinition(
            name="group",
            data_type="string",
            summary="Assigned study group.",
            allowed_values=[
                AllowedValue(value="control"),
                AllowedValue(value="treatment"),
            ],
        ),
    ],
    primary_key=["participant_id"],
    artifact_kind="dataset",
    acquisition_context={"source": "Study A registry export"},
    generation_context={"pipeline": "baseline-v1"},
)
def write_participants(
    artifact_path: Path,
    input_path: Path,
    split: str,
) -> Path:
    participant_ids = [
        line.strip()
        for line in input_path.read_text(encoding="utf-8").splitlines()[1:]
        if line.strip()
    ]
    artifact_path.parent.mkdir(parents=True, exist_ok=True)
    artifact_path.write_text(
        "\n".join(
            [
                "participant_id,group,split",
                *[
                    f"{participant_id},control,{split}"
                    for participant_id in participant_ids
                ],
            ]
        )
        + "\n",
        encoding="utf-8",
    )
    return artifact_path

# Annotation sidecars are written automatically
# when the decorated function is called:
artifact_path = Path("outputs") / "participants.csv"
write_participants(
    artifact_path=artifact_path,
    input_path=Path("data/raw/participants.csv"),
    split="validation",
)

print(f"{artifact_path}.annotation.json")
print(f"{artifact_path}.README.md")

Decorator Contract

You write a normal Python function. The decorated function passes the original return value through unchanged.

For provenance-bearing decorators, recorded inputs are inferred from named function arguments such as input_path and input_paths. Those arguments should correspond to real data dependencies used inside the wrapped function.

For file decorators:

record_file_manifest(...)
record_file_annotation(...)
record_file_description(...)

Your function should:

accept one argument pointing at the output file path. By default this argument is named artifact_path, but you can change the expected name with artifact_path_arg=....
use any other normal Python arguments you need for the pipeline step.
for provenance-bearing decorators, use argument names listed in input_args for real upstream dependencies you want recorded as provenance inputs. By default those names are ("input_path", "input_paths").

Your function may return any value. File decorators do not inspect that return value. Returning the generated artifact_path is recommended because it is convenient for callers, but it is not required.
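
The contract above can be illustrated with a toy decorator. This is a sketch under stated assumptions, not how data-annotations is implemented: record_sidecar is a hypothetical name, and the sidecar it writes contains only a path and a SHA-256 checksum. The point is the shape of the contract: locate the output path by argument name, run the function, write the sidecar, and return the result untouched.

```python
import functools
import hashlib
import inspect
import json
from pathlib import Path


def record_sidecar(artifact_path_arg: str = "artifact_path"):
    """Toy decorator: write a JSON sidecar after the wrapped function runs."""

    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Resolve arguments by name so positional calls work too.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            result = func(*args, **kwargs)

            artifact = Path(bound.arguments[artifact_path_arg])
            sidecar = artifact.with_name(artifact.name + ".annotation.json")
            digest = hashlib.sha256(artifact.read_bytes()).hexdigest()
            document = {"subject": {"path": str(artifact), "sha256": digest}}
            sidecar.write_text(json.dumps(document, indent=2), encoding="utf-8")

            # The wrapped function's return value is not inspected or changed.
            return result

        return wrapper

    return decorator


@record_sidecar()
def write_greeting(artifact_path: Path, name: str) -> str:
    artifact_path.write_text(f"hello {name}\n", encoding="utf-8")
    return "done"
```

Calling write_greeting(Path("greeting.txt"), name="Ada") returns "done" and leaves greeting.txt.annotation.json next to the artifact.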

For directory decorators:

record_directory_manifest(...)
record_directory_annotation(...)
record_directory_description(...)

Your function should:

accept one argument pointing at the output directory. By default this argument is named output_dir, but you can change the expected name with output_dir_arg=....
return a materialized iterable, usually a list, describing the files that were produced in that directory.
prefer returning a list or tuple rather than a generator, since the decorator needs to iterate over the outputs to write sidecars.

Accepted directory return items are:

DocumentedArtifact when you want per-artifact title, summary, fields, keys, or missing-value metadata.
DocumentedArtifactGroup for record_directory_annotation(...) and record_directory_description(...) when many files share one title, summary, kind, and optional schema metadata.
ProducedFile when you only need path, kind, and optional precomputed hash.
ChildBundle when an annotated child directory should be referenced as its own independently shareable bundle.
(path, kind) tuples when path and artifact kind are enough.
plain path-like values when the artifact kind can default to "other".
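
For the two simplest item forms in the list above, normalization might look like the sketch below. The helpers normalize_item and normalize_outputs are hypothetical and cover only (path, kind) tuples and plain path-like values; the package's own model classes are deliberately left out. The sketch also shows why a materialized list is preferable to a generator: a generator can be consumed only once.

```python
import os
from pathlib import Path


def normalize_item(item) -> tuple:
    """Turn a (path, kind) tuple or a plain path-like value into (Path, kind)."""
    if isinstance(item, tuple):
        path, kind = item
        return Path(path), kind
    # Plain path-like values fall back to the default kind "other".
    return Path(os.fspath(item)), "other"


def normalize_outputs(items) -> list:
    """Materialize the iterable once so it can be reused safely afterwards."""
    return [normalize_item(item) for item in items]


outputs = normalize_outputs([("scores.csv", "dataset"), Path("summary.txt")])
for path, kind in outputs:
    print(path, kind)
```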

For provenance-bearing directory decorators, input_args works the same way as for file decorators: matching argument names are recorded as inputs, and the remaining bound arguments become provenance params.
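
The split between inputs and params can be sketched with inspect.signature. This is an illustration rather than the package's code: split_arguments is a hypothetical helper, and leaving the output-path argument out of params is an assumption made here for clarity.

```python
import inspect


def split_arguments(
    func,
    args,
    kwargs,
    input_args=("input_path", "input_paths"),
    output_arg="artifact_path",
):
    """Split a call's bound arguments into recorded inputs and params."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()

    inputs, params = [], {}
    for name, value in bound.arguments.items():
        if name == output_arg:
            # Assumption: the output location is neither an input nor a param.
            continue
        if name in input_args:
            # Accept a single path or a list/tuple of paths.
            if isinstance(value, (list, tuple)):
                inputs.extend(value)
            else:
                inputs.append(value)
        else:
            params[name] = value
    return inputs, params


def step(artifact_path, input_path, split, threshold=0.5):
    """Placeholder pipeline step used only to demonstrate the split."""


inputs, params = split_arguments(
    step,
    (),
    {"artifact_path": "out.csv", "input_path": "raw.csv", "split": "validation"},
)
print(inputs)
print(params)
```

Defaults are applied before the split, so an unpassed threshold still ends up in params.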

Here is another decorator pattern example with record_directory_annotation(...):

from pathlib import Path

from data_annotations.annotations import record_directory_annotation
from data_annotations.description import (
    DocumentedArtifact,
    DocumentedArtifactGroup,
    FieldDefinition,
)
from data_annotations.provenance import ProducedFile

@record_directory_annotation(
    title="Validation Outputs",
    summary="Directory-level documentation for the validation run outputs.",
    acquisition_context={"source": "Study A registry export"},
    generation_context={"pipeline": "baseline-v1"},
)
def build_outputs(
    output_dir: Path,
    input_path: Path,
    split: str,
):
    participant_ids = [
        line.strip()
        for line in input_path.read_text(encoding="utf-8").splitlines()[1:]
        if line.strip()
    ]
    output_dir.mkdir(parents=True, exist_ok=True)

    table_path = output_dir / "scores.csv"
    table_path.write_text(
        "\n".join(
            [
                "participant_id,score,split",
                *[
                    f"{participant_id},0.94,{split}"
                    for participant_id in participant_ids
                ],
            ]
        )
        + "\n",
        encoding="utf-8",
    )

    report_path = output_dir / "summary.txt"
    report_path.write_text(
        (
            f"Validated {len(participant_ids)} participants from "
            f"{input_path.name} for the {split} split.\n"
        ),
        encoding="utf-8",
    )

    plot_paths = []
    for day in ["2024-01-01", "2024-01-02", "2024-01-03"]:
        plot_path = output_dir / f"sma_{day}.png"
        plot_path.write_bytes(
            (
                f"plot placeholder for the SMA variable on {day}, "
                f"derived from {input_path.name}\n"
            ).encode("utf-8")
        )
        plot_paths.append(plot_path)

    return [
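        DocumentedArtifact(
            path=str(table_path),
            kind="dataset",
            title="Scores Table",
            fields=[
                FieldDefinition(
                    name="participant_id",
                    data_type="string",
                    summary="Stable participant identifier.",
                ),
                FieldDefinition(
                    name="score",
                    data_type="float",
                    summary="Validation score for the participant.",
                ),
                FieldDefinition(
                    name="split",
                    data_type="string",
                    summary="Dataset split the score belongs to.",
                ),
            ],
        ),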
        ProducedFile(path=str(report_path), kind="report"),
        DocumentedArtifactGroup(
            title="Daily SMA plots",
            summary="Plots of the same variable on different days.",
            kind="plot",
            paths=[str(path) for path in plot_paths],
            selector="sma_*.png",
        ),
    ]


output_dir = Path("outputs") / "run-001"
build_outputs(
    output_dir=output_dir,
    input_path=Path("data/raw/participants.csv"),
    split="validation",
)

print(output_dir / "data-annotations.json")
print(output_dir / "README.md")

The decorator and direct APIs write the same canonical document shape. If you need metadata to vary per call instead of staying fixed at decoration time, use annotate_file(...), annotate_directory(...), write_file_annotation(...), or write_directory_annotation(...) directly instead. See the example gallery in examples/ for runnable examples of all approaches.

When To Use Decorators Vs Direct Functions

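If a function is only a final serializer for already-prepared data, prefer the direct annotation and writer APIs. Such a function usually has no input_path or input_paths argument for a decorator to record, whereas the direct APIs let you attach inputs=[...] explicitly.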

Canonical Document Shape

File annotations store:

subject.path
subject.kind
subject.sha256
provenance.*
description.title
description.summary
description.fields
description.primary_key
description.missing_value_codes
description.acquisition_context
description.generation_context
description.description_updated_at

Directory annotations store:

subject.path
subject.produced_files[]
subject.child_bundles[]
subject.content_digest
provenance.*
description.title
description.summary
description.artifact_groups[]
description.artifacts[]
description.acquisition_context
description.generation_context
description.description_updated_at

Use description.artifact_groups[] when many files have the same meaning, and use description.artifacts[] only for file-specific notes, overrides, or schema. Groups are descriptive only. Integrity still lives in subject.produced_files[], which tracks every concrete file by path, kind, and checksum.

The description section intentionally excludes provenance linkage fields. Directory produced_files[].path values are stored relative to subject.path, which keeps verification stable when a complete output directory is copied or archived elsewhere. subject.content_digest is computed from sorted tracked file paths, file checksums, and referenced child bundle digests.
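
To show why relative paths keep the digest stable, here is a minimal sketch of a digest built from sorted relative paths, file checksums, and child bundle digests. The exact byte layout the package hashes is not documented here, so the layout below and the name content_digest are assumptions for illustration only.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 checksum of a file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def content_digest(root: Path, relative_paths, child_digests=()) -> str:
    """Digest over sorted relative paths, file checksums, and child digests.

    The byte layout is an assumption, not the package's actual format.
    """
    hasher = hashlib.sha256()
    # Sorting makes the digest independent of discovery order.
    for relative in sorted(Path(p).as_posix() for p in relative_paths):
        hasher.update(relative.encode("utf-8"))
        hasher.update(b"\0")
        hasher.update(file_sha256(root / relative).encode("ascii"))
        hasher.update(b"\n")
    for digest in sorted(child_digests):
        hasher.update(b"child\0" + digest.encode("ascii") + b"\n")
    return hasher.hexdigest()
```

Because only paths relative to the root enter the hash, copying the whole directory elsewhere yields the same digest, while changing any tracked file changes it.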

Artifact Groups

Artifact groups are for homogeneous sets of files that researchers naturally understand as one output family: for example, 100 PNG plots of the same variable, one per acquisition day. A group stores the shared title, summary, kind, optional schema fields, and the concrete member paths. It can also store an informational selector, such as plots/*.png, to show how the group was chosen.

Rules of thumb:

Use artifact groups when many files have the same meaning.
Use individual artifacts for file-specific notes, exceptions, or overrides.
It is OK for an individual artifact to also appear in a group.
Do not rely on groups for integrity. subject.produced_files[] remains the complete checksum inventory.

Nested Directory Policy

Annotate the smallest thing you would share as a unit. If a directory is one research output, give that directory one data-annotations.json, even when its tracked files live in nested subdirectories.

Use recursive directory annotations for one bundle with nested files:

data-annotations annotate directory path/to/run-001 --recursive
data-annotations annotate directory path/to/run-001 --max-depth 2

Use child bundle annotations when a subdirectory is independently meaningful, shareable, or reusable. In that case, annotate the child directory first, then annotate the parent. The parent records a compact child_bundles[] reference with the child path, child annotation path, and child content digest; it does not copy the child file inventory into the parent JSON.

Post-hoc directory discovery follows the same rule. --recursive discovers nested files, but it stops at annotated child directories containing data-annotations.json and records them as child bundles.
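
The stopping rule can be sketched with os.walk by pruning any subdirectory that already contains data-annotations.json. The function discover below is a hypothetical illustration, not the CLI's implementation; skipping the root's own annotation file is an assumption.

```python
import os
from pathlib import Path

ANNOTATION_NAME = "data-annotations.json"


def discover(root: Path):
    """Return (tracked files, child bundles) as paths relative to root."""
    files, child_bundles = [], []
    for current, dirnames, filenames in os.walk(root):
        current_path = Path(current)
        kept = []
        for name in sorted(dirnames):
            child = current_path / name
            if (child / ANNOTATION_NAME).is_file():
                # Annotated child: record a reference and do not descend.
                child_bundles.append(child.relative_to(root))
            else:
                kept.append(name)
        # Editing dirnames in place tells os.walk which directories to visit.
        dirnames[:] = kept
        for name in sorted(filenames):
            if name == ANNOTATION_NAME:
                # Assumption: the bundle's own annotation is not tracked.
                continue
            files.append((current_path / name).relative_to(root))
    return sorted(files), sorted(child_bundles)
```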

Provenance Decorators And Writers

The data_annotations.provenance namespace provides provenance-only entry points. Prefer the decorators when you already have a small function that writes artifacts:

from pathlib import Path

from data_annotations.provenance import record_file_manifest


@record_file_manifest(artifact_kind="report")
def write_report(
    artifact_path: Path,
    input_path: Path,
    threshold: float = 0.5,
):
    artifact_path.parent.mkdir(parents=True, exist_ok=True)
    artifact_path.write_text(
        f"threshold applied: {threshold}\nsource={input_path.name}\n",
        encoding="utf-8",
    )


write_report(
    artifact_path=Path("outputs/summary.txt"),
    input_path=Path("data/raw/participants.csv"),
    threshold=0.75,
)

Use record_directory_manifest(...) for directory outputs. Directory decorators accept DocumentedArtifact, ProducedFile, (path, kind), and plain path-like return values. Provenance-only APIs do not accept description groups; use unified annotation or description APIs when groups should appear in the JSON or README.

If you want the direct writer approach instead, use write_file_manifest(...) and write_directory_manifest(...) (see examples/).

Description Layer

The data_annotations.description sub-package provides the structured description models used by annotation writers and the Markdown sidecar renderers. Within those models, the primary human-written narrative field is named summary.

Key public description models:

AllowedValue
FieldDefinition
DocumentedArtifact
DocumentedArtifactGroup
ArtifactDescription
ArtifactGroupDescription
FileDescription
DirectoryDescription

Description decorators and helpers:

record_file_description(...)
record_directory_description(...)
write_file_description(...)
write_directory_description(...)
render_file_readme(...)
render_directory_readme(...)

Alias helpers write_file_readme(...) and write_directory_readme(...) are supported.

Use the decorator forms when the description metadata is stable for a function, and use the direct helpers when you want to assemble descriptions per call.

Recovery Helpers

Use artifact_matches_manifest(...) to verify whether a detached artifact still matches an annotation document, and checkout_manifest_source(...) to recover the recorded code state from Git metadata.

from pathlib import Path

from data_annotations.provenance import (
    artifact_matches_manifest,
    checkout_manifest_source,
)

annotation_path = Path("outputs/participants.csv.annotation.json")
artifact_path = Path("downloads/participants.csv")

if artifact_matches_manifest(artifact_path, annotation_path):
    recovered = checkout_manifest_source(annotation_path)
    print(recovered.checkout_path)
    print(recovered.script_path)
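
Conceptually, matching a detached file artifact comes down to comparing its checksum with the subject.sha256 value stored in the annotation document. The sketch below shows only that comparison, using the standard library; matches_recorded_checksum is a hypothetical name, and artifact_matches_manifest(...) may perform additional checks.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_recorded_checksum(artifact_path: Path, annotation_path: Path) -> bool:
    """Compare the artifact's checksum with subject.sha256 in the annotation."""
    document = json.loads(annotation_path.read_text(encoding="utf-8"))
    recorded = document.get("subject", {}).get("sha256")
    return recorded is not None and recorded == sha256_of(artifact_path)
```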

Post-Hoc Annotation

The strongest workflow is to create provenance and description at the same time as the artifact itself. When annotations are written during generation, the package can capture runtime context directly and the resulting records are typically more complete, precise, and trustworthy.

For existing artifacts, the CLI provides a post-hoc annotation path so you can still attach provenance and description after the fact.

Post-hoc descriptions can still be very useful, but the quality of post-hoc provenance depends on how exact the supplied answers are. In particular, fields such as the generating script, command, function, Git commit, repository path, Git tags, git describe output, inputs, and parameters are only as reliable as the information entered during annotation.

CLI Workflow

This package provides a command-line interface (CLI) for retrospective annotation and provenance inspection.

For post-hoc annotation:

data-annotations annotate file path/to/participants.csv
data-annotations annotate directory path/to/run-001
data-annotations annotate directory path/to/run-001 --recursive
data-annotations annotate directory path/to/run-001 --max-depth 2
data-annotations annotate directory path/to/run-001 \
  --recursive \
  --group-selector "plots/*.png" \
  --group-title "Daily SMA plots" \
  --group-summary "Plots of the same variable on different days." \
  --group-kind plot

These commands prompt for missing details, write *.annotation.json or data-annotations.json, and optionally derive README sidecars. Post-hoc records are marked with capture_mode="post_hoc".

When group selectors are provided, the CLI expands them to concrete member paths at annotation time. Grouped files are tracked in subject.produced_files[] but are skipped by the per-file prompt flow, so you do not have to answer the same questions for every matching file.
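
As a rough illustration of selector expansion, the sketch below resolves a glob-style selector to concrete member paths and then separates the files that would still need individual prompts. The helper names are hypothetical, and the CLI's own matching rules may differ from pathlib's glob semantics.

```python
from pathlib import Path


def expand_selector(root: Path, selector: str) -> list:
    """Expand a selector such as "plots/*.png" to sorted relative file paths."""
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.glob(selector)
        if path.is_file()
    )


def ungrouped_files(tracked: list, members: list) -> list:
    """Return tracked files that are not covered by a group."""
    grouped = set(members)
    return [path for path in tracked if path not in grouped]
```

Every file stays in the tracked inventory either way; only the per-file questions are skipped for group members.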

For post-hoc provenance, use repeatable --git-tag and optional --git-describe when you know the original code state. These values are stored as human-readable hints; --git-sha remains the field used for recovery.

For provenance inspection and source recovery:

data-annotations provenance match path/to/artifact
data-annotations provenance checkout path/to/artifact

The match command auto-discovers *.annotation.json for files and data-annotations.json for directories, prints a verification summary, and suggests the exact checkout command to run next when Git recovery metadata is available.

Run With `uvx`

uvx --from "data-annotations[cli]" data-annotations provenance match path/to/participants.csv

Install And Use With `uv tool`

uv tool install "data-annotations[cli]"
data-annotations provenance match path/to/participants.csv

Run From Repository Root

From the repository root while developing locally, run task install first. That task uses uv sync --extra cli, so the CLI commands are available in the project environment. You can then run:

uv run data-annotations annotate file path/to/participants.csv
uv run data-annotations annotate directory path/to/run-001
uv run data-annotations provenance match path/to/participants.csv
uv run data-annotations provenance checkout path/to/participants.csv

API Overview

Annotation Models

FileArtifactSubject
DirectoryArtifactSubject
FileAnnotationDocument
DirectoryAnnotationDocument
FileAnnotationResult
DirectoryAnnotationResult

Annotation Decorators

record_file_annotation(...)
record_directory_annotation(...)

Annotation Functions

write_file_annotation(...)
write_directory_annotation(...)
annotate_file(...)
annotate_directory(...)

Description Models

AllowedValue
FieldDefinition
DocumentedArtifact
DocumentedArtifactGroup
ArtifactDescription
ArtifactGroupDescription
FileDescription
DirectoryDescription

Description Functions

record_file_description(...)
record_directory_description(...)
write_file_description(...)
write_directory_description(...)
write_file_readme(...)
write_directory_readme(...)
render_file_readme(...)
render_directory_readme(...)

Provenance Models

ProducedFile
ChildBundle
BaseProvenance
FileManifest
DirectoryManifest
RecoveredSource

Provenance Functions

record_file_manifest(...)
record_directory_manifest(...)
write_file_manifest(...)
write_directory_manifest(...)
directory_content_digest(...)
artifact_matches_manifest(...)
checkout_manifest_source(...)

Examples

Runnable examples live in examples/ and mirror the README workflows. Run them from the repository root with:

uv run python examples/record_file_annotation.py
uv run python examples/record_directory_annotation.py
uv run python examples/record_file_manifest.py
uv run python examples/record_directory_manifest.py
uv run python examples/record_file_description.py
uv run python examples/record_directory_description.py
uv run python examples/annotate_file.py
uv run python examples/annotate_directory.py
uv run python examples/write_file_manifest.py
uv run python examples/write_directory_manifest.py
uv run python examples/write_file_description.py
uv run python examples/write_directory_description.py
uv run python examples/recover_provenance.py
uv run python examples/recover_provenance_cli.py

Each example writes its outputs to a fresh temporary directory and prints the location so you can inspect the generated annotation documents and README sidecars.

Project details

Maintainers

rodrigocgpena

Release history

2.12.0 (Jun 22, 2026)
2.11.0 (Jun 19, 2026)
2.10.1 (Jun 18, 2026)
2.10.0 (Jun 17, 2026)
2.9.0 (Jun 17, 2026)
2.8.1 (Jun 16, 2026)
2.8.0 (Jun 15, 2026)
2.7.0 (Jun 4, 2026)
2.6.0 (Jun 3, 2026)
2.5.0 (Jun 2, 2026)
2.4.0 (Jun 1, 2026)
2.3.0 (May 29, 2026)
2.2.0 (May 29, 2026), this version
2.1.2 (Apr 30, 2026)

Download files

Source Distribution

data_annotations-2.2.0.tar.gz (36.8 kB)

Uploaded May 29, 2026 Source

Built Distribution

data_annotations-2.2.0-py3-none-any.whl (48.0 kB)

Uploaded May 29, 2026 Python 3

File details

Details for the file data_annotations-2.2.0.tar.gz.

File metadata

Download URL: data_annotations-2.2.0.tar.gz
Upload date: May 29, 2026
Size: 36.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for data_annotations-2.2.0.tar.gz
Algorithm	Hash digest
SHA256	`57fb9637b29f6eb1706d441b1e946dc6e11d2f3774c629817a7f85504ed74bc6`
MD5	`4e74e2102698676ec346ac0b1c9179c3`
BLAKE2b-256	`883232498b72ce14595354b81347c4ea82e2aa90e3cc1586a295eb8d3842cae1`

File details

Details for the file data_annotations-2.2.0-py3-none-any.whl.

File metadata

Download URL: data_annotations-2.2.0-py3-none-any.whl
Upload date: May 29, 2026
Size: 48.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for data_annotations-2.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`46a358492411df373d75d985fa7aa20f23fd8ec87a00b28b11682f0bd356289a`
MD5	`d15054e2d79c485a1b179dbc1a3cee03`
BLAKE2b-256	`78013a410cfb13b100ba5c128850e25674b4ac2b95f84180772df07a626de16b`

data-annotations 2.2.0
