A pluggable framework for parsing data tool artifacts into typed Python models — dbt-core first.

artifact-parser

A small, pluggable framework for turning the JSON artifacts that data tools spit out into typed, validated Python objects. Point it at a blob, get back a pydantic model — no manual key-spelunking, no guessing which schema version you're holding.

The framework is deliberately source-agnostic. Each plugin owns one family of artifacts and registers itself with a shared registry. The first one ships in the box: a full dbt-core parser (catalog, manifest, run-results, sources).

Install

uv add artifact-parser     # or: pip install artifact-parser

Quick start

The headline entry point sniffs any supported artifact and routes it to the right plugin — you don't have to know what you're holding:

import json
from artifact_parser import parse

with open("target/manifest.json") as fh:   # the context manager closes the file
    artifact = json.load(fh)

model = parse(artifact)          # -> a ManifestV12 (or whatever version it is)
print(model.metadata.dbt_schema_version)

When you do know the artifact family, the dbt plugin's typed helpers are more precise (and give better editor autocomplete):

import json
from pathlib import Path

from artifact_parser.dbt import parse_manifest, parse_catalog

manifest = parse_manifest(json.loads(Path("target/manifest.json").read_text()))
catalog = parse_catalog(json.loads(Path("target/catalog.json").read_text()))

Hand it something it doesn't recognise and it tells you so, loudly, instead of returning a half-populated object:

from artifact_parser import parse, UnknownArtifactError

try:
    parse({"metadata": {"dbt_schema_version": "made-up/v99.json"}})
except UnknownArtifactError as exc:
    print(exc)   # No registered parser recognises this artifact. Tried: dbt.

Supported dbt artifacts

Artifact      Versions   Generic parser       Version-pinned parsers
catalog       v1         parse_catalog        parse_catalog_v1
manifest      v1–v12     parse_manifest       parse_manifest_v1_v12
run-results   v1–v6      parse_run_results    parse_run_results_v1_v6
sources       v1–v3      parse_sources        parse_sources_v1_v3

Architecture

src/artifact_parser/
├── core/                 # the framework — no knowledge of any specific tool
│   ├── base.py           #   BaseArtifactModel (shared pydantic root)
│   ├── parser.py         #   ArtifactParser (the plugin contract)
│   ├── registry.py       #   ParserRegistry + the shared `registry` instance
│   └── exceptions.py     #   ArtifactParserError + friends
└── dbt/                  # the first plugin: dbt-core artifacts
    ├── plugin.py         #   DbtArtifactParser (implements ArtifactParser)
    ├── utils.py          #   schema-version sniffing
    ├── resources/        #   committed dbt-core JSON schemas (codegen input)
    └── generated/        #   droppable, rebuilt by `codegen dbt`
        ├── parser.py     #     parse_<artifact>[_vN] public API
        ├── version_map.py#     schema-version URL -> model class
        └── models/       #     typed pydantic models, one module per version

The generated code is walled off in generated/. You can rm -rf that whole directory and rebuild it with codegen dbt (the package still imports while it's gone — the dbt plugin just sits out until you regenerate).
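To make the roles of utils.py (schema-version sniffing) and version_map.py more concrete, here is a minimal sketch of how an artifact's metadata.dbt_schema_version could be turned into a lookup key. It is not the package's actual implementation: the function name sniff_schema_version, the regular expression, and the tuple-keyed VERSION_MAP are all assumptions made for illustration. The only thing taken as given is the shape of dbt's published schema URLs (https://schemas.getdbt.com/dbt/<artifact>/v<N>.json).

```python
from __future__ import annotations

import re

# Assumption: schema URLs end in /dbt/<artifact>/v<N>.json
_SCHEMA_URL = re.compile(r"/dbt/(?P<artifact>[a-z-]+)/v(?P<version>\d+)\.json$")


def sniff_schema_version(artifact: dict) -> tuple[str, int] | None:
    """Return (artifact_type, version) read from metadata, or None if it can't be read."""
    metadata = artifact.get("metadata")
    if not isinstance(metadata, dict):
        return None
    url = metadata.get("dbt_schema_version")
    if not isinstance(url, str):
        return None
    match = _SCHEMA_URL.search(url)
    if match is None:
        return None
    return match["artifact"], int(match["version"])


# Hypothetical stand-in for version_map.py. The real map is described as
# "schema-version URL -> model class"; strings are used here as placeholders.
VERSION_MAP = {("manifest", 12): "ManifestV12"}

blob = {"metadata": {"dbt_schema_version": "https://schemas.getdbt.com/dbt/manifest/v12.json"}}
key = sniff_schema_version(blob)
print(key, VERSION_MAP.get(key))
```

Returning None for anything unreadable keeps the "is this mine?" decision cheap; the caller decides whether that means "not a dbt artifact" or an error.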

The flow: a plugin answers "is this mine?" (can_parse) and "make it typed" (parse). The registry tries plugins in registration order and returns the first match. dbt registers itself on import, so parse(...) works out of the box.
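The first-match dispatch described above can be sketched in a few lines. This is a self-contained toy, not the code in core/registry.py: the class and method names mirror the ones the README mentions, but their signatures, the KeyParser helper, and the error message format are assumptions.

```python
from __future__ import annotations

from typing import Any, Protocol


class UnknownArtifactError(Exception):
    """Raised when no registered plugin claims the artifact."""


class ArtifactParser(Protocol):
    """Assumed shape of the plugin contract: a name plus two methods."""

    name: str

    def can_parse(self, artifact: dict[str, Any]) -> bool: ...

    def parse(self, artifact: dict[str, Any]) -> Any: ...


class ParserRegistry:
    def __init__(self) -> None:
        self._parsers: list[ArtifactParser] = []

    def register(self, parser: ArtifactParser) -> None:
        self._parsers.append(parser)  # list order is registration order

    def parse(self, artifact: dict[str, Any]) -> Any:
        for parser in self._parsers:
            if parser.can_parse(artifact):
                return parser.parse(artifact)  # first match wins
        tried = ", ".join(p.name for p in self._parsers)
        raise UnknownArtifactError(
            f"No registered parser recognises this artifact. Tried: {tried}."
        )


class KeyParser:
    """Hypothetical plugin that claims any artifact containing a given key."""

    def __init__(self, name: str, key: str) -> None:
        self.name = name
        self.key = key

    def can_parse(self, artifact: dict[str, Any]) -> bool:
        return self.key in artifact

    def parse(self, artifact: dict[str, Any]) -> Any:
        return (self.name, artifact[self.key])


registry = ParserRegistry()
registry.register(KeyParser("first", "nodes"))
registry.register(KeyParser("second", "nodes"))  # also matches, but is never reached
print(registry.parse({"nodes": {}}))
```

Because the first match wins, two plugins whose can_parse checks overlap will resolve by registration order, so a narrow can_parse is the safer design.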

Adding a new parser

The whole point of the core/ framework is that the second parser is cheap. By hand:

  1. Create src/artifact_parser/<tool>/.
  2. Define your models on BaseArtifactModel.
  3. Implement ArtifactParser (name, can_parse, parse) in plugin.py.
  4. Register it in that package's __init__.py: registry.register(MyParser()).
  5. Import your plugin from the top-level artifact_parser/__init__.py so the registration runs on import.

That's it — parse() now routes matching artifacts to your plugin.
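As a rough illustration of steps 2 and 3, the sketch below defines a plugin for a made-up tool. So that it runs without the package installed, it uses local stand-ins: an abstract ArtifactParser whose method signatures are assumed, and a plain dataclass where a real plugin would subclass BaseArtifactModel. The names ExampleReport, ExampleToolParser, and the "exampletool" key are hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


# Stand-in for the contract in core/parser.py; the real signatures may differ.
class ArtifactParser(ABC):
    name: str

    @abstractmethod
    def can_parse(self, artifact: dict[str, Any]) -> bool: ...

    @abstractmethod
    def parse(self, artifact: dict[str, Any]) -> Any: ...


# Step 2: the typed model (a real plugin would build on BaseArtifactModel).
@dataclass(frozen=True)
class ExampleReport:
    tool_version: str
    rows: int


# Step 3: the plugin itself.
class ExampleToolParser(ArtifactParser):
    name = "exampletool"

    def can_parse(self, artifact: dict[str, Any]) -> bool:
        # Claim only artifacts carrying this tool's top-level marker.
        return isinstance(artifact.get("exampletool"), dict)

    def parse(self, artifact: dict[str, Any]) -> ExampleReport:
        body = artifact["exampletool"]
        return ExampleReport(tool_version=str(body["version"]), rows=int(body["rows"]))


# Step 4 would be registry.register(ExampleToolParser()) in the plugin package's __init__.py.
parser = ExampleToolParser()
blob = {"exampletool": {"version": "0.1", "rows": 3}}
if parser.can_parse(blob):
    print(parser.parse(blob))
```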

Development

This project uses uv and Task. Common targets:

Goal                          Task
Sync the environment          task install
Format + autofix              task format
Lint (format-check + ruff)    task lint
Run tests at 100% coverage    task test

task --list shows everything. The test suite enforces 100% coverage of the framework and dbt dispatch code (the generated dbt models are excluded — they're schema, not logic). Beyond the synthetic fixtures, real artifacts from a live dbt build live in tests/data/ and round-trip through the public parse() in tests/artifact_parser/dbt/test_roundtrip.py — the only tests that exercise populated nodes end to end.

One non-obvious rule the generator enforces: the generated models are relaxed to pydantic extra="ignore" (not the extra="forbid" dbt's schemas imply), because real artifacts carry fields the published schema omits. A strict model would reject a perfectly good manifest.json. See CLAUDE.md for the why.
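The difference between the two policies is easy to see in miniature. The sketch below deliberately avoids pydantic and mimics "ignore" versus "forbid" with stdlib dataclasses, so it is an analogy rather than the generator's code; the load helper and the some_new_field key are invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Any


@dataclass
class Metadata:
    dbt_schema_version: str
    dbt_version: str


def load(cls: type, data: dict[str, Any], extra: str = "ignore") -> Any:
    """Build cls from data, either dropping or rejecting keys cls doesn't declare."""
    known = {f.name for f in fields(cls)}
    unknown = sorted(set(data) - known)
    if unknown and extra == "forbid":
        raise ValueError(f"unexpected fields: {unknown}")
    return cls(**{k: v for k, v in data.items() if k in known})


payload = {
    "dbt_schema_version": "https://schemas.getdbt.com/dbt/manifest/v12.json",
    "dbt_version": "1.8.0",
    "some_new_field": True,  # hypothetical field missing from the published schema
}

print(load(Metadata, payload))  # lenient: the extra key is dropped

try:
    load(Metadata, payload, extra="forbid")
except ValueError as exc:
    print(exc)  # strict: the same payload is rejected
```

In pydantic itself the equivalent switch is the model's extra setting, which is what the generator is described as relaxing.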

CI

GitHub Actions back the same gates:

Workflow           What it does
ci.yml             Lint + 100%-coverage tests on Python 3.10–3.13, plus a codegen-in-sync job that fails if the committed generated/ drifts from a fresh regen.
schema-watch.yml   Weekly (and on demand): probes dbt's published schemas, regenerates, and opens a PR if a new version appeared.
release.yml        Build + coverage gate, then PyPI Trusted Publishing on a published Release (or TestPyPI via manual dispatch).

Action versions and Python deps are kept current by Dependabot.

Agentic setup

This repo is wired for Claude Code: a project CLAUDE.md, a parser-author subagent that owns src/, slash commands (/test, /codegen), secret-blocking and post-edit lint hooks, and the context7 MCP for pulling fresh library docs. See CLAUDE.md for the full tour. It will not write your code for you, but it tries hard to keep you from shipping a failing coverage gate.
