Locus — turn any unstructured corpus into validated, source-grounded tabular data. CLI: `locus`.

Locus

Turn any unstructured corpus into validated, source-grounded tabular data — ready to feed an LLM.

Locus packages data operations as reusable, versioned images. You pull an image, point it at your own data, run it locally, and get a clean table where every cell carries its source location and a faithfulness score. Images compose into pipelines, and you can publish your own to Locus Hub (public or private).
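To make the "every cell carries its source location and a faithfulness score" idea concrete, here is a minimal sketch of what such a cell could look like in Python. The class names, the fields, and the 0.0 to 1.0 score range are illustrative assumptions, not the actual Locus schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceLocation:
    """Where a value was found (hypothetical fields, not the Locus schema)."""
    uri: str
    line: int


@dataclass(frozen=True)
class GroundedCell:
    """A cell value paired with its source location and a faithfulness score."""
    value: str
    source: SourceLocation
    faithfulness: float  # assumed here to lie in [0.0, 1.0]


def cells_needing_review(row: dict[str, GroundedCell], threshold: float = 0.9) -> list[str]:
    """Return the column names whose faithfulness falls below the threshold."""
    return [name for name, cell in row.items() if cell.faithfulness < threshold]


# Example row with made-up scores.
row = {
    "name": GroundedCell("Acme", SourceLocation("data.csv", 2), 1.0),
    "amount": GroundedCell("100", SourceLocation("data.csv", 2), 0.72),
}
print(cells_needing_review(row))
```

Keeping the score next to the value is what lets a downstream step decide, per cell, whether to trust it or send it to review.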

Install

pip install locus-etl                  # core: CLI + engine + CSV/records/provenance
pip install "locus-etl[standard]"      # + PDF, HTML, SQL, normalize, result UI
pip install "locus-etl[all]"           # everything, incl. OCR/LLM/embeddings (heavy)

The CLI command is locus (Python 3.11+). Quote the brackets — your shell treats them as a glob otherwise (zsh: no matches found).

Pick targeted extras if you prefer a lean install: pip install "locus-etl[pdf,serve,llm,oci]"

Available extras: pdf (PDF parsing), html, sql, normalize, serve (Hub/result UI), llm (LLM engine), ocr, dedup, embeddings, docling, oci (OCI registry), docker.
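If you want to confirm which extras an installed release actually declares, the standard library can read them from the package metadata. This is a small sketch using importlib.metadata; the helper name declared_extras is ours, and it returns an empty list when the distribution is not installed.

```python
from importlib.metadata import PackageNotFoundError, metadata


def declared_extras(dist_name: str) -> list[str]:
    """Return the extras a distribution declares, or [] if it is not installed."""
    try:
        meta = metadata(dist_name)
    except PackageNotFoundError:
        return []
    # "Provides-Extra" appears once per extra in the core metadata.
    return sorted(meta.get_all("Provides-Extra") or [])


print(declared_extras("locus-etl"))
```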

Quickstart

locus catalog list                  # see the official image catalog
printf 'name,amount\nAcme,100\nGlobex,200\n' > data.csv
cat > locusfile.yaml <<EOF
image: doc-to-tables
source: { type: files, path: ./data.csv }
EOF
locus run locusfile.yaml --export out.csv   # grounded table + _lineage column
locus run locusfile.yaml --serve            # preview UI with per-cell provenance
locus hub                                    # browse the catalog in a local web UI
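Once the run has written out.csv, the export can be read with ordinary CSV tooling. The sketch below separates each row's data fields from its _lineage field using only the standard library. The format of the lineage value is not described here, so the code treats it as an opaque string, and the sample text is a stand-in rather than real Locus output.

```python
import csv
import io


def split_lineage(csv_text: str, lineage_column: str = "_lineage"):
    """Separate each row's data fields from its lineage field."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        lineage = record.pop(lineage_column, None)  # None if the column is absent
        rows.append((record, lineage))
    return rows


# Stand-in for the contents of out.csv; the lineage values are placeholders.
sample = "name,amount,_lineage\nAcme,100,<lineage-1>\nGlobex,200,<lineage-2>\n"
for data, lineage in split_lineage(sample):
    print(data, "<-", lineage)
```

To use it on a real export, pass the file contents instead, for example split_lineage(open("out.csv", newline="").read()).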

Architecture (layered)

Locus layered architecture

The diagram is generated from docs/generate_architecture_diagram.py (PNG + SVG in docs/).

Layer summary

Layer | Spec | Responsibility
Layer 1: Engine | unstructured-to-tabular-etl | Raw corpus -> validated, source-grounded table. Connectors, parsing, extraction, cleaning, the cell-level grounding/faithfulness contract, review. Embedded inside every image.
Layer 2: Runtime | locus-image-runtime | Packaging, CLI, Locusfile, image pull, multi-image composition (DAG), typed stage interchange, cross-stage provenance, serve/export, and publishing to Locus Hub.

Key properties

  • Local-first. Default runtime is a plain Python process — no daemon, no Linux VM. Docker is an optional backend.
  • Privacy is explicit. Deterministic engine keeps data local; the LLM engine activates only when you add a key, with a consent notice before any data leaves.
  • Trust travels with the data. Provenance and faithfulness survive every pipeline stage, from extraction through merge and redaction.
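The privacy rule above amounts to a simple gate: stay on the deterministic engine unless a key is present and the user has seen the consent notice. A minimal sketch of that decision follows. The variable name LLM_API_KEY and the idea that the notice must be explicitly acknowledged are assumptions made for illustration; Locus may name and handle these differently.

```python
def choose_engine(env: dict[str, str], consent_given: bool) -> str:
    """Pick the engine: deterministic unless a key is set and consent is given."""
    # LLM_API_KEY is a hypothetical variable name, not a documented Locus setting.
    has_key = bool(env.get("LLM_API_KEY", "").strip())
    if has_key and consent_given:
        return "llm"
    return "deterministic"


print(choose_engine({}, consent_given=False))
print(choose_engine({"LLM_API_KEY": "example"}, consent_given=True))
```

Defaulting to the deterministic branch means a missing key or a declined notice can never cause data to leave the machine.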

Documentation

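Detailed design lives in the spec documents named in the layer summary: unstructured-to-tabular-etl (Layer 1, engine) and locus-image-runtime (Layer 2, runtime).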

Status

Layer 1 engine: feature-complete. Layer 2 runtime: feature-complete (all 12 build stages done).

The engine turns a raw corpus into a validated, source-grounded table with cell-level provenance. It includes a deterministic default engine and an opt-in, guardrailed LLM engine; cleaning and dedup; human-in-the-loop review; file/HTTP/REST/SQL connectors; CSV/PDF/HTML/records parsers; and DataFrame/Parquet/SQL emitters.

The locus CLI runs single images and multi-stage pipelines (a typed DAG with a static type-check and cross-stage provenance), builds, publishes, and pulls images via a local registry, and serves a local result UI with the provenance viewer.

Test suite: 208 tests, with CI on Python 3.11/3.12 (ruff + mypy strict + pytest).

pip install locus-etl        # CLI command is `locus`; extras: [pdf] [llm] [serve] [oci] ...
locus init                   # gitignore .env
locus catalog list           # see the official image catalog
locus run locusfile.yaml     # run a pipeline, get a grounded table
locus run locusfile.yaml --serve --port 8080   # preview UI with provenance
locus hub                    # browse the image catalog in a local web UI
locus build / push / pull / search / inspect   # image lifecycle

Remaining work is the official image catalog and a hub-side discovery index. The OCI/Harbor registry backend is implemented (OrasImageStore): set LOCUS_REGISTRY (and optionally LOCUS_NAMESPACE) to push/pull/inspect against Harbor, GHCR, ECR, or any OCI registry; otherwise a local filesystem registry is the zero-config default.

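# Use the engine directly from Python instead of through the CLI.
from locus_engine import (
    Pipeline, PipelineConfig, PluginRegistry,
    FileConnector, CsvParser, Connector, Parser, SourceRef,
)

# Register the plugins this pipeline needs: a file connector and a CSV parser.
registry = PluginRegistry()
registry.register(FileConnector(), Connector)
registry.register(CsvParser(), Parser)

# Build the pipeline from an in-memory config that points at a local folder.
config = PipelineConfig.load({"source": {"type": "files", "path": "./data"}})
pipeline = Pipeline(config, registry)

# Run on a single source file, then emit the result as a table.
out = pipeline.run([SourceRef(uri="./data/invoices.csv", kind="file")])
frame = pipeline.emit(out)          # pandas DataFrame with a _lineage column
print(frame)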

Contributing

See CONTRIBUTING.md.

License

MIT © 2026 Dibae101
