Locus — turn any unstructured corpus into validated, source-grounded tabular data. CLI: `locus`.

Locus

Turn any unstructured corpus into validated, source-grounded tabular data — ready to feed an LLM.

Locus packages data operations as reusable, versioned images. You pull an image, point it at your own data, run it locally, and get a clean table where every cell carries its source location and a faithfulness score. Images compose into pipelines, and you can publish your own to Locus Hub (public or private).
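
To make "every cell carries its source location and a faithfulness score" concrete, here is a minimal sketch of what such a cell could look like. The `GroundedCell` class, its field names, the line-level granularity, and the 0.0 to 1.0 score range are all assumptions made for illustration; they are not the Locus data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedCell:
    """One table cell plus the evidence behind it (hypothetical shape)."""
    value: str
    source_uri: str      # where the value was read from
    line: int            # 1-based line in the source (assumed granularity)
    faithfulness: float  # assumed range 0.0-1.0; higher means closer to the source


def low_confidence(row: dict[str, GroundedCell], threshold: float = 0.9) -> list[str]:
    """Return the column names whose cells score below the threshold."""
    return [name for name, cell in row.items() if cell.faithfulness < threshold]


row = {
    "name": GroundedCell("Acme", "data.csv", 2, 1.0),
    "amount": GroundedCell("100", "data.csv", 2, 0.72),
}
print(low_confidence(row))  # ['amount']
```

Keeping the score next to the value means a reviewer can filter for doubtful cells without re-reading the source.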

Install

pip install locus-etl          # the CLI command is `locus`

Optional extras: pip install "locus-etl[pdf,serve,llm,oci]" (PDF parsing, Hub/result UI, LLM engine, OCI registry).

Quickstart

locus catalog list                  # see the official image catalog
printf 'name,amount\nAcme,100\nGlobex,200\n' > data.csv
cat > locusfile.yaml <<EOF
image: doc-to-tables
source: { type: files, path: ./data.csv }
EOF
locus run locusfile.yaml --export out.csv   # grounded table + _lineage column
locus run locusfile.yaml --serve            # preview UI with per-cell provenance
locus hub                                    # browse the catalog in a local web UI
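
Once `out.csv` exists, the `_lineage` column can be separated from the data columns with the standard library alone. The sketch below uses an in-memory stand-in for the file; the `_lineage` values in it are invented placeholders, since the format Locus actually writes is not described here, and `split_lineage` is a hypothetical helper.

```python
import csv
import io

# Stand-in for the contents of out.csv. The _lineage values are invented
# placeholders, not the format Locus actually writes.
SAMPLE = "name,amount,_lineage\nAcme,100,data.csv:2\nGlobex,200,data.csv:3\n"


def split_lineage(text: str, lineage_col: str = "_lineage"):
    """Return (data_rows, lineage_values), treating lineage as an opaque string."""
    rows, lineage = [], []
    for record in csv.DictReader(io.StringIO(text)):
        lineage.append(record.pop(lineage_col))
        rows.append(record)
    return rows, lineage


rows, lineage = split_lineage(SAMPLE)
print(rows[0], lineage[0])
```

For a real run, replace `io.StringIO(text)` with an open file handle on `out.csv`.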

Architecture (layered)

Figure: Locus layered architecture

The diagram is generated from docs/generate_architecture_diagram.py (PNG + SVG in docs/).

Layer summary

| Layer | Spec | Responsibility |
|---|---|---|
| Layer 1 — Engine | unstructured-to-tabular-etl | Raw corpus -> validated, source-grounded table. Connectors, parsing, extraction, cleaning, the cell-level grounding/faithfulness contract, review. Embedded inside every image. |
| Layer 2 — Runtime | locus-image-runtime | Packaging, CLI, Locusfile, image pull, multi-image composition (DAG), typed stage interchange, cross-stage provenance, serve/export, and publishing to Locus Hub. |

Key properties

  • Local-first. Default runtime is a plain Python process — no daemon, no Linux VM. Docker is an optional backend.
  • Privacy is explicit. Deterministic engine keeps data local; the LLM engine activates only when you add a key, with a consent notice before any data leaves.
  • Trust travels with the data. Provenance and faithfulness survive every pipeline stage, from extraction through merge and redaction.
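
The privacy rule above (deterministic by default, LLM only with a key and after a consent notice) can be sketched as a small decision function. The environment variable name `LLM_API_KEY`, the notice text, and the fallback to the deterministic engine when consent is declined are assumptions for illustration, not documented Locus behavior.

```python
from collections.abc import Callable, Mapping

NOTICE = "The LLM engine sends document text to an external API. Continue?"


def choose_engine(
    env: Mapping[str, str],
    confirm: Callable[[str], bool],
    key_var: str = "LLM_API_KEY",   # hypothetical variable name
) -> str:
    """Pick an engine; never select the LLM engine without a key and consent."""
    if not env.get(key_var):
        return "deterministic"      # no key: data stays local, no prompt shown
    # A key is present: show the notice before any data would leave the machine.
    return "llm" if confirm(NOTICE) else "deterministic"


print(choose_engine({}, confirm=lambda notice: True))  # deterministic
```

In a CLI, `confirm` would typically wrap an interactive prompt; passing it in as a callable keeps the decision logic testable.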

Documentation

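Detailed design lives in the spec documents named in the layer summary above: unstructured-to-tabular-etl (engine) and locus-image-runtime (runtime).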

Status

Layer 1 engine: feature-complete. Layer 2 runtime: feature-complete (all 12 build stages done).

Layer 1 turns a raw corpus into a validated, source-grounded table with cell-level provenance. It provides a deterministic default engine and an opt-in, guardrailed LLM engine; cleaning and deduplication; human-in-the-loop review; file, HTTP, REST, and SQL connectors; CSV, PDF, HTML, and records parsers; and DataFrame, Parquet, and SQL emitters.

The locus CLI runs single images and multi-stage pipelines (a typed DAG with a static type check and cross-stage provenance), builds, publishes, and pulls images via a local registry, and serves a local result UI with the provenance viewer.

The test suite has 208 tests, and CI runs on Python 3.11 and 3.12 (ruff, mypy strict, pytest).

pip install locus-etl        # CLI command is `locus`; extras: [pdf] [llm] [serve] [oci] ...
locus init                   # gitignore .env
locus catalog list           # see the official image catalog
locus run locusfile.yaml     # run a pipeline, get a grounded table
locus run locusfile.yaml --serve --port 8080   # preview UI with provenance
locus hub                    # browse the image catalog in a local web UI
locus build / push / pull / search / inspect   # image lifecycle

Remaining work is the official image catalog and a hub-side discovery index. The OCI/Harbor registry backend is implemented (OrasImageStore): set LOCUS_REGISTRY (and optionally LOCUS_NAMESPACE) to push/pull/inspect against Harbor, GHCR, ECR, or any OCI registry; otherwise a local filesystem registry is the zero-config default.

Using the engine directly from Python:

from locus_engine import (
    Pipeline, PipelineConfig, PluginRegistry,
    FileConnector, CsvParser, Connector, Parser, SourceRef,
)

# Register a file connector and a CSV parser under their plugin interfaces.
registry = PluginRegistry()
registry.register(FileConnector(), Connector)
registry.register(CsvParser(), Parser)

# Build the pipeline from a config dict and the registry.
config = PipelineConfig.load({"source": {"type": "files", "path": "./data"}})
pipeline = Pipeline(config, registry)

# Run on one source file, then emit the result as a table.
out = pipeline.run([SourceRef(uri="./data/invoices.csv", kind="file")])
frame = pipeline.emit(out)          # pandas DataFrame with a _lineage column
print(frame)

Contributing

See CONTRIBUTING.md.

License

MIT © 2026 Dibae101
