Locus — turn any unstructured corpus into validated, source-grounded tabular data. CLI: `locus`.

Locus

Turn any unstructured corpus into validated, source-grounded tabular data — ready to feed an LLM.

Locus packages data operations as reusable, versioned images. You pull an image, point it at your own data, run it locally, and get a clean table where every cell carries its source location and a faithfulness score. Images compose into pipelines, and you can publish your own to Locus Hub (public or private).
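To make "every cell carries its source location and a faithfulness score" concrete, here is a minimal sketch of what such a cell could look like. The class names, fields, and the toy scoring rule are assumptions for illustration only, not the Locus schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceLocation:
    """Where a value was found (hypothetical fields, not the Locus schema)."""
    uri: str
    start: int  # character offset where the supporting span begins
    end: int    # character offset one past the end of the span


@dataclass(frozen=True)
class GroundedCell:
    """A table cell that carries its own evidence."""
    value: str
    source: SourceLocation
    faithfulness: float  # assumed here to lie in [0.0, 1.0]


def ground(value: str, uri: str, text: str) -> GroundedCell:
    """Locate `value` verbatim in `text` and attach a score.

    Toy rule (an assumption): 1.0 for a verbatim match, 0.0 when absent.
    """
    start = text.find(value)
    if start < 0:
        return GroundedCell(value, SourceLocation(uri, 0, 0), 0.0)
    location = SourceLocation(uri, start, start + len(value))
    return GroundedCell(value, location, 1.0)


text = "Invoice 1042 total: 318.50 EUR"
cell = ground("318.50", "./data/invoices.txt", text)
# The stored offsets point back at the exact span that supports the value.
print(text[cell.source.start:cell.source.end], cell.faithfulness)
```

A real engine would presumably score paraphrased or normalized values as well; the point here is only that the evidence travels inside the cell rather than in a side channel.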

Architecture (layered)

[Figure: Locus layered architecture]

The diagram is generated from docs/generate_architecture_diagram.py (PNG + SVG in docs/).

Layer summary

| Layer | Spec | Responsibility |
| --- | --- | --- |
| Layer 1 — Engine | unstructured-to-tabular-etl | Raw corpus -> validated, source-grounded table. Connectors, parsing, extraction, cleaning, the cell-level grounding/faithfulness contract, review. Embedded inside every image. |
| Layer 2 — Runtime | locus-image-runtime | Packaging, CLI, Locusfile, image pull, multi-image composition (DAG), typed stage interchange, cross-stage provenance, serve/export, and publishing to Locus Hub. |
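The runtime's "multi-image composition (DAG)" with "typed stage interchange" can be pictured with a small sketch: check that each stage consumes what its upstream produces, then derive an execution order. The `Stage` class, the type tags, and `check_pipeline` are hypothetical names for illustration; the actual Locusfile format and checker are not shown in this README.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Optional


@dataclass(frozen=True)
class Stage:
    """One pipeline stage (hypothetical shape, not the Locusfile schema)."""
    name: str
    consumes: Optional[str]       # type tag expected on input; None for a source
    produces: str                 # type tag of the stage's output
    after: tuple[str, ...] = ()   # names of upstream stages


def check_pipeline(stages: list[Stage]) -> list[str]:
    """Type-check every edge, then return an execution order.

    Raises before any stage runs: ValueError for an unknown upstream,
    TypeError for a type mismatch, graphlib.CycleError for a cycle.
    """
    by_name = {stage.name: stage for stage in stages}
    for stage in stages:
        for dep in stage.after:
            if dep not in by_name:
                raise ValueError(f"{stage.name}: unknown upstream stage {dep!r}")
            produced = by_name[dep].produces
            if produced != stage.consumes:
                raise TypeError(
                    f"{dep} produces {produced!r} but "
                    f"{stage.name} consumes {stage.consumes!r}"
                )
    graph = {stage.name: set(stage.after) for stage in stages}
    return list(TopologicalSorter(graph).static_order())


stages = [
    Stage("extract", None, "RawTable"),
    Stage("clean", "RawTable", "CleanTable", after=("extract",)),
    Stage("redact", "CleanTable", "CleanTable", after=("clean",)),
]
print(check_pipeline(stages))
```

Checking the edges statically means a mis-wired pipeline fails at load time instead of halfway through a long extraction run.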

Key properties

  • Local-first. Default runtime is a plain Python process — no daemon, no Linux VM. Docker is an optional backend.
  • Privacy is explicit. Deterministic engine keeps data local; the LLM engine activates only when you add a key, with a consent notice before any data leaves.
  • Trust travels with the data. Provenance and faithfulness survive every pipeline stage, from extraction through merge and redaction.
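The privacy rule above (deterministic by default, LLM only with a key and after a consent notice) can be sketched as a small decision function. The environment variable name `LLM_API_KEY`, the notice text, and the function itself are assumptions for illustration; they are not taken from the Locus codebase.

```python
from typing import Callable, Mapping


def choose_engine(env: Mapping[str, str], confirm: Callable[[str], bool]) -> str:
    """Pick an engine name without ever defaulting to the remote path.

    `env` is a mapping such as os.environ; `confirm` shows the notice to
    the user and returns True only on explicit consent.
    """
    key = env.get("LLM_API_KEY", "").strip()  # hypothetical variable name
    if not key:
        # No key configured: data never leaves the machine.
        return "deterministic"
    notice = "The LLM engine will send extracted text to a remote provider. Continue?"
    # A key alone is not enough; the user must also accept the notice.
    return "llm" if confirm(notice) else "deterministic"


print(choose_engine({}, lambda notice: True))
print(choose_engine({"LLM_API_KEY": "example"}, lambda notice: False))
print(choose_engine({"LLM_API_KEY": "example"}, lambda notice: True))
```

Passing `env` and `confirm` as arguments keeps the decision testable and makes the "no key, no network" path the one that needs no configuration at all.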

Documentation

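Detailed design lives in the spec documents named in the layer summary: unstructured-to-tabular-etl (Layer 1) and locus-image-runtime (Layer 2).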

Status

Layer 1 engine: feature-complete. Layer 2 runtime: feature-complete (all 12 build stages done).

The engine turns a raw corpus into a validated, source-grounded table with cell-level provenance. It provides:

  • a deterministic default engine and an opt-in, guardrailed LLM engine;
  • cleaning and deduplication;
  • human-in-the-loop review;
  • file, HTTP, REST, and SQL connectors; CSV, PDF, HTML, and records parsers; and DataFrame, Parquet, and SQL emitters.

The locus CLI runs single images and multi-stage pipelines (a typed DAG with static type-checking and cross-stage provenance), builds, publishes, and pulls images via a local registry, and serves a local result UI with the provenance viewer.

The test suite has 208 tests. CI runs on Python 3.11 and 3.12 with ruff, mypy (strict), and pytest.

pip install locus-etl        # CLI command is `locus`; extras: [pdf] [llm] [serve] [oci] ...
locus init                   # initialize the project and gitignore .env
locus catalog list           # see the official image catalog
locus run locusfile.yaml     # run a pipeline, get a grounded table
locus run locusfile.yaml --serve --port 8080   # preview UI with provenance
locus hub                    # browse the image catalog in a local web UI
locus build / push / pull / search / inspect   # image lifecycle

Remaining work is the official image catalog and a hub-side discovery index. The OCI/Harbor registry backend is implemented (OrasImageStore): set LOCUS_REGISTRY (and optionally LOCUS_NAMESPACE) to push/pull/inspect against Harbor, GHCR, ECR, or any OCI registry; otherwise a local filesystem registry is the zero-config default.
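The backend choice described above (OCI when LOCUS_REGISTRY is set, local filesystem otherwise) can be sketched as follows. Only the two environment variable names come from this README; `RegistryTarget`, `resolve_registry`, and the default local path are hypothetical and do not describe how OrasImageStore is actually wired.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RegistryTarget:
    backend: str   # "oci" or "local"
    location: str  # registry host (plus namespace) or a filesystem path


def resolve_registry(
    env: Mapping[str, str],
    local_root: str = "~/.locus/registry",  # assumed default, for illustration
) -> RegistryTarget:
    """Choose the image store from the environment."""
    registry = env.get("LOCUS_REGISTRY", "").strip().rstrip("/")
    if not registry:
        # Zero-config default: a registry on the local filesystem.
        return RegistryTarget("local", os.path.expanduser(local_root))
    namespace = env.get("LOCUS_NAMESPACE", "").strip().strip("/")
    location = f"{registry}/{namespace}" if namespace else registry
    return RegistryTarget("oci", location)


print(resolve_registry({"LOCUS_REGISTRY": "ghcr.io", "LOCUS_NAMESPACE": "acme"}))
print(resolve_registry({}).backend)
```

In a real process the first argument would be `os.environ`; taking a mapping keeps the sketch free of global state.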

from locus_engine import (
    Pipeline, PipelineConfig, PluginRegistry,
    FileConnector, CsvParser, Connector, Parser, SourceRef,
)

# Register each plugin instance under the interface it implements.
registry = PluginRegistry()
registry.register(FileConnector(), Connector)
registry.register(CsvParser(), Parser)

# Build the pipeline from a config dict and the plugin registry.
config = PipelineConfig.load({"source": {"type": "files", "path": "./data"}})
pipeline = Pipeline(config, registry)

# Run the pipeline on one source file, then emit the result as a table.
out = pipeline.run([SourceRef(uri="./data/invoices.csv", kind="file")])
frame = pipeline.emit(out)          # pandas DataFrame with a _lineage column
print(frame)
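Once a table carries lineage, a natural next step is to route low-confidence cells to human review. The sketch below works on plain dict rows and assumes a `_lineage` layout (column name mapped to a `source` string and a `faithfulness` float); that layout and the 0.8 threshold are guesses for illustration, not the format the engine emits.

```python
from typing import Any

Row = dict[str, Any]


def cells_needing_review(rows: list[Row], threshold: float = 0.8) -> list[tuple[int, str, str]]:
    """Return (row index, column, source) for cells scored below `threshold`."""
    flagged = []
    for index, row in enumerate(rows):
        # Assumed layout: {"column": {"source": str, "faithfulness": float}}
        lineage = row.get("_lineage", {})
        for column, info in lineage.items():
            if info["faithfulness"] < threshold:
                flagged.append((index, column, info["source"]))
    return flagged


rows = [
    {
        "vendor": "Acme GmbH",
        "total": "318.50",
        "_lineage": {
            "vendor": {"source": "invoices.csv#row=1", "faithfulness": 0.97},
            "total": {"source": "invoices.csv#row=1", "faithfulness": 0.55},
        },
    },
    {
        "vendor": "Globex",
        "total": "90.00",
        "_lineage": {
            "vendor": {"source": "invoices.csv#row=2", "faithfulness": 0.99},
            "total": {"source": "invoices.csv#row=2", "faithfulness": 0.91},
        },
    },
]
print(cells_needing_review(rows))
```

With a pandas DataFrame, the same loop could run over `frame.to_dict("records")`, provided the lineage column holds a comparable structure.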

Contributing

See CONTRIBUTING.md.

License

MIT © 2026 Dibae101
