Locus — turn any unstructured corpus into validated, source-grounded tabular data. CLI: `locus`.

Locus

Turn any unstructured corpus into validated, source-grounded tabular data — ready to feed an LLM.

Locus packages data operations as reusable, versioned images. You pull an image, point it at your own data, run it locally, and get a clean table where every cell carries its source location and a faithfulness score. Images compose into pipelines, and you can publish your own to Locus Hub (public or private).
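To make "every cell carries its source location and a faithfulness score" concrete, here is a minimal sketch of what such a cell could look like. The `GroundedCell` class, the location string format, and the 0.0 to 1.0 score range are assumptions for illustration, not the actual Locus data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedCell:
    """Hypothetical cell shape: a value plus where it came from."""
    value: str
    source: str          # illustrative location format, e.g. "data.csv#row=1,col=name"
    faithfulness: float  # assumed range 0.0 to 1.0; higher means better supported


def trusted(cells: list[GroundedCell], threshold: float = 0.8) -> list[GroundedCell]:
    """Keep only cells whose faithfulness score meets the threshold."""
    return [c for c in cells if c.faithfulness >= threshold]


cells = [
    GroundedCell("Acme", "data.csv#row=1,col=name", 1.0),
    GroundedCell("100", "data.csv#row=1,col=amount", 0.95),
    GroundedCell("2OO", "scan.pdf#page=3", 0.41),  # low score: likely a bad read
]
kept = trusted(cells)
print([c.value for c in kept])
```

Keeping the location and score on the cell itself, rather than in a side table, is what lets a downstream consumer filter or audit values without access to the original corpus.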

Install

pip install locus-etl                  # core: CLI + engine + CSV/records/provenance
pip install "locus-etl[standard]"      # + PDF, HTML, SQL, normalize, result UI
pip install "locus-etl[all]"           # everything, incl. OCR/LLM/embeddings (heavy)

The CLI command is locus (Python 3.11+). Quote the brackets — your shell treats them as a glob otherwise (zsh: no matches found).

Pick targeted extras if you prefer a lean install: pip install "locus-etl[pdf,serve,llm,oci]". Available extras: pdf (PDF parsing), html, sql, normalize, serve (Hub/result UI), llm (LLM engine), ocr, dedup, embeddings, docling, oci (OCI registry), docker.

Quickstart

locus catalog list                  # see the official image catalog
printf 'name,amount\nAcme,100\nGlobex,200\n' > data.csv
cat > locusfile.yaml <<EOF
image: doc-to-tables
source: { type: files, path: ./data.csv }
EOF
locus run locusfile.yaml --export out.csv   # grounded table + _lineage column
locus run locusfile.yaml --serve            # interactive result view (table + charts)
locus hub                                    # browse the catalog in a local web UI

Declare output and a live visualization right in the Locusfile (Dockerfile-style):

image: doc-to-tables
source: { type: files, path: ./data.csv }
export:
  path: ./out.json        # csv | parquet | json | markdown (inferred from extension)
expose: 8080              # serve table + column charts at http://127.0.0.1:8080
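
The Locusfile comment says the export format is inferred from the file extension. A minimal sketch of that kind of lookup follows; the function name `infer_export_format` and the exact extension spellings (for example `.md` for markdown) are assumptions, and the real runtime may resolve formats differently.

```python
from pathlib import Path

# Formats named in the Locusfile comment; the extension spellings are assumptions.
_EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".json": "json",
    ".md": "markdown",
    ".markdown": "markdown",
}


def infer_export_format(path: str) -> str:
    """Map an export path to a format name based on its file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return _EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"cannot infer export format from {path!r}") from None


print(infer_export_format("./out.json"))
```

Failing loudly on an unknown extension, instead of silently defaulting to CSV, keeps a typo in `export.path` from producing a file in the wrong format.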

Supported inputs

Out of the box (no extra dependencies): CSV/TSV, JSON, Markdown (pipe tables), HTML, plain text and common source files, DOCX, PPTX, XLSX, ODT, EPUB, and ZIP archives (members parsed and merged). PDF text is available with the pdf extra; images need OCR (roadmap). Documents with tables yield those tables; documents that are prose yield a grounded element | text | location table so any readable file still produces a presentable result.
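
The prose fallback described above can be pictured with a small sketch that splits plain text into paragraphs and records where each one came from. This is an illustration only: `prose_to_rows` and the `source:Lstart-Lend` location format are hypothetical, not the engine's actual parser or location scheme.

```python
def prose_to_rows(text: str, source: str) -> list[dict[str, str]]:
    """Split prose into paragraphs and record each one's line range."""
    rows: list[dict[str, str]] = []
    start: int | None = None
    buffer: list[str] = []
    # A trailing empty line flushes the last paragraph.
    for number, line in enumerate(text.splitlines() + [""], start=1):
        if line.strip():
            if start is None:
                start = number
            buffer.append(line.strip())
        elif buffer:
            rows.append({
                "element": "paragraph",
                "text": " ".join(buffer),
                "location": f"{source}:L{start}-L{number - 1}",
            })
            start, buffer = None, []
    return rows


sample = "First line\nsecond line\n\nNext paragraph\n"
for row in prose_to_rows(sample, "notes.txt"):
    print(row["element"], "|", row["text"], "|", row["location"])
```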

Architecture (layered)

Locus layered architecture

The diagram is generated from docs/generate_architecture_diagram.py (PNG + SVG in docs/).

Layer summary

| Layer | Spec | Responsibility |
| --- | --- | --- |
| Layer 1: Engine | unstructured-to-tabular-etl | Raw corpus -> validated, source-grounded table. Connectors, parsing, extraction, cleaning, the cell-level grounding/faithfulness contract, review. Embedded inside every image. |
| Layer 2: Runtime | locus-image-runtime | Packaging, CLI, Locusfile, image pull, multi-image composition (DAG), typed stage interchange, cross-stage provenance, serve/export, and publishing to Locus Hub. |

Key properties

  • Local-first. Default runtime is a plain Python process — no daemon, no Linux VM. Docker is an optional backend.
  • Privacy is explicit. Deterministic engine keeps data local; the LLM engine activates only when you add a key, with a consent notice before any data leaves.
  • Trust travels with the data. Provenance and faithfulness survive every pipeline stage, from extraction through merge and redaction.

Documentation

Detailed design lives in the spec documents named in the layer summary: unstructured-to-tabular-etl (engine) and locus-image-runtime (runtime).

Status

Layer 1 engine: feature-complete. Layer 2 runtime: feature-complete (all 12 build stages done).

The engine turns a raw corpus into a validated, source-grounded table with cell-level provenance. It includes a deterministic default engine and an opt-in, guardrailed LLM engine; cleaning/dedup; human-in-the-loop review; file/HTTP/REST/SQL connectors; CSV/PDF/HTML/records parsers; and DataFrame/Parquet/SQL emitters.

The locus CLI runs single images and multi-stage pipelines (typed DAG with static type-check and cross-stage provenance), builds/publishes/pulls images via a local registry, and serves a local result UI with the provenance viewer.

208 tests; CI runs on Python 3.11/3.12 (ruff + mypy strict + pytest).

pip install locus-etl        # CLI command is `locus`; extras: [pdf] [llm] [serve] [oci] ...
locus init                   # gitignore .env
locus catalog list           # see the official image catalog
locus run locusfile.yaml     # run a pipeline, get a grounded table
locus run locusfile.yaml --serve --port 8080   # preview UI with provenance
locus hub                    # browse the image catalog in a local web UI
locus build / push / pull / search / inspect   # image lifecycle

Remaining work is the official image catalog and a hub-side discovery index. The OCI/Harbor registry backend is implemented (OrasImageStore): set LOCUS_REGISTRY (and optionally LOCUS_NAMESPACE) to push/pull/inspect against Harbor, GHCR, ECR, or any OCI registry; otherwise a local filesystem registry is the zero-config default.

from locus_engine import (
    Pipeline, PipelineConfig, PluginRegistry,
    FileConnector, CsvParser, Connector, Parser, SourceRef,
)

registry = PluginRegistry()
registry.register(FileConnector(), Connector)
registry.register(CsvParser(), Parser)

config = PipelineConfig.load({"source": {"type": "files", "path": "./data"}})
pipeline = Pipeline(config, registry)

out = pipeline.run([SourceRef(uri="./data/invoices.csv", kind="file")])
frame = pipeline.emit(out)          # pandas DataFrame with a _lineage column
print(frame)

Contributing

See CONTRIBUTING.md.

License

MIT © 2026 Dibae101
