Locus — turn any unstructured corpus into validated, source-grounded tabular data. CLI: `locus`.
Locus
Turn any unstructured corpus into validated, source-grounded tabular data — ready to feed an LLM.
Locus packages data operations as reusable, versioned images. You pull an image, point it at your own data, run it locally, and get a clean table where every cell carries its source location and a faithfulness score. Images compose into pipelines, and you can publish your own to Locus Hub (public or private).
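To make the "every cell carries its source location and a faithfulness score" idea concrete, here is a minimal sketch of what such a cell could look like. The `GroundedCell` class, its field names, the locator strings, and the 0.0 to 1.0 score range are all assumptions made for illustration; they are not the locus-etl data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedCell:
    """Hypothetical model of one table cell; not the locus-etl API."""
    value: str
    source_uri: str      # document the value was read from
    locator: str         # position inside that document (format is made up here)
    faithfulness: float  # assumed range 0.0-1.0; higher means closer to the source


def low_confidence(cells: list[GroundedCell], threshold: float = 0.8) -> list[GroundedCell]:
    """Return the cells whose faithfulness falls below the threshold."""
    return [c for c in cells if c.faithfulness < threshold]


cells = [
    GroundedCell("Acme", "data.csv", "row=1,col=name", 1.0),
    GroundedCell("100", "data.csv", "row=1,col=amount", 0.65),
]
flagged = low_confidence(cells)
print([c.locator for c in flagged])
```

Keeping the location and score next to the value means a reviewer can filter for weak cells and jump straight to the source passage that produced them.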
Install
pip install locus-etl # core: CLI + engine + CSV/records/provenance
pip install "locus-etl[standard]" # + PDF, HTML, SQL, normalize, result UI
pip install "locus-etl[all]" # everything, incl. OCR/LLM/embeddings (heavy)
The CLI command is locus (Python 3.11+). Quote the brackets — your shell treats them
as a glob otherwise (zsh: no matches found).
Pick targeted extras if you prefer a lean install:
pip install "locus-etl[pdf,serve,llm,oci]"
Available extras: pdf (PDF parsing), html, sql, normalize, serve (Hub/result UI), llm (LLM engine), ocr, dedup, embeddings, docling, oci (OCI registry), docker.
Quickstart
locus catalog list # see the official image catalog
printf 'name,amount\nAcme,100\nGlobex,200\n' > data.csv
cat > locusfile.yaml <<EOF
image: doc-to-tables
source: { type: files, path: ./data.csv }
EOF
locus run locusfile.yaml --export out.csv # grounded table + _lineage column
locus run locusfile.yaml --serve # preview UI with per-cell provenance
locus hub # browse the catalog in a local web UI
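Once a table has been exported, the `_lineage` column can be separated from the data columns with the standard library alone. The sketch below uses an in-memory stand-in for `out.csv`; the lineage values are placeholders, because the real encoding of that column is not described here and is treated as an opaque string.

```python
import csv
import io

# Stand-in for the contents of out.csv. The _lineage values are placeholders;
# the actual format written by `locus run --export` is not assumed.
exported = io.StringIO(
    "name,amount,_lineage\n"
    "Acme,100,<lineage for row 1>\n"
    "Globex,200,<lineage for row 2>\n"
)

rows = []     # data columns only
lineage = []  # opaque lineage string per row, in the same order as rows
for record in csv.DictReader(exported):
    lineage.append(record.pop("_lineage"))
    rows.append(record)

print(rows)
print(lineage)
```

To run this against a real export, replace the `io.StringIO(...)` object with `open("out.csv", newline="")`.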
Architecture (layered)
The diagram is generated from docs/generate_architecture_diagram.py (PNG + SVG in docs/).
Layer summary
| Layer | Spec | Responsibility |
|---|---|---|
| Layer 1: Engine | unstructured-to-tabular-etl | Raw corpus -> validated, source-grounded table. Connectors, parsing, extraction, cleaning, the cell-level grounding/faithfulness contract, review. Embedded inside every image. |
| Layer 2: Runtime | locus-image-runtime | Packaging, CLI, Locusfile, image pull, multi-image composition (DAG), typed stage interchange, cross-stage provenance, serve/export, and publishing to Locus Hub. |
Key properties
- Local-first. Default runtime is a plain Python process — no daemon, no Linux VM. Docker is an optional backend.
- Privacy is explicit. Deterministic engine keeps data local; the LLM engine activates only when you add a key, with a consent notice before any data leaves.
- Trust travels with the data. Provenance and faithfulness survive every pipeline stage, from extraction through merge and redaction.
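The opt-in rule from the privacy property above (LLM engine only with a key, and only after consent) can be sketched as a small decision function. The environment variable name `LOCUS_LLM_API_KEY` and the `choose_engine` function are hypothetical, invented for this illustration; the real configuration names may differ.

```python
import os


def choose_engine(env: dict[str, str], consent_given: bool) -> str:
    """Pick an engine name following the opt-in rule.

    LOCUS_LLM_API_KEY is a hypothetical variable name used for illustration.
    """
    has_key = bool(env.get("LOCUS_LLM_API_KEY"))
    if has_key and consent_given:
        return "llm"
    # No key, or consent not given: stay on the local deterministic path.
    return "deterministic"


print(choose_engine(dict(os.environ), consent_given=False))
```

Both conditions must hold before the LLM path is taken, so a key left in the environment never sends data out on its own.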
Documentation
Detailed design lives in the spec documents:
- Layer 1 engine: .kiro/specs/unstructured-to-tabular-etl/requirements.md
- Layer 2 runtime: .kiro/specs/locus-image-runtime/requirements.md
- Image catalog (planned images + build order): .kiro/specs/locus-image-runtime/image-catalog.md
- Architecture & decision log: .kiro/specs/unstructured-to-tabular-etl/architecture-notes.md
Status
Layer 1 engine: feature-complete. Layer 2 runtime: feature-complete (all 12 build stages done).
The engine turns a raw corpus into a validated, source-grounded table with cell-level provenance. It ships a deterministic default engine and an opt-in, guardrailed LLM engine, plus cleaning/dedup and human-in-the-loop review. Connectors cover file, HTTP, REST, and SQL sources; parsers cover CSV, PDF, HTML, and records; emitters cover DataFrame, Parquet, and SQL.
The locus CLI runs single images and multi-stage pipelines (typed DAG with static type-check and cross-stage provenance), builds, publishes, and pulls images via a local registry, and serves a local result UI with the provenance viewer.
208 tests, CI on Python 3.11/3.12 (ruff + mypy strict + pytest).
pip install locus-etl # CLI command is `locus`; extras: [pdf] [llm] [serve] [oci] ...
locus init # gitignore .env
locus catalog list # see the official image catalog
locus run locusfile.yaml # run a pipeline, get a grounded table
locus run locusfile.yaml --serve --port 8080 # preview UI with provenance
locus hub # browse the image catalog in a local web UI
locus build / push / pull / search / inspect # image lifecycle
Remaining work is the official image catalog and a hub-side discovery index. The
OCI/Harbor registry backend is implemented (OrasImageStore): set LOCUS_REGISTRY
(and optionally LOCUS_NAMESPACE) to push/pull/inspect against Harbor, GHCR, ECR, or any
OCI registry; otherwise a local filesystem registry is the zero-config default.
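The backend choice described above (OCI when `LOCUS_REGISTRY` is set, local filesystem otherwise) could be resolved roughly as follows. This is a sketch, not the runtime's code: `RegistryChoice`, `resolve_registry`, the default local directory, and the way the namespace is appended to the registry are all assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryChoice:
    backend: str   # "oci" or "local"
    location: str  # registry reference, or a local directory


def resolve_registry(env: dict[str, str],
                     local_root: str = "./.locus-registry") -> RegistryChoice:
    """Choose a registry backend from environment variables.

    local_root is a hypothetical default path, used only for this sketch.
    """
    registry = env.get("LOCUS_REGISTRY", "").strip().rstrip("/")
    if not registry:
        # Zero-config default: a registry on the local filesystem.
        return RegistryChoice("local", local_root)
    namespace = env.get("LOCUS_NAMESPACE", "").strip().strip("/")
    # Assumption: the optional namespace is appended as a path segment.
    location = f"{registry}/{namespace}" if namespace else registry
    return RegistryChoice("oci", location)


print(resolve_registry(dict(os.environ)))
```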
from locus_engine import (
Pipeline, PipelineConfig, PluginRegistry,
FileConnector, CsvParser, Connector, Parser, SourceRef,
)
registry = PluginRegistry()
registry.register(FileConnector(), Connector)
registry.register(CsvParser(), Parser)
config = PipelineConfig.load({"source": {"type": "files", "path": "./data"}})
pipeline = Pipeline(config, registry)
out = pipeline.run([SourceRef(uri="./data/invoices.csv", kind="file")])
frame = pipeline.emit(out) # pandas DataFrame with a _lineage column
print(frame)
Contributing
See CONTRIBUTING.md.
License
MIT © 2026 Dibae101