Data processing library built on top of Ibis and DataFusion to write multi-engine data workflows.
The Problem
You write a feature pipeline. It works on your laptop with DuckDB. Deploying it to Snowflake means a rewrite. Intermediate results should be cached, so you add infrastructure and a result-naming scheme. A requirement to track pipeline changes arrives, so you add a metadata store. Congrats, you're going to production! Time to add a serving layer ...
Six months later: five tools that don't talk to each other, and a pipeline only one person understands.
| Pain | Symptom |
|---|---|
| Glue code everywhere | Each engine is a silo. Moving between them means rewriting, not composing. |
| Runtime-only feedback | Imperative Python where you only find out something fails while the job is running. |
| Unnecessary recomputations | No shared understanding of what changed. Everything runs from scratch. |
| Opaque lineage | Feature logic, metadata, and lineage live in different systems. Debugging means archaeology. |
| "Works on my machine" | Environments drift. Reproducing results means reverse engineering someone's setup and interrogating your own. |
| Stateful orchestrators | Retry logic, task states, failure recovery. Another system to manage, another thing that breaks. |
Feature stores, model registries, orchestrators: vertical silos that don't serve agentic processes, which need context and skills, not categories.
Xorq
Manifest = Context. Every ML computation becomes a structured, input-addressed YAML manifest.
Exprs = Tools. A catalog to discover them, and a build system to execute them deterministically anywhere, with user-directed caching.
Templates = Skills. Ready-made starting points, e.g. a scikit-learn pipeline, feature stores, and semantic layers.
```shell
$ pip install "xorq[examples]"
$ xorq init -t penguins
```
The Expression
Write declarative Ibis expressions that run like tools. Xorq extends Ibis with caching, multi-engine execution, and UDFs.
```python
import ibis
import xorq.api as xo
from xorq.common.utils.ibis_utils import from_ibis
from xorq.caching import ParquetCache

# Standard Ibis: filter, group, aggregate
penguins = ibis.examples.penguins.fetch()
penguins_agg = (
    penguins
    .filter(ibis._.species.notnull())
    .group_by("species")
    .agg(avg_bill_length=ibis._.bill_length_mm.mean())
)

# Lift into xorq and declare a cache point
expr = (
    from_ibis(penguins_agg)
    .cache(ParquetCache.from_kwargs())
)
```
Declare `.cache()` on any node; xorq handles the rest. No cache keys to generate or manage, no invalidation logic to write.
Compose across engines
One expression, many engines: part of your pipeline runs on DuckDB, part on xorq's embedded DataFusion engine, and UDFs run over Arrow Flight. Xorq handles the data transit between them with low overhead. Bye bye, glue code.
```python
expr = from_ibis(penguins).into_backend(xo.sqlite.connect())

expr.ls.backends
# (<xorq.backends.sqlite.Backend at 0x7926a815caa0>,
#  <xorq.backends.duckdb.Backend at 0x7926b409faa0>)
```
Expressions are tools, Arrow is the pipe
Unix gave us small programs that compose via stdout. Xorq gives you expressions that compose via Arrow.
```python
In [6]: expr.to_pyarrow_batches()
Out[6]: <pyarrow.lib.RecordBatchReader at 0x15dc3f570>
```
The Manifest
Build an expression, get a manifest.
```shell
$ xorq build expr.py
builds/28ecab08754e

$ tree builds/28ecab08754e
builds/28ecab08754e
├── database_tables
│   └── f2ac274df56894cb1505bfe8cb03940e.parquet
├── expr.yaml
├── metadata.json
└── profiles.yaml
```
No external metadata store. No separate lineage tool. The build directory is the versioned, cached, portable artifact.
```yaml
# Input-addressed, composable, portable
# Abridged expr.yaml
nodes:
  '@read_31f0a5be3771':
    op: Read
    name: penguins
    source: builds/28ecab08754e/.../f2ac274df56894cb1505bfe8cb03940e.parquet
  '@filter_23e7692b7128':
    op: Filter
    parent: '@read_31f0a5be3771'
    predicates:
      - NotNull(species)
  '@remotetable_9a92039564d4':
    op: RemoteTable
    remote_expr:
      op: Aggregate
      parent: '@filter_23e7692b7128'
      by: [species]
      metrics:
        avg_bill_length: Mean(bill_length_mm)
  '@cachednode_e7b5fd7cd0a9':
    op: CachedNode
    parent: '@remotetable_9a92039564d4'
    cache:
      type: ParquetCache
      path: parquet
```
Reproducible builds
The manifest is roundtrippable and machine-writeable. Git-diff your pipelines. Code-review your features. Track Python dependencies. Rebuild from YAML alone.
```shell
$ xorq uv-build expr.py
builds/28ecab08754e/

$ ls builds/28ecab08754e/*.tar.gz
builds/28ecab08754e/sdist.tar.gz builds/28ecab08754e/my-pipeline-0.1.0.tar.gz
```
The build captures everything: expression graph, dependencies, memory tables. Share the build, including its sdist, and get identical results. No "works on my machine."
Only recompute what changed
The manifest is input-addressed: same inputs = same hash. Change an input, get a new hash.
```python
expr.ls.get_cache_paths()
# (PosixPath('/home/user/.cache/xorq/parquet/letsql_cache-7c3df7ccce5ed4b64c02fbf8af462e70.parquet'),)
```
The hash is the cache key. No invalidation logic to debug. If the expression is the same, the hash is the same, and the cache is valid. Change an input, get a new hash, trigger recomputation.
Traditional caching asks "has this expired?" Input-addressed caching asks "is this the same computation?" The second question has a deterministic answer.
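As a minimal illustration of the idea (a hypothetical helper, not xorq's actual hashing, which covers the full expression graph): derive each node's key deterministically from the operation, its parameters, and the keys of its inputs.

```python
import hashlib
import json

def cache_key(op: str, params: dict, input_keys: list) -> str:
    """Deterministic key from an operation, its parameters, and the
    keys of its inputs -- same inputs, same key."""
    payload = json.dumps(
        {"op": op, "params": params, "inputs": input_keys},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

read = cache_key("Read", {"name": "penguins"}, [])
agg = cache_key("Aggregate", {"by": "species"}, [read])
agg_again = cache_key("Aggregate", {"by": "species"}, [read])
agg_changed = cache_key("Aggregate", {"by": "island"}, [read])

assert agg == agg_again      # identical computation -> cache hit
assert agg != agg_changed    # changed input -> new key -> recompute
```

There is no expiry question to answer: a key is valid for exactly as long as the computation it names is unchanged.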
The Tools
The manifest provides context. The tools provide skills: catalog, introspect, serve, execute.
Catalog
```shell
# Add to catalog
$ xorq catalog add builds/28ecab08754e/ --alias penguins-agg
Added build 28ecab08754e as entry a498016e-5bea-4036-aec0-a6393d1b7c0f revision r1

# List entries
$ xorq catalog ls
Aliases:
  penguins-agg  a498016e-5bea-4036-aec0-a6393d1b7c0f  r1
Entries:
  a498016e-5bea-4036-aec0-a6393d1b7c0f  r1  28ecab08754e
```
Run
```shell
$ xorq run builds/28ecab08754e -o out.parquet
```
Serve
Serve expressions anywhere via Arrow Flight:
```shell
$ xorq serve-unbound builds/28ecab08754e/ \
    --to_unbind_hash 31f0a5be37713fe2c1a2d8ad8fdea69f \
    --host localhost --port 9002
```
```python
import xorq.api as xo

backend = xo.flight.connect(host="localhost", port=9002)
f = backend.get_exchange("default")

data = {
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "bill_length_mm": [39.1, 47.5, 49.0],
    "bill_depth_mm": [18.7, 14.2, 18.5],
    "flipper_length_mm": [181, 217, 195],
    "body_mass_g": [3750, 5500, 4200],
    "sex": ["male", "female", "male"],
    "year": [2007, 2008, 2009],
}
xo.memtable(data).pipe(f).execute()
#      species  avg_bill_length
# 0     Adelie             39.1
# 1  Chinstrap             49.0
# 2     Gentoo             47.5
```
Debug with confidence
No more archaeology. Lineage is encoded in the manifest, not scattered across tools, and is queryable from the CLI.
```shell
$ xorq lineage penguins-agg
Lineage for column 'avg_bill_length':
Field:avg_bill_length #1
└── Cache xorq_cached_node_name_placeholder #2
    └── RemoteTable:236af67d399a4caaf17e0bf5e1ac4c0f #3
        └── Aggregate #4
            ├── Filter #5
            │   ├── Read #6
            │   └── NotNull #7
            │       └── Field:species #8
            │           └── ↻ see #6
            ├── Field:species #9
            │   └── ↻ see #5
            └── Mean #10
                └── Field:bill_length_mm #11
                    └── ↻ see #5
```
Workflows, without state
No task states. Just retry on failure.
Xorq executes expressions as Arrow RecordBatch streams. There's no DAG of tasks to checkpoint, just data flowing through operators. If something fails, rerun from the manifest. Cached nodes resolve instantly; the rest recomputes.
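The retry story can be sketched in plain Python (hypothetical names, not xorq's API): results are keyed by input-addressed hashes, so a rerun is just a re-walk of the graph in which cache hits resolve instantly and misses recompute.

```python
cache = {}  # hash -> result; stands in for the on-disk Parquet cache

def run(node_hash, compute, inputs=()):
    """Re-walk the graph from the manifest: cached nodes resolve
    instantly, everything else recomputes."""
    if node_hash in cache:
        return cache[node_hash]
    result = compute(*[run(*spec) for spec in inputs])
    cache[node_hash] = result
    return result

# First run computes both nodes; the "retry" reuses them from cache.
first = run("agg123", sum, inputs=[("read456", lambda: [1, 2, 3])])
retry = run("agg123", sum, inputs=[("read456", lambda: [1, 2, 3])])
assert first == retry == 6
```

No task states to persist, no checkpoints to reconcile: the manifest plus the cache is the recovery mechanism.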
Scikit-learn Integration
Xorq translates scikit-learn Pipeline objects to deferred expressions:
```python
from xorq.expr.ml.pipeline_lib import Pipeline

sklearn_pipeline = ...
xorq_pipeline = Pipeline.from_instance(sklearn_pipeline)
```
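For context, `from_instance` takes an ordinary `sklearn.pipeline.Pipeline`. The sketch below builds one with standard scikit-learn on tiny illustrative data; the xorq conversion line is left as a comment since it mirrors the call shown above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline as SkPipeline
from sklearn.preprocessing import StandardScaler

# An ordinary scikit-learn pipeline: scale features, then classify.
sklearn_pipeline = SkPipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
y = [0, 0, 1, 1]
sklearn_pipeline.fit(X, y)
preds = sklearn_pipeline.predict(X)

# The same object can then be handed to xorq for deferred execution:
# from xorq.expr.ml.pipeline_lib import Pipeline
# xorq_pipeline = Pipeline.from_instance(sklearn_pipeline)
```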
Templates
Ready-to-start code as skills:
```shell
$ xorq init -t <template>
```
| Template | Description |
|---|---|
| penguins | Minimal example: caching, aggregation, multi-engine |
| sklearn | Classification pipeline with train/predict separation |
Skills for humans
Templates are easy starting points: ready-made expressions you compose with your own sources.
Coming Soon
- feast — Feature store integration
- boring-semantic-layer — Metrics and dimensions catalog
- dbt — dbt model composition
- Feature Selection
The Horizontal Stack
Write in Python. Catalog as YAML. Compose anywhere via Ibis. Portable compute engine built on DataFusion. Universal UDFs via Arrow Flight.
Lineage, caching, and versioning travel with the manifest; cataloged, not locked in a vendor's database.
Integrations: Ibis • scikit-learn • Feast (WIP) • dbt (upcoming)
Learn More
Pre-1.0. Expect breaking changes with migration guides.