
Data processing library built on top of Ibis and DataFusion to write multi-engine data workflows.


A compute manifest and composable tools for ML.

Documentation • Website


The Problem

You write a feature pipeline. It works on your laptop with DuckDB. Deploying it to Snowflake means a rewrite. Intermediate results should be cached, so you add infrastructure and a result-naming scheme. A requirement to track pipeline changes comes in, so you add a metadata store. Congrats, you're going to production! Time to add a serving layer ...

Six months later: five tools that don't talk to each other, and a pipeline only one person understands.

How the pain shows up:

  • Glue code everywhere: each engine is a silo, so moving between them means rewriting, not composing.
  • Runtime-only feedback: imperative Python where you can only tell whether something will fail by running the job.
  • Unnecessary recomputation: no shared understanding of what changed, so everything runs from scratch.
  • Opaque lineage: feature logic, metadata, and lineage all live in different systems, so debugging means archaeology.
  • "Works on my machine": environments drift, so reproducing results means reverse-engineering someone else's setup and interrogating your own.
  • Stateful orchestrators: retry logic, task states, failure recovery; another system to manage, another thing that breaks.

Feature stores, model registries, orchestrators: vertical silos that don't serve agentic processes, which need context and skills, not categories.

Xorq


Manifest = Context. Every ML computation becomes a structured, input-addressed YAML manifest.

Exprs = Tools. A catalog to discover them, and a build system to execute them deterministically anywhere, with user-directed caching.

Templates = Skills. Ready-made skills to get started, e.g. scikit-learn pipelines, feature stores, semantic layers.

$ pip install xorq[examples]
$ xorq init -t penguins

The Expression

Write declarative Ibis expressions that can be run like a tool. Xorq extends Ibis with caching, multi-engine execution, and UDFs.

import ibis
import xorq.api as xo
from xorq.common.utils.ibis_utils import from_ibis
from xorq.caching import ParquetCache

penguins = ibis.examples.penguins.fetch()

penguins_agg = (
    penguins
    .filter(ibis._.species.notnull())
    .group_by("species")
    .agg(avg_bill_length=ibis._.bill_length_mm.mean())
)

expr = (
    from_ibis(penguins_agg)
    .cache(ParquetCache.from_kwargs())
)

Declare .cache() on any node. Xorq handles the rest. No cache keys to generate or manage, no invalidation logic to write.

Compose across engines

One expression, many engines. Part of your pipeline runs on DuckDB, part on Xorq's embedded DataFusion engine, and UDFs run via Arrow Flight. Xorq handles data transit between engines with low overhead. Bye bye, glue code.

expr = from_ibis(penguins).into_backend(xo.sqlite.connect())
expr.ls.backends
(<xorq.backends.sqlite.Backend at 0x7926a815caa0>,
 <xorq.backends.duckdb.Backend at 0x7926b409faa0>)

Expressions are tools, Arrow is the pipe

Unix gave us small programs that compose via stdout. Xorq gives you expressions that compose via Arrow.

In [6]: expr.to_pyarrow_batches()
Out[6]: <pyarrow.lib.RecordBatchReader at 0x15dc3f570>

The Manifest

Build an expression, get a manifest.

$ xorq build expr.py
builds/28ecab08754e
$ tree builds/28ecab08754e
builds/28ecab08754e
├── database_tables
│   └── f2ac274df56894cb1505bfe8cb03940e.parquet
├── expr.yaml
├── metadata.json
└── profiles.yaml

No external metadata store. No separate lineage tool. The build directory is the versioned, cached, portable artifact.

# Input-addressed, composable, portable
# Abridged expr.yaml
nodes:
  '@read_31f0a5be3771':
    op: Read
    name: penguins
    source: builds/28ecab08754e/.../f2ac274df56894cb1505bfe8cb03940e.parquet

  '@filter_23e7692b7128':
    op: Filter
    parent: '@read_31f0a5be3771'
    predicates:
      - NotNull(species)

  '@remotetable_9a92039564d4':
    op: RemoteTable
    remote_expr:
      op: Aggregate
      parent: '@filter_23e7692b7128'
      by: [species]
      metrics:
        avg_bill_length: Mean(bill_length_mm)

  '@cachednode_e7b5fd7cd0a9':
    op: CachedNode
    parent: '@remotetable_9a92039564d4'
    cache:
      type: ParquetCache
      path: parquet
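Because the manifest is plain YAML, tools (or agents) can walk the node graph directly. A minimal sketch, with plain dicts standing in for the parsed `expr.yaml` above:

```python
# Plain dicts standing in for the parsed expr.yaml (abridged).
nodes = {
    "@read_31f0a5be3771": {"op": "Read", "parent": None},
    "@filter_23e7692b7128": {"op": "Filter", "parent": "@read_31f0a5be3771"},
    "@remotetable_9a92039564d4": {"op": "RemoteTable", "parent": "@filter_23e7692b7128"},
    "@cachednode_e7b5fd7cd0a9": {"op": "CachedNode", "parent": "@remotetable_9a92039564d4"},
}

def ancestry(node_id):
    # Follow parent links from a node back to its source.
    while node_id is not None:
        yield nodes[node_id]["op"]
        node_id = nodes[node_id]["parent"]

chain = list(ancestry("@cachednode_e7b5fd7cd0a9"))
# chain == ["CachedNode", "RemoteTable", "Filter", "Read"]
```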

Reproducible builds

The manifest is roundtrippable and machine-writeable. Git-diff your pipelines. Code-review your features. Track Python dependencies. Rebuild from the YAML alone.

$ xorq uv-build expr.py
builds/28ecab08754e/

$ ls builds/28ecab08754e/*.tar.gz
builds/28ecab08754e/sdist.tar.gz  builds/28ecab08754e/my-pipeline-0.1.0.tar.gz

The build captures everything: the expression graph, dependencies, and in-memory tables. Share the build (it includes the sdist) and you get identical results. No "works on my machine."

Only recompute what changed

The manifest is input-addressed: same inputs = same hash. Change an input, get a new hash.

expr.ls.get_cache_paths()
(PosixPath('/home/user/.cache/xorq/parquet/letsql_cache-7c3df7ccce5ed4b64c02fbf8af462e70.parquet'),)

The hash is the cache key. No invalidation logic to debug. If the expression is the same, the hash is the same, and the cache is valid. Change an input, get a new hash, trigger recomputation.

Traditional caching asks "has this expired?" Input-addressed caching asks "is this the same computation?" The second question has a deterministic answer.
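This is not xorq's actual hashing scheme, but the principle fits in a few lines: canonicalize the expression, hash it, and use the hash as the cache key.

```python
import hashlib
import json

def expr_key(node):
    # Canonicalize the expression tree, then hash it: same inputs, same key.
    canonical = json.dumps(node, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# A toy expression tree (illustrative, not xorq's internal representation).
agg = {"op": "Aggregate", "by": ["species"],
       "metrics": {"avg_bill_length": "Mean(bill_length_mm)"},
       "parent": {"op": "Read", "source": "penguins.parquet"}}

k1 = expr_key(agg)
k2 = expr_key(json.loads(json.dumps(agg)))   # structurally identical copy
changed = expr_key({**agg, "by": ["island"]})  # change an input

assert k1 == k2       # same computation, same key: cache hit
assert changed != k1  # new input, new key: recomputation
```

No expiry clocks, no invalidation callbacks: equality of keys is equality of computations.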


The Tools

The manifest provides context. The tools provide skills: catalog, introspect, serve, execute.

Catalog

# Add to catalog
$ xorq catalog add builds/28ecab08754e/ --alias penguins-agg
Added build 28ecab08754e as entry a498016e-5bea-4036-aec0-a6393d1b7c0f revision r1

# List entries
$ xorq catalog ls
Aliases:
penguins-agg    a498016e-5bea-4036-aec0-a6393d1b7c0f    r1
Entries:
a498016e-5bea-4036-aec0-a6393d1b7c0f    r1      28ecab08754e

Run

$ xorq run builds/28ecab08754e -o out.parquet

Serve

Serve expressions anywhere via Arrow Flight:

$ xorq serve-unbound builds/28ecab08754e/ \
  --to_unbind_hash 31f0a5be37713fe2c1a2d8ad8fdea69f \
  --host localhost --port 9002

import xorq.api as xo

backend = xo.flight.connect(host="localhost", port=9002)
f = backend.get_exchange("default")

data = {
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "bill_length_mm": [39.1, 47.5, 49.0],
    "bill_depth_mm": [18.7, 14.2, 18.5],
    "flipper_length_mm": [181, 217, 195],
    "body_mass_g": [3750, 5500, 4200],
    "sex": ["male", "female", "male"],
    "year": [2007, 2008, 2009],
}

xo.memtable(data).pipe(f).execute()
     species  avg_bill_length
0     Adelie             39.1
1  Chinstrap             49.0
2     Gentoo             47.5

Debug with confidence

No more archaeology. Lineage is encoded in the manifest—not scattered across tools—and queryable from the CLI.

$ xorq lineage penguins-agg

Lineage for column 'avg_bill_length':
Field:avg_bill_length #1
└── Cache xorq_cached_node_name_placeholder #2
    └── RemoteTable:236af67d399a4caaf17e0bf5e1ac4c0f #3
        └── Aggregate #4
            ├── Filter #5
            │   ├── Read #6
            │   └── NotNull #7
            │       └── Field:species #8
            │           └── see #6
            ├── Field:species #9
            │   └── see #5
            └── Mean #10
                └── Field:bill_length_mm #11
                    └── see #5

Workflows, without state

No task states. Just retry on failure.

Xorq executes expressions as Arrow RecordBatch streams. There's no DAG of tasks to checkpoint, just data flowing through operators. If something fails, rerun from the manifest. Cached nodes resolve instantly; the rest recomputes.
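The stateless model can be pictured with plain generators: operators are functions over a stream of batches, and recovery is just calling the pipeline again (an illustrative sketch; the names and dict-based "batches" are stand-ins, not xorq internals):

```python
def read_batches():
    # Source operator: yields batches (dicts stand in for Arrow RecordBatches).
    yield {"species": ["Adelie", None], "bill_length_mm": [39.1, 47.5]}

def filter_not_null(batches, column):
    # Streaming operator: transforms each batch as it flows through.
    for b in batches:
        keep = [i for i, v in enumerate(b[column]) if v is not None]
        yield {k: [v[i] for i in keep] for k, v in b.items()}

# Rerunning after a failure is just re-invoking the pipeline: there is no
# task state to recover, only the same stream recomputed (or served from cache).
out = list(filter_not_null(read_batches(), "species"))
```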

Scikit-learn Integration

Xorq translates scikit-learn Pipeline objects to deferred expressions:

from xorq.expr.ml.pipeline_lib import Pipeline

sklearn_pipeline = ...
xorq_pipeline = Pipeline.from_instance(sklearn_pipeline)
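The elided `sklearn_pipeline` can be any ordinary scikit-learn Pipeline; for instance (the estimator choices here are illustrative, not from the xorq docs):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Any ordinary scikit-learn pipeline; the steps are arbitrary examples.
sklearn_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
```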

Templates

Ready-to-start code as skills:

$ xorq init -t <template>
  • penguins: minimal example with caching, aggregation, and multi-engine execution
  • sklearn: classification pipeline with train/predict separation

Skills for humans

Templates are quick-start components: ready-made expressions you can compose with your own sources.

Coming Soon

  • feast — Feature store integration
  • boring-semantic-layer — Metrics and dimensions catalog
  • dbt — dbt model composition
  • Feature Selection

The Horizontal Stack

Write in Python. Catalog as YAML. Compose anywhere via Ibis. Portable compute engine built on DataFusion. Universal UDFs via Arrow Flight.


Lineage, caching, and versioning travel with the manifest; cataloged, not locked in a vendor's database.

Integrations: Ibis • scikit-learn • Feast (WIP) • dbt (upcoming)


Learn More


Pre-1.0. Expect breaking changes with migration guides.
