Fast multi-backend (DuckDB / DataFusion) dataset HTTP server.

datap-rs

██████╗  █████╗ ████████╗ █████╗ ██████╗       ██████╗ ███████╗
██╔══██╗██╔══██╗╚══██╔══╝██╔══██╗██╔══██╗      ██╔══██╗██╔════╝
██║  ██║███████║   ██║   ███████║██████╔╝█████╗██████╔╝███████╗
██║  ██║██╔══██║   ██║   ██╔══██║██╔═══╝ ╚════╝██╔══██╗╚════██║
██████╔╝██║  ██║   ██║   ██║  ██║██║           ██║  ██║███████║
╚═════╝ ╚═╝  ╚═╝   ╚═╝   ╚═╝  ╚═╝╚═╝           ╚═╝  ╚═╝╚══════╝

A fast multi-backend dataset HTTP server, built in Rust and driven from Python.

datap-rs (datapress) exposes one or more Parquet or Delta datasets over a small JSON HTTP API. It ships with two pluggable engines bundled into a single wheel — pick one at runtime:

  • DuckDB — battle-tested SQL, lazy parquet reads, low startup.
  • DataFusion — pure-Rust, in-memory RecordBatch + equality index for low-latency point lookups.

Identical request/response shapes across both, so you can A/B them under your real workload.


Install

pip install datap-rs
# or
uv pip install datap-rs

Wheels are published for macOS (arm64/x86_64), Linux (x86_64/aarch64) and Windows (x86_64) against CPython 3.9+ (abi3).


Quick start

The examples below use the Kaggle US accidents 2016-2023 dataset.

import asyncio
from datap_rs.datapress import DataPress, DataPressConfig, DatasetConfig

async def main() -> None:
    ds = DatasetConfig(
        name="accidents",
        source="data/accidents.parquet",
        format="parquet",          # or "delta"
        mode="auto",               # eq-index policy: "auto" | "none" | "list"
        description="US accidents 2016-2023",
    )
    cfg = DataPressConfig(
        backend="datafusion",      # or "duckdb"
        listen="0.0.0.0",
        port=8000,
        workers=8,
    )
    server = DataPress(cfg, datasets=[ds])
    await server.run()              # blocks until SIGINT

if __name__ == "__main__":
    asyncio.run(main())

Hit it:

curl http://localhost:8000/api/datasets
curl http://localhost:8000/api/datasets/accidents/schema
curl -X POST http://localhost:8000/api/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "columns": ["ID","Severity","City","State"],
    "predicates": [
      { "col": "State",    "op": "eq",  "val": "TX" },
      { "col": "Severity", "op": "gte", "val": 3   }
    ],
    "page": 1, "page_size": 50
  }'
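
The same query can be issued from Python with only the standard library. This is a minimal sketch, assuming the quick-start server is listening on localhost:8000; the helper names build_query_request and run_query are made up for illustration and are not part of datap-rs.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: the quick-start server


def build_query_request(dataset, body, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /query route."""
    url = f"{base_url}/api/datasets/{dataset}/query"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(dataset, body, base_url=BASE_URL):
    """Send the request and decode the JSON response."""
    req = build_query_request(dataset, body, base_url)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


body = {
    "columns": ["ID", "Severity", "City", "State"],
    "predicates": [
        {"col": "State", "op": "eq", "val": "TX"},
        {"col": "Severity", "op": "gte", "val": 3},
    ],
    "page": 1,
    "page_size": 50,
}
request = build_query_request("accidents", body)
# run_query("accidents", body) would send it to a running server.
```

Building the request separately from sending it makes the payload easy to inspect or log before it goes over the wire.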

API surface

Four classes, no module-level state:

| Class | Purpose |
| --- | --- |
| DataPressConfig | Server tuning: backend, listen, port, workers, prefix. |
| DatasetConfig | One dataset: name, source, format, mode, optional S3 + index. |
| S3Config | S3 / S3-compatible credentials and endpoint config. |
| DataPress | Built from a DataPressConfig + list of DatasetConfig. await .run(). |

Hover over any of them in your IDE for full kwarg docs.

S3 / S3-compatible sources

from datap_rs.datapress import DataPress, DataPressConfig, DatasetConfig, S3Config

s3 = S3Config(
    region="us-east-1",
    endpoint="http://localhost:9000",   # MinIO / R2 / Wasabi / Backblaze
    addressing_style="path",            # or "virtual"
    allow_http=True,                    # only for non-https endpoints
)

ds = DatasetConfig(
    name="events",
    source="s3://events/2025/",
    format="parquet",                    # or "delta"
    s3=s3,
)

Credentials fall back to the standard AWS env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_REGION) when not set inline.
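
The fallback order described above (inline value first, then the environment) can be pictured roughly like this. It is only a sketch: the field names access_key_id, secret_access_key and session_token are hypothetical stand-ins, and the real resolution happens inside the library.

```python
import os

# Hypothetical field name -> environment variable named in the README.
ENV_FALLBACKS = {
    "access_key_id": "AWS_ACCESS_KEY_ID",
    "secret_access_key": "AWS_SECRET_ACCESS_KEY",
    "session_token": "AWS_SESSION_TOKEN",
    "region": "AWS_REGION",
}


def resolve_s3_settings(inline, env=None):
    """Prefer inline values; fall back to the AWS env vars when unset."""
    env = os.environ if env is None else env
    resolved = {}
    for field, var in ENV_FALLBACKS.items():
        value = inline.get(field)
        if value is None:
            value = env.get(var)
        resolved[field] = value
    return resolved


example = resolve_s3_settings(
    {"region": "us-east-1"},
    env={
        "AWS_ACCESS_KEY_ID": "AKIAEXAMPLE",
        "AWS_SECRET_ACCESS_KEY": "example-secret",
        "AWS_REGION": "eu-west-1",
    },
)
```

Here the inline region wins over AWS_REGION, the keys come from the environment, and session_token stays unset.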

Behind a reverse proxy

Set prefix to mount every route under a URL path — handy when nginx / Traefik / Caddy forwards the prefix verbatim:

DataPressConfig(backend="datafusion", port=8000, prefix="/datapress")
# → GET /datapress/api/datasets, GET /datapress/health, ...

prefix must start with / and not end with /. Empty string (default) mounts at the root.
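
If you build the prefix from deployment config, a small pre-flight check can catch mistakes early. The sketch below only mirrors the rule stated above; is_valid_prefix and mounted_path are hypothetical helpers, not functions exported by datap-rs.

```python
def is_valid_prefix(prefix: str) -> bool:
    """Empty string mounts at the root; otherwise '/...' with no trailing '/'."""
    if prefix == "":
        return True
    return prefix.startswith("/") and not prefix.endswith("/")


def mounted_path(prefix: str, route: str) -> str:
    """Return the externally visible path for a route under a prefix."""
    if not is_valid_prefix(prefix):
        raise ValueError(f"invalid prefix: {prefix!r}")
    return prefix + route


health_path = mounted_path("/datapress", "/health")
```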

Equality-index policy (DataFusion only)

DatasetConfig(
    name="big",
    source="data/big.parquet",
    mode="list",                                  # "auto" | "none" | "list"
    index_columns=["State", "Severity"],          # required for "list"
    index_max_cardinality=100_000,                # used by "auto"
)
  • auto — index every column whose distinct count stays below index_max_cardinality.
  • none — skip the index; every query goes through DataFusion SQL.
  • list — index only index_columns. Best for very wide datasets.

DuckDB ignores this block.
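
To make the three modes concrete, here is a rough sketch of how the set of indexed columns could be derived. It assumes "stays below" means a strict less-than comparison, and the distinct counts are invented for the example; the library's actual selection logic may differ.

```python
def choose_index_columns(mode, distinct_counts, index_max_cardinality,
                         index_columns=None):
    """Return the columns that would get an equality index under each mode."""
    if mode == "none":
        return []
    if mode == "list":
        if not index_columns:
            raise ValueError('mode="list" requires index_columns')
        return list(index_columns)
    if mode == "auto":
        # Assumption: strictly below the threshold.
        return [col for col, n in distinct_counts.items()
                if n < index_max_cardinality]
    raise ValueError(f"unknown mode: {mode!r}")


# Made-up distinct counts, for illustration only.
counts = {"State": 49, "Severity": 4, "ID": 7_000_000}
auto_cols = choose_index_columns("auto", counts, 100_000)
list_cols = choose_index_columns("list", counts, 100_000, ["State"])
```

With these numbers, auto skips the near-unique ID column, which is the case where an index costs memory without much benefit.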


HTTP API

Same six routes for both backends.

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /health | Liveness probe. |
| GET | /api/datasets | List configured datasets. |
| GET | /api/datasets/{name}/schema | Inferred columns + sample row. |
| POST | /api/datasets/{name}/query | Filter + paginate. |
| POST | /api/datasets/{name}/count | Total or filtered row count. |
| POST | /api/datasets/{name}/reload | Atomic dataset reload (requires admin token). |

Query body

{
  "columns":   ["ID","City","State","Severity"],
  "predicates": [
    { "col": "State",    "op": "eq",  "val": "TX" },
    { "col": "Severity", "op": "gte", "val": 3   }
  ],
  "order_by": [ { "col": "Severity", "dir": "desc" } ],
  "limit":     1000,
  "page":      1,
  "page_size": 50
}

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| columns | string[] | [] | Empty = all columns. |
| predicates | Predicate[] | [] | ANDed together. |
| order_by | OrderBy[] | [] | { col, dir? }; dir is asc (default) or desc. |
| limit | int or null | null | Hard cap on total rows across pages. |
| page | int >= 1 | 1 | 1-based. |
| page_size | int 1..=1000 | 100 | Clamped. |
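
The interaction between page, page_size and limit can be sketched as a small offset calculation. The clamping range and defaults come from the field notes above; how limit truncates the final page is an assumption about the semantics of "hard cap on total rows across pages", not confirmed behaviour.

```python
def page_window(page=1, page_size=100, limit=None):
    """Return (offset, row_count) for one page of results."""
    if page < 1:
        raise ValueError("page is 1-based")
    size = max(1, min(page_size, 1000))      # clamp to 1..=1000
    offset = (page - 1) * size
    count = size
    if limit is not None:
        # Assumption: limit caps the rows summed over all pages.
        count = max(0, min(size, limit - offset))
    return offset, count


first_page = page_window()
last_partial = page_window(page=2, page_size=50, limit=75)
```

Under this reading, a client can stop paging as soon as a page comes back with fewer rows than page_size.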

Predicate operators

| op | val | Meaning |
| --- | --- | --- |
| eq | scalar | col = val |
| neq | scalar | col <> val |
| gt / gte | number / string | col > val / col >= val |
| lt / lte | number / string | col < val / col <= val |
| like | string with %/_ | SQL LIKE |
| ilike | string with %/_ | Case-insensitive LIKE |
| in | non-empty array | col IN (v1, v2, …) |
| is_null | omit | col IS NULL |
| is_not_null | omit | col IS NOT NULL |
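
One way to read the operator table is as a mapping from predicates to a parameterised SQL WHERE clause. The sketch below is illustrative only: it is not how datap-rs builds its queries, and the `?` placeholder style and helper names are assumptions.

```python
BINARY_OPS = {
    "eq": "=", "neq": "<>",
    "gt": ">", "gte": ">=",
    "lt": "<", "lte": "<=",
    "like": "LIKE", "ilike": "ILIKE",
}


def quote_ident(name):
    """Double-quote a column name, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def predicate_to_sql(pred):
    """Translate one predicate into (sql_fragment, params)."""
    col, op = quote_ident(pred["col"]), pred["op"]
    if op in BINARY_OPS:
        return f"{col} {BINARY_OPS[op]} ?", [pred["val"]]
    if op == "in":
        vals = pred.get("val")
        if not isinstance(vals, list) or not vals:
            raise ValueError("'in' needs a non-empty array")
        marks = ", ".join("?" for _ in vals)
        return f"{col} IN ({marks})", list(vals)
    if op == "is_null":
        return f"{col} IS NULL", []
    if op == "is_not_null":
        return f"{col} IS NOT NULL", []
    raise ValueError(f"unknown op: {op!r}")


def where_clause(predicates):
    """AND all predicates together, collecting parameters in order."""
    parts, params = [], []
    for pred in predicates:
        sql, vals = predicate_to_sql(pred)
        parts.append(sql)
        params.extend(vals)
    return " AND ".join(parts), params


sql, params = where_clause([
    {"col": "State", "op": "eq", "val": "TX"},
    {"col": "Severity", "op": "gte", "val": 3},
])
```

Keeping values in a separate parameter list, rather than formatting them into the string, is what keeps user-supplied predicate values out of the SQL text.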

Count body

Same predicate shape, no projection or pagination:

{ "predicates": [ { "col": "State", "op": "eq", "val": "TX" } ] }

Response: { "count": <int> }. Empty body ({}) counts every row. On materialised DataFusion datasets, the no-predicate case is O(1) and indexed eq / in predicates short-circuit through the equality index.

curl -X POST http://localhost:8000/api/datasets/accidents/count \
  -H 'Content-Type: application/json' -d '{}'
# → { "count": 7728394 }
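
To see why indexed eq / in counts can skip a scan, consider a toy equality index: a per-column map from value to the row ids that hold it. This is a conceptual sketch in Python, not the Rust data structure datap-rs uses, and the sample rows are invented.

```python
from collections import defaultdict


class EqualityIndex:
    """Toy value -> row-id postings index over a list of dict rows."""

    def __init__(self, rows, columns):
        self.total = len(rows)
        self.index = {col: defaultdict(list) for col in columns}
        for row_id, row in enumerate(rows):
            for col in columns:
                self.index[col][row[col]].append(row_id)

    def count(self, predicates=()):
        if not predicates:
            return self.total                 # no scan needed
        if len(predicates) == 1:
            pred = predicates[0]
            postings = self.index.get(pred["col"])
            if postings is not None and pred["op"] == "eq":
                return len(postings.get(pred["val"], ()))
            if postings is not None and pred["op"] == "in":
                return sum(len(postings.get(v, ())) for v in set(pred["val"]))
        raise NotImplementedError("fall back to a scan or the SQL engine")


rows = [
    {"State": "TX", "Severity": 3},
    {"State": "TX", "Severity": 2},
    {"State": "CA", "Severity": 4},
    {"State": "NY", "Severity": 1},
]
idx = EqualityIndex(rows, ["State", "Severity"])
tx_count = idx.count([{"col": "State", "op": "eq", "val": "TX"}])
```

The count for an indexed value is just the length of its postings list, so the cost no longer depends on the number of rows in the dataset.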

Admin reload

POST /api/datasets/{name}/reload rebuilds a dataset from its source and atomically swaps it in. It requires the X-Admin-Token header to match the ADMIN_TOKEN env var. The endpoint is disabled when ADMIN_TOKEN is unset (secure default).

import os
os.environ["ADMIN_TOKEN"] = "supersecret"     # before constructing DataPress

curl -X POST -H "X-Admin-Token: supersecret" \
  http://localhost:8000/api/datasets/accidents/reload
# → { "dataset": "accidents", "rows": 7728394, "elapsed_ms": 1842 }
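
A reload can also be triggered from Python, for example at the end of an ETL job. This is a minimal standard-library sketch, assuming the server runs on localhost:8000; build_reload_request and trigger_reload are hypothetical helper names.

```python
import json
import urllib.request


def build_reload_request(dataset, token, base_url="http://localhost:8000"):
    """Build a POST to the reload route carrying the admin token header."""
    return urllib.request.Request(
        f"{base_url}/api/datasets/{dataset}/reload",
        headers={"X-Admin-Token": token},
        method="POST",
    )


def trigger_reload(dataset, token, base_url="http://localhost:8000"):
    """Send the reload request and return the decoded JSON body."""
    req = build_reload_request(dataset, token, base_url)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


reload_request = build_reload_request("accidents", "supersecret")
# trigger_reload("accidents", "supersecret") would call a running server.
```

In real use, read the token from a secret store or the environment rather than hard-coding it.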

Double-buffered, zero-downtime swap. Reload builds the new dataset off to the side: parquet decode and the equality-index build run on a worker thread while the old snapshot is still being served. A single ArcSwap::store then flips the pointer in the shared map. In-flight queries finish against the old Arc; the next request sees the new data. The old buffers are dropped lazily once the last reader releases its reference, so there are no locks, no GC pause and no "loading…" window. If the rebuild fails, the swap simply doesn't happen and the old snapshot stays live.

Reloads of the same dataset are serialised by an async mutex; reloads of different datasets run in parallel. Peak RSS roughly doubles for the dataset being reloaded while both buffers are resident.
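
The swap pattern can be illustrated with a small Python analogue. Treat it as a sketch only: the real implementation is Rust and uses ArcSwap, whereas here a single dict assignment stands in for the pointer flip and threading.Lock stands in for the async mutex. DatasetStore is a made-up name.

```python
import threading


class DatasetStore:
    """Toy snapshot store: readers grab a reference, reloads swap it."""

    def __init__(self):
        self._snapshots = {}                 # name -> immutable snapshot
        self._reload_locks = {}              # name -> per-dataset lock
        self._locks_guard = threading.Lock()

    def get(self, name):
        # Readers take no lock; they just keep whatever reference they got.
        return self._snapshots[name]

    def _lock_for(self, name):
        with self._locks_guard:
            return self._reload_locks.setdefault(name, threading.Lock())

    def reload(self, name, build):
        # Serialise reloads of the same dataset only.
        with self._lock_for(name):
            new_snapshot = build()           # if this raises, nothing is swapped
            self._snapshots[name] = new_snapshot
            return new_snapshot


def failing_build():
    raise RuntimeError("source unavailable")


store = DatasetStore()
store.reload("accidents", lambda: ("v1 rows",))
in_flight = store.get("accidents")           # a query holding the old snapshot
store.reload("accidents", lambda: ("v2 rows",))
try:
    store.reload("accidents", failing_build)
except RuntimeError:
    pass                                     # the v2 snapshot stays live
```

The reader that fetched its snapshot before the second reload keeps seeing v1 until it drops the reference, which is the property that lets in-flight queries finish undisturbed.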


Choosing a backend

  • DuckDB — the safe default. Handles arbitrary SQL well, manages its own buffer pool, starts up in milliseconds because it lazily reads parquet pages on demand.
  • DataFusion — pick when the data fits in RAM and you repeatedly query the same columns with equality / IN predicates; the eq-index turns those into O(1) lookups. Also produces a leaner static binary (no vendored C++).

Both engines are compiled into the same wheel — switching is one keyword argument away.


Logging

datapress initialises env_logger on import. Control verbosity with the standard RUST_LOG variable:

RUST_LOG=info  python example.py
RUST_LOG=debug python example.py
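
Because the logger is initialised on import, setting the level from inside Python only works if it happens before the datap_rs import. A hedged sketch of that ordering follows; configure_rust_log is a hypothetical helper, and it deliberately leaves an already-set RUST_LOG untouched.

```python
import os


def configure_rust_log(level="info", env=None):
    """Set RUST_LOG unless the caller's environment already defines it."""
    env = os.environ if env is None else env
    env.setdefault("RUST_LOG", level)
    return env["RUST_LOG"]


effective_level = configure_rust_log("debug")
# Only import the package after this point, for example:
# from datap_rs.datapress import DataPress
```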

License

MIT. See LICENSE in the source repo.

Source, issue tracker and Rust crates: https://github.com/jeroenflvr/fast-api
