Skip to main content

Fast multi-backend (DuckDB / DataFusion) dataset HTTP server.

Project description

datapress

PyPI Python

A fast multi-backend dataset HTTP server, built in Rust and driven from Python.

datapress exposes one or more Parquet or Delta datasets over a small JSON HTTP API. It ships with two pluggable engines bundled into a single wheel — pick one at runtime:

  • DuckDB — battle-tested SQL, lazy parquet reads, low startup.
  • DataFusion — pure-Rust, in-memory RecordBatch + equality index for low-latency point lookups.

Identical request/response shapes across both, so you can A/B them under your real workload.


Install

pip install datapress
# or
uv pip install datapress

Wheels are published for macOS (arm64/x86_64), Linux (x86_64/aarch64) and Windows (x86_64) against CPython 3.9+ (abi3).


Quick start

import asyncio
from datapress import DataPress, DataPressConfig, DatasetConfig

async def main() -> None:
    ds = DatasetConfig(
        name="accidents",
        source="data/accidents.parquet",
        format="parquet",          # or "delta"
        mode="auto",               # eq-index policy: "auto" | "none" | "list"
        description="US accidents 2016-2023",
    )
    cfg = DataPressConfig(
        backend="datafusion",      # or "duckdb"
        listen="0.0.0.0",
        port=8000,
        workers=8,
    )
    server = DataPress(cfg, datasets=[ds])
    await server.run()              # blocks until SIGINT

if __name__ == "__main__":
    asyncio.run(main())

Hit it:

curl http://localhost:8000/api/datasets
curl http://localhost:8000/api/datasets/accidents/schema
curl -X POST http://localhost:8000/api/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "columns": ["ID","Severity","City","State"],
    "predicates": [
      { "col": "State",    "op": "eq",  "val": "TX" },
      { "col": "Severity", "op": "gte", "val": 3   }
    ],
    "page": 1, "page_size": 50
  }'

API surface

Four classes, no module-level state:

Class Purpose
DataPressConfig Server tuning: backend, listen, port, workers, prefix.
DatasetConfig One dataset: name, source, format, mode, optional S3 + index.
S3Config S3 / S3-compatible credentials and endpoint config.
DataPress Built from a DataPressConfig + list of DatasetConfig. await .run().

Hover any of them in your IDE for full kwarg docs.

S3 / S3-compatible sources

from datapress import DataPress, DataPressConfig, DatasetConfig, S3Config

s3 = S3Config(
    region="us-east-1",
    endpoint="http://localhost:9000",   # MinIO / R2 / Wasabi / Backblaze
    addressing_style="path",            # or "virtual"
    allow_http=True,                    # only for non-https endpoints
)

ds = DatasetConfig(
    name="events",
    source="s3://events/2025/",
    format="parquet",                    # or "delta"
    s3=s3,
)

Credentials fall back to the standard AWS env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_REGION) when not set inline.

Behind a reverse proxy

Set prefix to mount every route under a URL path — handy when nginx / Traefik / Caddy forwards the prefix verbatim:

DataPressConfig(backend="datafusion", port=8000, prefix="/datapress")
# → GET /datapress/api/datasets, GET /datapress/health, ...

prefix must start with / and not end with /. Empty string (default) mounts at the root.

Equality-index policy (DataFusion only)

DatasetConfig(
    name="big",
    source="data/big.parquet",
    mode="list",                                  # "auto" | "none" | "list"
    index_columns=["State", "Severity"],          # required for "list"
    index_max_cardinality=100_000,                # used by "auto"
)
  • auto — index every column whose distinct count stays below index_max_cardinality.
  • none — skip the index; every query goes through DataFusion SQL.
  • list — index only index_columns. Best for very wide datasets.

DuckDB ignores this block.


HTTP API

Same five routes for both backends.

Method Path Purpose
GET /health Liveness probe.
GET /api/datasets List configured datasets.
GET /api/datasets/{name}/schema Inferred columns + sample row.
POST /api/datasets/{name}/query Filter + paginate.
POST /api/datasets/{name}/count Total or filtered row count.
POST /api/datasets/{name}/reload Atomic dataset reload (requires admin token).

Query body

{
  "columns":   ["ID","City","State","Severity"],
  "predicates": [
    { "col": "State",    "op": "eq",  "val": "TX" },
    { "col": "Severity", "op": "gte", "val": 3   }
  ],
  "page":      1,
  "page_size": 50
}
Field Type Default Notes
columns string[] [] Empty = all columns.
predicates Predicate[] [] ANDed together.
page int >= 1 1 1-based.
page_size int 1..=1000 100 Clamped.

Predicate operators

op val Meaning
eq scalar col = val
neq scalar col <> val
gt / gte number / string col > val / col >= val
lt / lte number / string col < val / col <= val
like string with %/_ SQL LIKE
ilike string with %/_ Case-insensitive LIKE
in non-empty array col IN (v1, v2, …)
is_null omit col IS NULL
is_not_null omit col IS NOT NULL

Count body

Same predicate shape, no projection or pagination:

{ "predicates": [ { "col": "State", "op": "eq", "val": "TX" } ] }

Response: { "count": <int> }. Empty body ({}) counts every row. On materialised DataFusion datasets, the no-predicate case is O(1) and indexed eq / in predicates short-circuit through the equality index.

curl -X POST http://localhost:8000/api/datasets/accidents/count \
  -H 'Content-Type: application/json' -d '{}'
# → { "count": 7728394 }

Admin reload

POST /api/datasets/{name}/reload rebuilds a dataset from its source and atomically swaps it in. Requires the X-Admin-Token header to match the ADMIN_TOKEN env var. Endpoint is disabled when ADMIN_TOKEN is unset (secure default).

import os
os.environ["ADMIN_TOKEN"] = "supersecret"     # before constructing DataPress
curl -X POST -H "X-Admin-Token: supersecret" \
  http://localhost:8000/api/datasets/accidents/reload
# → { "dataset": "accidents", "rows": 7728394, "elapsed_ms": 1842 }

Choosing a backend

  • DuckDB — the safe default. Handles arbitrary SQL well, manages its own buffer pool, starts up in milliseconds because it lazily reads parquet pages on demand.
  • DataFusion — pick when the data fits in RAM and you repeatedly query the same columns with equality / IN predicates; the eq-index turns those into O(1) lookups. Also produces a leaner static binary (no vendored C++).

Both engines are compiled into the same wheel — switching is one keyword argument away.


Logging

datapress initialises env_logger on import. Control verbosity with the standard RUST_LOG variable:

RUST_LOG=info  python example.py
RUST_LOG=debug python example.py

License

MIT. See LICENSE in the source repo.

Source, issue tracker and Rust crates: https://github.com/jeroenflvr/fast-api

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datap_rs-0.1.11.tar.gz (86.1 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

datap_rs-0.1.11-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (66.3 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.17+ x86-64

datap_rs-0.1.11-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (62.4 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.17+ ARM64

datap_rs-0.1.11-cp39-abi3-macosx_11_0_arm64.whl (58.5 MB view details)

Uploaded CPython 3.9+macOS 11.0+ ARM64

File details

Details for the file datap_rs-0.1.11.tar.gz.

File metadata

  • Download URL: datap_rs-0.1.11.tar.gz
  • Upload date:
  • Size: 86.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for datap_rs-0.1.11.tar.gz
Algorithm Hash digest
SHA256 8de7e792ae65526d1fa101f40a1c4b71263e56b9f896b83c3e4095c844a55606
MD5 70bf1d9f20bf125ae1f90f53fd79b6ba
BLAKE2b-256 0384d904f9ff74c7524e00f0bcf05f87ac5226185bbdefcc68af28df38d84674

See more details on using hashes here.

File details

Details for the file datap_rs-0.1.11-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for datap_rs-0.1.11-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 07100052a9d103766cebc78bddbfd0f9f1d5d17120cde516a388f8607c2b42d8
MD5 b78d0cc9e38812e5e38c550fb20529f2
BLAKE2b-256 5aee18509ec1849d74dbd6d5e62c0c79f0181d57852de1edd984e1c8440f03b1

See more details on using hashes here.

File details

Details for the file datap_rs-0.1.11-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for datap_rs-0.1.11-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 fbae42ef0e71cb601adf7462b9705970fba40faf9f2c47da82e4c3dccde4a2a1
MD5 5fc9cc84b6674c0f372409e2839da706
BLAKE2b-256 f4b6784a829fe1d7c462085228fee6f0221bad4b525640dd73c9d75b00e40b2c

See more details on using hashes here.

File details

Details for the file datap_rs-0.1.11-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for datap_rs-0.1.11-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 08f7fde5f20afde6a85480805f64057f0503594154e7054a577b144c950a593f
MD5 97eef5cc2bc2473099ad2258122c6283
BLAKE2b-256 a7ad56e675fb00b81b3c7db346e4f9d32fb542d5f2b0ef6296b47020f7afdebc

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page