Fast multi-backend (DuckDB / DataFusion) dataset HTTP server.
datap-rs
██████╗ █████╗ ████████╗ █████╗ ██████╗ ██████╗ ███████╗
██╔══██╗██╔══██╗╚══██╔══╝██╔══██╗██╔══██╗ ██╔══██╗██╔════╝
██║ ██║███████║ ██║ ███████║██████╔╝█████╗██████╔╝███████╗
██║ ██║██╔══██║ ██║ ██╔══██║██╔═══╝ ╚════╝██╔══██╗╚════██║
██████╔╝██║ ██║ ██║ ██║ ██║██║ ██║ ██║███████║
╚═════╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝╚═╝ ╚═╝ ╚═╝╚══════╝
A fast multi-backend dataset HTTP server, built in Rust and driven from Python.
datap-rs (datapress) exposes one or more Parquet or Delta datasets over a small
JSON HTTP API. It ships with two pluggable engines bundled into a single
wheel — pick one at runtime:
- DuckDB: battle-tested SQL, lazy parquet reads, low startup.
- DataFusion: pure-Rust, in-memory RecordBatch + equality index for low-latency point lookups.
Identical request/response shapes across both, so you can A/B them under your real workload.
Install
pip install datap-rs
# or
uv pip install datap-rs
Wheels are published for macOS (arm64/x86_64), Linux (x86_64/aarch64) and Windows (x86_64) against CPython 3.9+ (abi3).
Quick start
The examples below use the Kaggle US Accidents (2016-2023) dataset.
import asyncio

from datap_rs.datapress import DataPress, DataPressConfig, DatasetConfig


async def main() -> None:
    ds = DatasetConfig(
        name="accidents",
        source="data/accidents.parquet",
        format="parquet",  # or "delta"
        mode="auto",  # eq-index policy: "auto" | "none" | "list"
        description="US accidents 2016-2023",
    )
    cfg = DataPressConfig(
        backend="datafusion",  # or "duckdb"
        listen="0.0.0.0",
        port=8000,
        workers=8,
    )
    server = DataPress(cfg, datasets=[ds])
    await server.run()  # blocks until SIGINT


if __name__ == "__main__":
    asyncio.run(main())
Hit it:
curl http://localhost:8000/api/datasets
curl http://localhost:8000/api/datasets/accidents/schema
curl -X POST http://localhost:8000/api/datasets/accidents/query \
-H 'Content-Type: application/json' \
-d '{
"columns": ["ID","Severity","City","State"],
"predicates": [
{ "col": "State", "op": "eq", "val": "TX" },
{ "col": "Severity", "op": "gte", "val": 3 }
],
"page": 1, "page_size": 50
}'
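The same request can be sent from Python. The sketch below uses only the standard library; `build_query_body` and `post_query` are hypothetical helper names, not part of datap-rs, and the shape of the JSON reply is not specified here, so the helper simply returns whatever the server sends back.

```python
import json
import urllib.request


def build_query_body(columns, predicates, page=1, page_size=50):
    """Assemble the JSON body accepted by the /query route."""
    return {
        "columns": list(columns),
        "predicates": [dict(p) for p in predicates],
        "page": page,
        "page_size": page_size,
    }


def post_query(base_url, dataset, body, timeout=10.0):
    """POST the body to /api/datasets/{dataset}/query and decode the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/api/datasets/{dataset}/query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


body = build_query_body(
    columns=["ID", "Severity", "City", "State"],
    predicates=[
        {"col": "State", "op": "eq", "val": "TX"},
        {"col": "Severity", "op": "gte", "val": 3},
    ],
)
# With the quick-start server running:
# result = post_query("http://localhost:8000", "accidents", body)
```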
API surface
Four classes, no module-level state:
| Class | Purpose |
|---|---|
| DataPressConfig | Server tuning: backend, listen, port, workers, prefix. |
| DatasetConfig | One dataset: name, source, format, mode, optional S3 + index. |
| S3Config | S3 / S3-compatible credentials and endpoint config. |
| DataPress | Built from a DataPressConfig + list of DatasetConfig. await .run(). |
Hover any of them in your IDE for full kwarg docs.
S3 / S3-compatible sources
from datap_rs.datapress import DataPress, DataPressConfig, DatasetConfig, S3Config

s3 = S3Config(
    region="us-east-1",
    endpoint="http://localhost:9000",  # MinIO / R2 / Wasabi / Backblaze
    addressing_style="path",  # or "virtual"
    allow_http=True,  # only for non-https endpoints
)
ds = DatasetConfig(
    name="events",
    source="s3://events/2025/",
    format="parquet",  # or "delta"
    s3=s3,
)
Credentials fall back to the standard AWS env vars
(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN,
AWS_REGION) when not set inline.
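The fallback order can be pictured as "inline value first, environment variable second". The sketch below only illustrates that rule; `resolve_s3_settings` and the inline field names (`access_key_id` and so on) are hypothetical and may not match the real S3Config keyword arguments.

```python
import os

# Hypothetical inline field name -> standard AWS environment variable.
ENV_FALLBACKS = {
    "access_key_id": "AWS_ACCESS_KEY_ID",
    "secret_access_key": "AWS_SECRET_ACCESS_KEY",
    "session_token": "AWS_SESSION_TOKEN",
    "region": "AWS_REGION",
}


def resolve_s3_settings(inline, env=None):
    """Prefer inline values; fall back to the environment when a value is missing."""
    env = os.environ if env is None else env
    resolved = {}
    for field, var in ENV_FALLBACKS.items():
        value = inline.get(field)
        if value is None:
            value = env.get(var)
        resolved[field] = value
    return resolved


settings = resolve_s3_settings(
    {"region": "us-east-1"},
    env={"AWS_ACCESS_KEY_ID": "AKIAEXAMPLE", "AWS_REGION": "eu-west-1"},
)
```

Here the inline region wins over `AWS_REGION`, the access key comes from the environment, and values set in neither place stay `None`.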
Behind a reverse proxy
Set prefix to mount every route under a URL path — handy when nginx /
Traefik / Caddy forwards the prefix verbatim:
DataPressConfig(backend="datafusion", port=8000, prefix="/datapress")
# → GET /datapress/api/datasets, GET /datapress/health, ...
prefix must start with / and not end with /. Empty string (default)
mounts at the root.
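To catch a bad prefix before it reaches the server, the rule above can be checked up front. This is a minimal sketch; `validate_prefix` and `route` are hypothetical helpers, and the library's own validation and error messages may differ.

```python
def validate_prefix(prefix: str) -> str:
    """Accept "" (mount at root) or a path that starts with "/" and does not end with "/"."""
    if prefix == "":
        return prefix
    if not prefix.startswith("/"):
        raise ValueError(f"prefix must start with '/': {prefix!r}")
    if prefix.endswith("/"):
        raise ValueError(f"prefix must not end with '/': {prefix!r}")
    return prefix


def route(prefix: str, path: str) -> str:
    """Full URL path of a route mounted under the given prefix."""
    return validate_prefix(prefix) + path


print(route("/datapress", "/api/datasets"))
print(route("", "/health"))
```

Note that a bare `"/"` is rejected under this rule, since it both starts and ends with a slash.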
Equality-index policy (DataFusion only)
DatasetConfig(
    name="big",
    source="data/big.parquet",
    mode="list",  # "auto" | "none" | "list"
    index_columns=["State", "Severity"],  # required for "list"
    index_max_cardinality=100_000,  # used by "auto"
)
- auto: index every column whose distinct count stays below index_max_cardinality.
- none: skip the index; every query goes through DataFusion SQL.
- list: index only index_columns. Best for very wide datasets.
DuckDB ignores this block.
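The three modes boil down to a column-selection rule. The sketch below restates that rule in Python under two assumptions: "stays below" is read as a strict less-than, and per-column distinct counts are already known. `select_index_columns` is a hypothetical name, not the engine's actual code.

```python
def select_index_columns(mode, distinct_counts, index_columns=None,
                         max_cardinality=100_000):
    """Return the columns that would get an equality index under each mode."""
    if mode == "none":
        return []
    if mode == "list":
        if not index_columns:
            raise ValueError('mode="list" requires index_columns')
        return list(index_columns)
    if mode == "auto":
        # Assumption: a column qualifies when its distinct count is strictly
        # below the threshold.
        return [c for c, n in distinct_counts.items() if n < max_cardinality]
    raise ValueError(f"unknown mode: {mode!r}")


counts = {"ID": 7_728_394, "State": 49, "Severity": 4}
print(select_index_columns("auto", counts))
print(select_index_columns("list", counts, index_columns=["State"]))
```

With these illustrative counts, `auto` skips the near-unique `ID` column and keeps the two low-cardinality ones.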
HTTP API
Same six routes for both backends.
| Method | Path | Purpose |
|---|---|---|
| GET | /health | Liveness probe. |
| GET | /api/datasets | List configured datasets. |
| GET | /api/datasets/{name}/schema | Inferred columns + sample row. |
| POST | /api/datasets/{name}/query | Filter + paginate. |
| POST | /api/datasets/{name}/count | Total or filtered row count. |
| POST | /api/datasets/{name}/reload | Atomic dataset reload (requires admin token). |
Query body
{
"columns": ["ID","City","State","Severity"],
"predicates": [
{ "col": "State", "op": "eq", "val": "TX" },
{ "col": "Severity", "op": "gte", "val": 3 }
],
"page": 1,
"page_size": 50
}
| Field | Type | Default | Notes |
|---|---|---|---|
| columns | string[] | [] | Empty = all columns. |
| predicates | Predicate[] | [] | ANDed together. |
| page | int >= 1 | 1 | 1-based. |
| page_size | int 1..=1000 | 100 | Clamped. |
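The paging fields map onto an offset and a limit. The sketch below shows that arithmetic with the defaults and the 1..=1000 clamp from the table; treating a `page` below 1 as page 1 is an assumption (the server may reject it instead), and `normalize_paging` is a hypothetical helper name.

```python
def normalize_paging(page=1, page_size=100):
    """Turn 1-based paging into (offset, limit), clamping page_size to 1..=1000."""
    page = max(1, int(page))  # assumption: values below 1 fall back to page 1
    limit = min(1000, max(1, int(page_size)))
    offset = (page - 1) * limit
    return offset, limit


print(normalize_paging())            # defaults
print(normalize_paging(3, 50))       # third page of 50 rows
print(normalize_paging(2, 5000))     # oversized page_size is clamped
```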
Predicate operators
| op | val | Meaning |
|---|---|---|
| eq | scalar | col = val |
| neq | scalar | col <> val |
| gt / gte | number / string | col > val / col >= val |
| lt / lte | number / string | col < val / col <= val |
| like | string with %/_ | SQL LIKE |
| ilike | string with %/_ | Case-insensitive LIKE |
| in | non-empty array | col IN (v1, v2, …) |
| is_null | omit | col IS NULL |
| is_not_null | omit | col IS NOT NULL |
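One way to read the operator table is as a translation into a parameterised WHERE clause. The sketch below is an illustration of that mapping, not the server's implementation: `build_where` and `quote_ident` are hypothetical names, the `?` placeholder style is an assumption, and the case-insensitive operator is rendered as `ILIKE` for illustration.

```python
_BINARY = {
    "eq": "=", "neq": "<>", "gt": ">", "gte": ">=",
    "lt": "<", "lte": "<=", "like": "LIKE", "ilike": "ILIKE",
}
_UNARY = {"is_null": "IS NULL", "is_not_null": "IS NOT NULL"}


def quote_ident(name):
    """Double-quote a column name, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_where(predicates):
    """Return (sql, params); clauses are ANDed together, values stay out of the SQL text."""
    clauses, params = [], []
    for p in predicates:
        col, op = quote_ident(p["col"]), p["op"]
        if op in _BINARY:
            clauses.append(f"{col} {_BINARY[op]} ?")
            params.append(p["val"])
        elif op == "in":
            values = p.get("val")
            if not isinstance(values, list) or not values:
                raise ValueError('"in" needs a non-empty array')
            placeholders = ", ".join("?" for _ in values)
            clauses.append(f"{col} IN ({placeholders})")
            params.extend(values)
        elif op in _UNARY:
            clauses.append(f"{col} {_UNARY[op]}")  # val is omitted for these
        else:
            raise ValueError(f"unknown op: {op!r}")
    return " AND ".join(clauses), params


sql, params = build_where([
    {"col": "State", "op": "eq", "val": "TX"},
    {"col": "Severity", "op": "gte", "val": 3},
])
print(sql, params)
```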
Count body
Same predicate shape, no projection or pagination:
{ "predicates": [ { "col": "State", "op": "eq", "val": "TX" } ] }
Response: { "count": <int> }. Empty body ({}) counts every row. On
materialised DataFusion datasets, the no-predicate case is O(1) and indexed
eq / in predicates short-circuit through the equality index.
curl -X POST http://localhost:8000/api/datasets/accidents/count \
-H 'Content-Type: application/json' -d '{}'
# → { "count": 7728394 }
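To see why those counts are cheap, consider a toy equality index: a map from each indexed column's values to the row ids holding them. The sketch below is a simplified Python model of that idea, not the DataFusion backend's actual data structure; `EqIndex` is a hypothetical class, and returning `None` stands in for "fall back to a regular scan or SQL query".

```python
from collections import defaultdict


class EqIndex:
    """Toy equality index: column -> value -> list of row ids."""

    def __init__(self, rows, columns):
        self.total = len(rows)  # stored once, so an unfiltered count is O(1)
        self.postings = {c: defaultdict(list) for c in columns}
        for row_id, row in enumerate(rows):
            for c in columns:
                self.postings[c][row[c]].append(row_id)

    def count(self, predicates=()):
        if not predicates:
            return self.total
        if len(predicates) == 1:
            p = predicates[0]
            by_value = self.postings.get(p["col"])
            if by_value is not None and p["op"] == "eq":
                return len(by_value.get(p["val"], ()))
            if by_value is not None and p["op"] == "in":
                # De-duplicate so a repeated value is not counted twice.
                return sum(len(by_value.get(v, ())) for v in set(p["val"]))
        return None  # not answerable from the index alone


rows = [
    {"State": "TX", "Severity": 3},
    {"State": "CA", "Severity": 2},
    {"State": "TX", "Severity": 2},
]
idx = EqIndex(rows, columns=["State"])
print(idx.count())
print(idx.count([{"col": "State", "op": "eq", "val": "TX"}]))
```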
Admin reload
POST /api/datasets/{name}/reload rebuilds a dataset from its source and
atomically swaps it in. Requires the X-Admin-Token header to match the
ADMIN_TOKEN env var. Endpoint is disabled when ADMIN_TOKEN is unset
(secure default).
import os
os.environ["ADMIN_TOKEN"] = "supersecret" # before constructing DataPress
curl -X POST -H "X-Admin-Token: supersecret" \
http://localhost:8000/api/datasets/accidents/reload
# → { "dataset": "accidents", "rows": 7728394, "elapsed_ms": 1842 }
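The gate described above amounts to two checks: is a token configured at all, and does the header match it. The following is a sketch of that logic only; `is_reload_authorized` is a hypothetical function, and both the constant-time comparison and treating an empty `ADMIN_TOKEN` as unset are assumptions rather than documented behaviour.

```python
import hmac
import os


def is_reload_authorized(header_token, env=None):
    """Return True only if ADMIN_TOKEN is set and the X-Admin-Token header matches it."""
    env = os.environ if env is None else env
    expected = env.get("ADMIN_TOKEN")
    if not expected:
        return False  # endpoint disabled: no token configured
    if header_token is None:
        return False  # header missing
    # Assumption: compare in constant time to avoid leaking the token via timing.
    return hmac.compare_digest(header_token.encode(), expected.encode())


env = {"ADMIN_TOKEN": "supersecret"}
print(is_reload_authorized("supersecret", env))
print(is_reload_authorized("wrong", env))
print(is_reload_authorized("supersecret", {}))
```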
Double-buffered, zero-downtime swap. Reload builds the new dataset
off to the side (parquet decode + equality-index build happen on a
worker thread against the old snapshot still being served), then a
single ArcSwap::store flips the pointer in the shared map. In-flight
queries finish against the old Arc; the next request sees the new
data. The old buffers are dropped lazily once the last reader releases
its reference — no locks, no GC pause, no "loading…" window. If the
rebuild fails the swap simply doesn't happen and the old snapshot stays
live. Per-dataset reloads are serialised by an async mutex; reloads of
different datasets run in parallel. Peak RSS roughly doubles for the
dataset being reloaded while both buffers are resident.
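The swap pattern can be modelled in a few lines of Python. This is an analogue under stated assumptions, not the Rust implementation: a plain attribute rebind stands in for `ArcSwap::store`, a `threading.Lock` stands in for the per-dataset async mutex, and `DatasetSlot` is a hypothetical name.

```python
import threading


class DatasetSlot:
    """Holds the live snapshot; readers never block, reloads are serialised."""

    def __init__(self, snapshot):
        self._current = snapshot
        self._reload_lock = threading.Lock()

    def load(self):
        # A reader keeps whatever snapshot it grabbed, even across a swap.
        return self._current

    def reload(self, build):
        with self._reload_lock:      # one reload at a time for this dataset
            fresh = build()          # built off to the side; may raise
            self._current = fresh    # single reference swap
            return fresh


slot = DatasetSlot(("v1",))
in_flight = slot.load()              # a query that started before the reload
slot.reload(lambda: ("v2",))
print(in_flight, slot.load())
```

If `build` raises, the assignment never runs, so the old snapshot stays live. The old object is released once the last reader drops its reference, which mirrors the lazy drop described above.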
Choosing a backend
- DuckDB: the safe default. Handles arbitrary SQL well, manages its own buffer pool, starts up in milliseconds because it lazily reads parquet pages on demand.
- DataFusion: pick when the data fits in RAM and you repeatedly query the same columns with equality / IN predicates; the eq-index turns those into O(1) lookups. Also produces a leaner static binary (no vendored C++).
Both engines are compiled into the same wheel — switching is one keyword argument away.
Logging
datapress initialises env_logger on import. Control verbosity with the
standard RUST_LOG variable:
RUST_LOG=info python example.py
RUST_LOG=debug python example.py
License
MIT. See LICENSE in the source repo.
Source, issue tracker and Rust crates: https://github.com/jeroenflvr/fast-api