Fast Postgres → Parquet sync tool
rustream
Fast Postgres to Parquet sync tool. Reads tables from Postgres and writes Parquet files to local disk or S3. Supports incremental sync by tracking a watermark on a column such as updated_at.
Installation
From PyPI
pipx install rustream
# or
pip install rustream
From source
git clone https://github.com/kraftaa/rustream.git
cd rustream
cargo build --release
# binary is at target/release/rustream
With maturin (local dev)
pip install maturin
maturin develop --release
# now `rustream` is on your PATH
Usage
# Copy and edit the example config
cp config.example.yaml config.yaml
# Preview what will be synced (no files written)
rustream sync --config config.yaml --dry-run
# Run sync
rustream sync --config config.yaml
Enable debug logging with RUST_LOG:
RUST_LOG=rustream=debug rustream sync --config config.yaml
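For scheduled jobs, the same commands can be driven from a script. The following is a minimal sketch, assuming `rustream` is on your PATH; `build_sync_command` and `run_sync` are hypothetical helper names, and only the `sync`, `--config`, and `--dry-run` arguments and the `RUST_LOG` variable shown above are used.

```python
import os
import subprocess


def build_sync_command(config_path, dry_run=False):
    """Return the argv list for a `rustream sync` invocation."""
    cmd = ["rustream", "sync", "--config", config_path]
    if dry_run:
        # Preview only: no files are written.
        cmd.append("--dry-run")
    return cmd


def run_sync(config_path, dry_run=False, debug=False):
    """Run rustream and raise CalledProcessError on a non-zero exit code."""
    env = dict(os.environ)
    if debug:
        # Same setting as the RUST_LOG example above.
        env["RUST_LOG"] = "rustream=debug"
    return subprocess.run(
        build_sync_command(config_path, dry_run), env=env, check=True
    )
```

Calling `run_sync("config.yaml", dry_run=True)` before the real run mirrors the preview-then-sync order shown above.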
Configuration
Specific tables (recommended)
postgres:
  host: localhost
  database: mydb
  user: postgres
  password: secret
output:
  type: local
  path: ./output
tables:
  - name: users
    incremental_column: updated_at
    columns: # optional: pick specific columns
      - id
      - email
      - created_at
      - updated_at
  - name: orders
    incremental_column: updated_at
  - name: products # no incremental_column = full sync every run
All tables (auto-discover)
Omit tables to sync every table in the schema. Use exclude to skip some:
postgres:
  host: localhost
  database: mydb
  user: postgres
output:
  type: local
  path: ./output
# schema: public # default
exclude:
  - schema_migrations
  - ar_internal_metadata
S3 output
output:
  type: s3
  bucket: my-data-lake
  prefix: raw/postgres
  region: us-east-1
AWS credentials come from environment variables, ~/.aws/credentials, or an IAM role.
Config reference
| Field | Description |
|---|---|
| `postgres.host` | Postgres host |
| `postgres.port` | Postgres port (default: 5432) |
| `postgres.database` | Database name |
| `postgres.user` | Database user |
| `postgres.password` | Database password (optional) |
| `output.type` | local or s3 |
| `output.path` | Local directory for Parquet files (when type=local) |
| `output.bucket` | S3 bucket (when type=s3) |
| `output.prefix` | S3 key prefix (when type=s3) |
| `output.region` | AWS region (when type=s3, optional) |
| `batch_size` | Rows per Parquet file (default: 10000) |
| `state_dir` | Directory for SQLite watermark state (default: .rustream_state) |
| `schema` | Schema to discover tables from (default: public) |
| `exclude` | List of table names to skip when using auto-discovery |
| `tables[].name` | Table name |
| `tables[].schema` | Schema name (default: public) |
| `tables[].columns` | Columns to sync (default: all) |
| `tables[].incremental_column` | Column for watermark-based incremental sync |
| `tables[].partition_by` | Partition output files: date, month, or year |
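To make the documented defaults concrete, here is a small sketch that fills them into an already-parsed config dictionary and checks the `output` section. It is illustrative only and not rustream's own loader: `apply_defaults` and `validate_output` are hypothetical names, and treating `output.path` and `output.bucket` as required is an assumption drawn from the "(when type=...)" notes in the reference.

```python
import copy

# Defaults copied from the config reference.
TOP_LEVEL_DEFAULTS = {
    "batch_size": 10000,
    "state_dir": ".rustream_state",
    "schema": "public",
}


def apply_defaults(config):
    """Return a copy of `config` with the documented defaults filled in."""
    cfg = copy.deepcopy(config)
    for key, value in TOP_LEVEL_DEFAULTS.items():
        cfg.setdefault(key, value)
    cfg.setdefault("postgres", {}).setdefault("port", 5432)
    for table in cfg.get("tables", []):
        table.setdefault("schema", "public")
    return cfg


def validate_output(config):
    """Return a list of problems found in the `output` section."""
    output = config.get("output", {})
    kind = output.get("type")
    if kind == "local":
        required = ["path"]    # assumption: required for local output
    elif kind == "s3":
        required = ["bucket"]  # assumption: region is documented as optional
    else:
        return ["output.type must be 'local' or 's3'"]
    return [
        f"output.{name} is required when type={kind}"
        for name in required
        if name not in output
    ]
```

A check like this can run before `rustream sync --dry-run` to catch a missing bucket or path early.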
How it works
- Connects to Postgres and introspects each table's schema via information_schema
- Maps Postgres column types to Arrow types automatically
- Reads rows in batches, converting to Arrow RecordBatches
- Writes each batch as a Snappy-compressed Parquet file
- Tracks the high watermark (max value of incremental_column) in local SQLite
- On next run, only reads rows where incremental_column > last_watermark
Tables without incremental_column do a full sync every run.
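The watermark logic can be sketched with Python's standard library. This is an illustration of the technique under stated assumptions, not rustream's actual code: an in-memory SQLite database stands in for Postgres, the `watermarks` state table and the function names are hypothetical, and table and column names are treated as trusted config values.

```python
import sqlite3


def init_state(state):
    """Create the (hypothetical) table that stores one watermark per table."""
    state.execute(
        "CREATE TABLE IF NOT EXISTS watermarks "
        "(table_name TEXT PRIMARY KEY, value TEXT)"
    )


def read_watermark(state, table):
    row = state.execute(
        "SELECT value FROM watermarks WHERE table_name = ?", (table,)
    ).fetchone()
    return row[0] if row else None


def sync_table(source, state, table, incremental_column, batch_size=10000):
    """Return new rows in batches and persist the new high watermark."""
    last = read_watermark(state, table)
    sql = f"SELECT * FROM {table}"
    params = ()
    if last is not None:
        # Incremental run: only rows past the stored watermark.
        sql += f" WHERE {incremental_column} > ?"
        params = (last,)
    sql += f" ORDER BY {incremental_column}"
    cursor = source.execute(sql, params)
    col = [d[0] for d in cursor.description].index(incremental_column)

    batches, high = [], last
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        batches.append(batch)   # the real tool writes a Parquet file here
        high = batch[-1][col]   # rows are ordered, so the last is the max
    if high is not None:
        state.execute(
            "INSERT OR REPLACE INTO watermarks (table_name, value) VALUES (?, ?)",
            (table, high),
        )
        state.commit()
    return batches
```

The watermark is saved only after all batches are read, so an interrupted run in this sketch re-reads the same rows next time rather than skipping them.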
Supported Postgres types
| Postgres | Arrow |
|---|---|
| boolean | Boolean |
| smallint | Int16 |
| integer, serial | Int32 |
| bigint, bigserial | Int64 |
| real | Float32 |
| double precision | Float64 |
| numeric / decimal | Utf8 (preserves precision) |
| text, varchar, char | Utf8 |
| bytea | Binary |
| date | Date32 |
| timestamp | Timestamp(Microsecond) |
| timestamptz | Timestamp(Microsecond, UTC) |
| uuid | Utf8 |
| json, jsonb | Utf8 |
| arrays | Utf8 (JSON serialized) |
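The mapping above can be expressed as a lookup table, which is handy for predicting the Parquet schema of a table before syncing it. This is a sketch only: `arrow_type_for` is a hypothetical helper, Arrow types are represented as plain strings, and raising an error for unlisted types is an assumption, since how rustream treats unsupported types is not stated here.

```python
# Postgres type name -> Arrow type name, copied from the table above.
PG_TO_ARROW = {
    "boolean": "Boolean",
    "smallint": "Int16",
    "integer": "Int32",
    "serial": "Int32",
    "bigint": "Int64",
    "bigserial": "Int64",
    "real": "Float32",
    "double precision": "Float64",
    "numeric": "Utf8",
    "decimal": "Utf8",
    "text": "Utf8",
    "varchar": "Utf8",
    "char": "Utf8",
    "bytea": "Binary",
    "date": "Date32",
    "timestamp": "Timestamp(Microsecond)",
    "timestamptz": "Timestamp(Microsecond, UTC)",
    "uuid": "Utf8",
    "json": "Utf8",
    "jsonb": "Utf8",
}


def arrow_type_for(pg_type):
    """Look up the Arrow type name for a Postgres type name."""
    name = pg_type.strip().lower()
    if name.endswith("[]"):
        # Arrays are serialized to JSON text.
        return "Utf8"
    # Drop modifiers such as varchar(255) or numeric(10, 2).
    base = name.split("(", 1)[0].strip()
    if base not in PG_TO_ARROW:
        # Assumption: unlisted types are rejected in this sketch.
        raise ValueError(f"unsupported Postgres type: {pg_type}")
    return PG_TO_ARROW[base]
```

Because numeric, uuid, and json columns all arrive as Utf8, downstream readers need to cast them explicitly if typed values are required.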
Publishing
The project uses maturin to package the Rust binary as a Python wheel (the same approach used by ruff, uv, and others). The CI workflow in .github/workflows/release.yml builds wheels for Linux, macOS, and Windows, then publishes them to PyPI on tagged releases.
To publish manually:
# Build wheels for current platform
maturin build --release
# Upload to PyPI (needs PYPI_API_TOKEN)
maturin publish
License
MIT