
rustream

Fast Postgres to Parquet sync tool. Reads tables from Postgres and writes Parquet files to local disk or S3. Supports incremental sync via updated_at watermark tracking.

Installation

From PyPI

pipx install rustream
# or
pip install rustream

From source

git clone https://github.com/kraftaa/rustream.git
cd rustream
cargo build --release
# binary is at target/release/rustream

With maturin (local dev)

pip install maturin
maturin develop --release
# now `rustream` is on your PATH

Usage

# Copy and edit the example config
cp config.example.yaml config.yaml

# Preview what will be synced (no files written)
rustream sync --config config.yaml --dry-run

# Run sync
rustream sync --config config.yaml

Enable debug logging with RUST_LOG:

RUST_LOG=rustream=debug rustream sync --config config.yaml

Configuration

Specific tables (recommended)

postgres:
  host: localhost
  database: mydb
  user: postgres
  password: secret

output:
  type: local
  path: ./output

tables:
  - name: users
    incremental_column: updated_at
    columns:          # optional: pick specific columns
      - id
      - email
      - created_at
      - updated_at

  - name: orders
    incremental_column: updated_at

  - name: products    # no incremental_column = full sync every run

All tables (auto-discover)

Omit tables to sync every table in the schema. Use exclude to skip some:

postgres:
  host: localhost
  database: mydb
  user: postgres

output:
  type: local
  path: ./output

# schema: public    # default
exclude:
  - schema_migrations
  - ar_internal_metadata

S3 output

output:
  type: s3
  bucket: my-data-lake
  prefix: raw/postgres
  region: us-east-1

AWS credentials come from environment variables, ~/.aws/credentials, or an IAM role.

Config reference

Field Description
postgres.host Postgres host
postgres.port Postgres port (default: 5432)
postgres.database Database name
postgres.user Database user
postgres.password Database password (optional)
output.type local or s3
output.path Local directory for Parquet files (when type=local)
output.bucket S3 bucket (when type=s3)
output.prefix S3 key prefix (when type=s3)
output.region AWS region (when type=s3, optional)
batch_size Rows per Parquet file (default: 10000)
state_dir Directory for SQLite watermark state (default: .rustream_state)
schema Schema to discover tables from (default: public)
exclude List of table names to skip when using auto-discovery
tables[].name Table name
tables[].schema Schema name (default: public)
tables[].columns Columns to sync (default: all)
tables[].incremental_column Column for watermark-based incremental sync
tables[].partition_by Partition output files: date, month, or year
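Putting the reference together, a fuller config might look like the sketch below. All values are illustrative, and the defaults are restated explicitly per the table above:

```yaml
postgres:
  host: db.internal
  port: 5432                # default
  database: mydb
  user: postgres
  password: secret          # optional

output:
  type: s3
  bucket: my-data-lake
  prefix: raw/postgres
  region: us-east-1         # optional

batch_size: 10000           # rows per Parquet file (default)
state_dir: .rustream_state  # SQLite watermark state (default)

tables:
  - name: orders
    schema: public          # default
    incremental_column: updated_at
    partition_by: date      # date, month, or year
```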

How it works

  1. Connects to Postgres and introspects each table's schema via information_schema
  2. Maps Postgres column types to Arrow types automatically
  3. Reads rows in batches, converting to Arrow RecordBatches
  4. Writes each batch as a Snappy-compressed Parquet file
  5. Tracks the high watermark (max value of incremental_column) in local SQLite
  6. On next run, only reads rows where incremental_column > last_watermark

Tables without incremental_column do a full sync every run.
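The watermark logic in steps 5 and 6 can be sketched in a few lines of Python. This is purely illustrative, not rustream's actual implementation (which is in Rust); the table and function names are invented, and a real tool would bind query parameters rather than interpolate strings:

```python
# Sketch of watermark-based incremental sync state, backed by SQLite.
import sqlite3

def get_watermark(conn, table):
    """Return the last stored watermark for a table, or None on first run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS watermarks (table_name TEXT PRIMARY KEY, value TEXT)"
    )
    row = conn.execute(
        "SELECT value FROM watermarks WHERE table_name = ?", (table,)
    ).fetchone()
    return row[0] if row else None

def set_watermark(conn, table, value):
    """Upsert the high watermark (max incremental_column seen) after a sync."""
    conn.execute(
        "INSERT INTO watermarks (table_name, value) VALUES (?, ?) "
        "ON CONFLICT(table_name) DO UPDATE SET value = excluded.value",
        (table, value),
    )
    conn.commit()

def build_query(table, incremental_column, watermark):
    """Full sync without an incremental column; otherwise filter past the watermark."""
    if incremental_column is None:
        return f"SELECT * FROM {table}"
    if watermark is None:
        return f"SELECT * FROM {table} ORDER BY {incremental_column}"
    return (
        f"SELECT * FROM {table} "
        f"WHERE {incremental_column} > '{watermark}' "
        f"ORDER BY {incremental_column}"
    )
```

On the first run there is no watermark, so the whole table is read; after storing the max updated_at seen, the next run only reads newer rows.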

Supported Postgres types

Postgres Arrow
boolean Boolean
smallint Int16
integer, serial Int32
bigint, bigserial Int64
real Float32
double precision Float64
numeric / decimal Utf8 (preserves precision)
text, varchar, char Utf8
bytea Binary
date Date32
timestamp Timestamp(Microsecond)
timestamptz Timestamp(Microsecond, UTC)
uuid Utf8
json, jsonb Utf8
arrays Utf8 (JSON serialized)
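The mapping above can be expressed as a simple lookup. The dict below is transcribed from the table (Arrow type names are kept as plain strings; this is not a real rustream API):

```python
# Postgres -> Arrow type mapping, transcribed from the table above.
# Illustrative only: rustream performs this mapping internally in Rust.
PG_TO_ARROW = {
    "boolean": "Boolean",
    "smallint": "Int16",
    "integer": "Int32",
    "serial": "Int32",
    "bigint": "Int64",
    "bigserial": "Int64",
    "real": "Float32",
    "double precision": "Float64",
    "numeric": "Utf8",    # preserves precision
    "decimal": "Utf8",
    "text": "Utf8",
    "varchar": "Utf8",
    "char": "Utf8",
    "bytea": "Binary",
    "date": "Date32",
    "timestamp": "Timestamp(Microsecond)",
    "timestamptz": "Timestamp(Microsecond, UTC)",
    "uuid": "Utf8",
    "json": "Utf8",
    "jsonb": "Utf8",
}

def arrow_type(pg_type: str) -> str:
    # Array types (e.g. "integer[]") are serialized to JSON strings.
    if pg_type.endswith("[]"):
        return "Utf8"
    return PG_TO_ARROW[pg_type.lower()]
```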

Publishing

The project uses maturin to package the Rust binary as a Python wheel (the same approach as ruff, uv, and similar tools). The CI workflow in .github/workflows/release.yml builds wheels for Linux, macOS, and Windows, then publishes to PyPI on tagged releases.

To publish manually:

# Build wheels for current platform
maturin build --release

# Upload to PyPI (needs PYPI_API_TOKEN)
maturin publish

License

MIT

Download files

Download the file for your platform.

Source Distributions

No source distribution files available for this release.

Built Distributions


  • rustream-0.1.0-py3-none-win_amd64.whl (11.0 MB): Python 3, Windows x86-64
  • rustream-0.1.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.9 MB): Python 3, manylinux glibc 2.17+ x86-64
  • rustream-0.1.0-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (11.2 MB): Python 3, manylinux glibc 2.17+ ARM64
  • rustream-0.1.0-py3-none-macosx_11_0_arm64.whl (10.2 MB): Python 3, macOS 11.0+ ARM64
  • rustream-0.1.0-py3-none-macosx_10_12_x86_64.whl (11.3 MB): Python 3, macOS 10.12+ x86-64

File details

Details for the file rustream-0.1.0-py3-none-win_amd64.whl.

File metadata

  • Size: 11.0 MB
  • Tags: Python 3, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rustream-0.1.0-py3-none-win_amd64.whl

  • SHA256: 3abd9dba6c4d6976e1ba8f922b074a9f8642a4f50071f32ff5fd71335d5917b2
  • MD5: 634b67f9ff51dba8b8d0163fe2258143
  • BLAKE2b-256: 453cc1a579a1adbc1b89bdde777f0ee2000ff20f415519b340e2927b62f87b7c

See more details on using hashes here.

File details

Details for the file rustream-0.1.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

  • SHA256: 97004bc2194da9191f7bda7851b35701877b218a43af3a7e2663cf8e0fcc780a
  • MD5: a7dd7e73c625637677edde53bc994c4e
  • BLAKE2b-256: ba3abc563998c99d9f0bc11ec985a38efe0ca01bfd6599231912fc58e7a1bc0e

File details

Details for the file rustream-0.1.0-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

  • SHA256: 98ce4c18a6aa5697ba8b3300b85d0d0d0bb1b162414c8d7500d319f7ecaf3426
  • MD5: a4d6a24bf78248cdf47d45b0cf9beebd
  • BLAKE2b-256: b8475f06c3ba788c91f3395821aa97a0be3ab0ac58976bc6505046ee87eafa4e

File details

Details for the file rustream-0.1.0-py3-none-macosx_11_0_arm64.whl.

File hashes

  • SHA256: bd4cd0a0b630c1280b0f08db8e66a7294c0b7ec0228a20f547e5502b7a3ab44e
  • MD5: 063d138378d5be60f68c0aef0993b9a8
  • BLAKE2b-256: 97767e8f25999271c82995afe4e4e37a9fa56b6d1ee3fdbe2c82894cc79267ee

File details

Details for the file rustream-0.1.0-py3-none-macosx_10_12_x86_64.whl.

File hashes

  • SHA256: 8ce54048b23f7eac0901c64df312d21eedb03c1acf47f8c812f844cfa25319db
  • MD5: 83afb9c6067409c09228916ec58ece8f
  • BLAKE2b-256: 4bd93ebfdc47d4c43fe4ee88d9a29f763ebf48eb70b0e881dae3ab67a7bb795b
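To check a downloaded wheel against the published SHA256 digests above, Python's standard hashlib suffices. A minimal sketch (the path in the usage comment is a placeholder):

```python
# Compute the SHA256 of a file in chunks, for comparing against a published digest.
import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage (placeholder path; compare against the digest listed above):
# sha256_of("rustream-0.1.0-py3-none-win_amd64.whl")
```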
