
rustream

Fast Postgres to Parquet sync tool. Reads tables from Postgres, writes Parquet files to local disk or S3. Supports incremental sync via updated_at watermark tracking.

Installation

From PyPI

pipx install rustream
# or
pip install rustream

From source

git clone https://github.com/kraftaa/rustream.git
cd rustream
cargo build --release
# binary is at target/release/rustream

With maturin (local dev)

pip install maturin
maturin develop --release
# now `rustream` is on your PATH

Usage

# Copy and edit the example config
cp config.example.yaml config.yaml

# Preview what will be synced (no files written)
rustream sync --config config.yaml --dry-run

# Run sync
rustream sync --config config.yaml

Enable debug logging with RUST_LOG:

RUST_LOG=rustream=debug rustream sync --config config.yaml

Configuration

Specific tables (recommended)

postgres:
  host: localhost
  database: mydb
  user: postgres
  password: secret

output:
  type: local
  path: ./output

tables:
  - name: users
    incremental_column: updated_at
    columns:          # optional: pick specific columns
      - id
      - email
      - created_at
      - updated_at

  - name: orders
    incremental_column: updated_at

  - name: products    # no incremental_column = full sync every run

All tables (auto-discover)

Omit the tables section to sync every table in the schema. Use exclude to skip specific tables:

postgres:
  host: localhost
  database: mydb
  user: postgres

output:
  type: local
  path: ./output

# schema: public    # default
exclude:
  - schema_migrations
  - ar_internal_metadata

S3 output

output:
  type: s3
  bucket: my-data-lake
  prefix: raw/postgres
  region: us-east-1

AWS credentials come from environment variables, ~/.aws/credentials, or IAM role.

Config reference

Field                          Description
postgres.host                  Postgres host
postgres.port                  Postgres port (default: 5432)
postgres.database              Database name
postgres.user                  Database user
postgres.password              Database password (optional)
output.type                    local or s3
output.path                    Local directory for Parquet files (when type=local)
output.bucket                  S3 bucket (when type=s3)
output.prefix                  S3 key prefix (when type=s3)
output.region                  AWS region (when type=s3, optional)
batch_size                     Rows per Parquet file (default: 10000)
state_dir                      Directory for SQLite watermark state (default: .rustream_state)
schema                         Schema to discover tables from (default: public)
exclude                        List of table names to skip when using auto-discovery
tables[].name                  Table name
tables[].schema                Schema name (default: public)
tables[].columns               Columns to sync (default: all)
tables[].incremental_column    Column for watermark-based incremental sync
tables[].partition_by          Partition output files: date, month, or year
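As an illustration, a config combining batch_size, state_dir, and partition_by might look like this (a sketch; the per-partition output path shown in the comment is an assumption, not documented behavior):

```yaml
batch_size: 50000              # larger Parquet files per batch
state_dir: .rustream_state

tables:
  - name: events
    incremental_column: updated_at
    partition_by: date         # hypothetically: output/events/2024-06-01/*.parquet
```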

How it works

  1. Connects to Postgres and introspects each table's schema via information_schema
  2. Maps Postgres column types to Arrow types automatically
  3. Reads rows in batches, converting to Arrow RecordBatches
  4. Writes each batch as a Snappy-compressed Parquet file
  5. Tracks the high watermark (max value of incremental_column) in local SQLite
  6. On next run, only reads rows where incremental_column > last_watermark

Tables without incremental_column do a full sync every run.
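The watermark bookkeeping in steps 5 and 6 can be sketched roughly as follows. This is a Python stand-in for illustration only; the state table name, its schema, and the query shape are assumptions, not rustream's internal layout.

```python
import sqlite3

def get_watermark(conn, table):
    """Return the stored high watermark for a table, or None (full sync)."""
    row = conn.execute(
        "SELECT watermark FROM sync_state WHERE table_name = ?", (table,)
    ).fetchone()
    return row[0] if row else None

def set_watermark(conn, table, value):
    """Upsert the max incremental_column value seen during this run."""
    conn.execute(
        "INSERT INTO sync_state (table_name, watermark) VALUES (?, ?) "
        "ON CONFLICT(table_name) DO UPDATE SET watermark = excluded.watermark",
        (table, value),
    )
    conn.commit()

def build_query(table, incremental_column, watermark):
    """Incremental runs only read rows past the last watermark."""
    if incremental_column and watermark is not None:
        return (
            f"SELECT * FROM {table} WHERE {incremental_column} > %s "
            f"ORDER BY {incremental_column}",
            [watermark],
        )
    return f"SELECT * FROM {table}", []  # no watermark column: full sync

# Local SQLite state, as in step 5 (in-memory here for demonstration)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sync_state (table_name TEXT PRIMARY KEY, watermark TEXT)"
)
set_watermark(conn, "users", "2024-01-01T00:00:00")
sql, params = build_query("users", "updated_at", get_watermark(conn, "users"))
```

Note how a table with no incremental_column falls through to a plain `SELECT *`, matching the full-sync behavior described above.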

Supported Postgres types

Postgres               Arrow
boolean                Boolean
smallint               Int16
integer, serial        Int32
bigint, bigserial      Int64
real                   Float32
double precision       Float64
numeric / decimal      Utf8 (preserves precision)
text, varchar, char    Utf8
bytea                  Binary
date                   Date32
timestamp              Timestamp(Microsecond)
timestamptz            Timestamp(Microsecond, UTC)
uuid                   Utf8
json, jsonb            Utf8
arrays                 Utf8 (JSON serialized)
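In code, the table above amounts to a lookup keyed on the type names that information_schema reports. This Python sketch is illustrative only; the function name and the fallback-to-Utf8 behavior for unlisted types are assumptions.

```python
# Mapping from Postgres type names (as reported by information_schema)
# to Arrow type names, mirroring the table above.
PG_TO_ARROW = {
    "boolean": "Boolean",
    "smallint": "Int16",
    "integer": "Int32",
    "bigint": "Int64",
    "real": "Float32",
    "double precision": "Float64",
    "numeric": "Utf8",               # kept as text to preserve precision
    "text": "Utf8",
    "character varying": "Utf8",
    "bytea": "Binary",
    "date": "Date32",
    "timestamp without time zone": "Timestamp(Microsecond)",
    "timestamp with time zone": "Timestamp(Microsecond, UTC)",
    "uuid": "Utf8",
    "json": "Utf8",
    "jsonb": "Utf8",
}

def arrow_type(pg_type: str) -> str:
    if pg_type.endswith("[]"):       # arrays are serialized to JSON strings
        return "Utf8"
    return PG_TO_ARROW.get(pg_type, "Utf8")  # assumed fallback for unknown types
```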

Publishing

The project uses maturin to package the Rust binary as a Python wheel (the same approach as ruff and uv). The CI workflow in .github/workflows/release.yml builds wheels for Linux, macOS, and Windows, then publishes them to PyPI on tagged releases.

To publish manually:

# Build wheels for current platform
maturin build --release

# Upload to PyPI (needs PYPI_API_TOKEN)
maturin publish

License

MIT


Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

rustream-0.1.2-py3-none-win_amd64.whl (11.0 MB): Python 3, Windows x86-64

rustream-0.1.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.9 MB): Python 3, manylinux glibc 2.17+ x86-64

rustream-0.1.2-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (11.2 MB): Python 3, manylinux glibc 2.17+ ARM64

rustream-0.1.2-py3-none-macosx_11_0_arm64.whl (10.2 MB): Python 3, macOS 11.0+ ARM64

rustream-0.1.2-py3-none-macosx_10_12_x86_64.whl (11.3 MB): Python 3, macOS 10.12+ x86-64

File details

rustream-0.1.2-py3-none-win_amd64.whl was uploaded via twine/6.1.0 (CPython 3.13.7), without Trusted Publishing.

File hashes:

rustream-0.1.2-py3-none-win_amd64.whl
  SHA256       4e74aac97b54336ff451ccd663c7f9b79986313059d7fcbe99d9b8896a7ceeee
  MD5          46370dc107d3ac54eedc77055b6b34a2
  BLAKE2b-256  2f5c0ebdc7f0df249d9037d3a5bfae3bf483c9d34959052d83248c504d3cde1e

rustream-0.1.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       a2920f38c210a119266eb3fa16c525bb6b54fefac0e33901683a4b2323173560
  MD5          5c654a05ad8e81e8a4f3481424638657
  BLAKE2b-256  6dd400756a8491342107180d627c9e61eeb85b3b4ef99c222baf8c9070d499a5

rustream-0.1.2-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  SHA256       a2d3387573e5c96e2bc99b05c60a9c3cbcce5cd28ae1c703268ed48c7dd0fd0d
  MD5          72d479a6fc7a9c904bc2183c685ffe6f
  BLAKE2b-256  7a4f2ba1ed1a14dfd86a2d91bf2d82275c415e1cc0343cb4c3a322b2a93ba668

rustream-0.1.2-py3-none-macosx_11_0_arm64.whl
  SHA256       e684b839d934e36babae91303fda46752ee6272ea3eb258513713b3db3a18816
  MD5          8fcbef5d8831b9214f82931c9f7eddf2
  BLAKE2b-256  91bfe880d7b5f149972a610aaa8760179779db4a5244e1a90d19c377a9763852

rustream-0.1.2-py3-none-macosx_10_12_x86_64.whl
  SHA256       94c1869ba11cc0d92f7e3d1ade14b625cbf9a7f00900e5b59d5f92af393c96d8
  MD5          697e1ec3bb03224b7a5ff3b70562db8a
  BLAKE2b-256  33d03b2f03b868ae5b0d2dd801bf5f355c6f8ceff34cf1191ffa0edb58a4e80c
