
High-performance bulk data movement for Redshift


Arrowjet

The fastest way to move data in and out of cloud databases.

Arrowjet wraps each database's native bulk path in a simple Python API — no boilerplate, no S3 scripts, no slow INSERT loops.

pip install arrowjet              # core (PostgreSQL COPY, MySQL LOAD DATA, BYOC Engine, CLI)
pip install arrowjet[redshift]    # + Redshift driver (arrowjet.connect())
pip install arrowjet[full]        # + Redshift + PostgreSQL + MySQL + SQLAlchemy

Supported databases:

  • PostgreSQL / Aurora PostgreSQL / RDS PostgreSQL — via COPY protocol
  • MySQL / Aurora MySQL / RDS MySQL / MariaDB — via LOAD DATA LOCAL INFILE
  • Amazon Redshift — via COPY/UNLOAD through S3

Why Arrowjet

Standard database drivers move data row-by-row. For large datasets, this is the bottleneck — not the database.

Arrowjet uses each database's native bulk path instead. There is no slow path.

PostgreSQL — Writes (COPY FROM STDIN)

Approach                        1M rows    vs Arrowjet
executemany (batch 1000)        ~16 min    850x slower
Multi-row VALUES (batch 1000)   8.4s       7.4x slower
Arrowjet                        1.13s      baseline

PostgreSQL — Reads (COPY TO STDOUT)

Approach             1M rows    vs Arrowjet
cursor.fetchall()    1.00s      1.5x slower
Arrowjet             0.65s      baseline

Benchmarked on RDS PostgreSQL 16.6 from an EC2 client in the same region.

Redshift — Writes (COPY via S3)

Approach                    1M rows       vs Arrowjet
write_dataframe() INSERT    13.4 hours    14,523x slower
executemany (batch 5000)    27.1 hours    29,296x slower
Manual COPY                 4.06s         1.22x slower
Arrowjet                    3.33s         baseline

Redshift — Reads (UNLOAD via S3)

Approach             1M rows    10M rows
cursor.fetchall()    ~11s       ~105s
Arrowjet             ~4s        ~34s

Read benchmark — 10M rows, 4-node ra3.large cluster

Write benchmark — 1M rows, 4-node ra3.large cluster


Quick Start — PostgreSQL

No S3 bucket, no IAM role, no staging config. Just a psycopg2 connection.

import arrowjet
import psycopg2

conn = psycopg2.connect(host="your-host", dbname="mydb", user="user", password="...")

engine = arrowjet.Engine(provider="postgresql")

# Bulk write — 850x faster than executemany
engine.write_dataframe(conn, my_dataframe, "target_table")

# Bulk read — 1.5x faster than cursor.fetchall()
result = engine.read_bulk(conn, "SELECT * FROM events")
df = result.to_pandas()

Works with any PostgreSQL: Aurora, RDS, self-hosted, Docker, Supabase, Neon.

Quick Start — MySQL

import arrowjet
import pymysql

conn = pymysql.connect(host="your-host", database="mydb", user="user",
                       password="...", local_infile=True)

engine = arrowjet.Engine(provider="mysql")

# Bulk write — LOAD DATA LOCAL INFILE (100x+ faster than INSERT)
engine.write_dataframe(conn, my_dataframe, "target_table")

# Read — cursor fetch → Arrow
result = engine.read_bulk(conn, "SELECT * FROM events")
df = result.to_pandas()

Works with any MySQL: Aurora MySQL, RDS MySQL, self-hosted, MariaDB, PlanetScale, TiDB.

Quick Start — Redshift

import arrowjet

conn = arrowjet.connect(
    host="your-cluster.region.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="...",
    staging_bucket="your-staging-bucket",
    staging_iam_role="arn:aws:iam::123456789:role/RedshiftS3Role",
    staging_region="us-east-1",
)

# Bulk read — UNLOAD → S3 → Parquet → Arrow
result = conn.read_bulk("SELECT * FROM events WHERE date > '2025-01-01'")
df = result.to_pandas()

# Bulk write — Arrow → Parquet → S3 → COPY
conn.write_dataframe(my_dataframe, "target_table")

# Safe mode — standard DBAPI for small queries
df = conn.fetch_dataframe("SELECT COUNT(*) FROM events")

Bring Your Own Connection

Already have connection management? Arrowjet works with your existing connections.

import arrowjet

# PostgreSQL — no staging config needed
pg_engine = arrowjet.Engine(provider="postgresql")
pg_engine.write_dataframe(existing_pg_conn, df, "my_table")
result = pg_engine.read_bulk(existing_pg_conn, "SELECT * FROM my_table")

# MySQL — no staging config needed
mysql_engine = arrowjet.Engine(provider="mysql")
mysql_engine.write_dataframe(existing_mysql_conn, df, "my_table")
result = mysql_engine.read_bulk(existing_mysql_conn, "SELECT * FROM my_table")

# Redshift — needs S3 staging config
rs_engine = arrowjet.Engine(
    provider="redshift",
    staging_bucket="your-bucket",
    staging_iam_role="arn:aws:iam::123:role/RedshiftS3Role",
    staging_region="us-east-1",
)
rs_engine.write_dataframe(existing_rs_conn, df, "my_table")
result = rs_engine.read_bulk(existing_rs_conn, "SELECT * FROM my_table")

Works with psycopg2, psycopg3, pymysql, redshift_connector, ADBC, or any DBAPI connection.


CLI

# PostgreSQL
arrowjet export --provider postgresql --query "SELECT * FROM users" --to ./users.parquet
arrowjet export --provider postgresql --query "SELECT * FROM users" --to ./users.csv --format csv

# Redshift
arrowjet export --query "SELECT * FROM sales" --to ./out.parquet
arrowjet export --query "SELECT * FROM sales" --to s3://bucket/sales/
arrowjet import --from s3://bucket/sales/ --to sales_table
arrowjet import --from ./data.parquet --to sales_table

# Inspect data
arrowjet preview --file ./out.parquet
arrowjet validate --table sales --row-count --schema --sample

# Configure profiles
arrowjet configure

All commands read connection details from ~/.arrowjet/config.yaml (set up with arrowjet configure). Override per-command with --host, --password, --profile, --provider, etc.
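A hypothetical sketch of what a profile-based config.yaml could look like; every key name here is an assumption that simply mirrors the connect() parameters and CLI flags shown elsewhere on this page, and arrowjet configure generates the real layout:

```yaml
# ~/.arrowjet/config.yaml (hypothetical layout; run `arrowjet configure`
# to generate the real file)
default_profile: local

profiles:
  local:
    provider: postgresql
    host: your-host
    database: mydb
    user: user
  warehouse:
    provider: redshift
    host: your-cluster.region.redshift.amazonaws.com
    database: dev
    user: awsuser
    staging_bucket: your-staging-bucket
    staging_iam_role: arn:aws:iam::123456789:role/RedshiftS3Role
    staging_region: us-east-1
```

A command such as arrowjet export --profile warehouse would then pick up the Redshift settings, with flags like --host overriding individual values.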

See docs/cli_reference.md for full details.


Authentication

PostgreSQL

Password auth via psycopg2 connection parameters. IAM database auth for Aurora/RDS coming soon.

Redshift

Three methods supported:

  • Password — default, standard credentials
  • IAM — auth_type="iam", temporary credentials via GetClusterCredentials (provisioned) or GetCredentials (serverless)
  • Secrets Manager — auth_type="secrets_manager", fetch credentials from a secret ARN
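A sketch of how the three modes map onto arrowjet.connect() keyword arguments. The auth_type values come from the list above; the secret_arn keyword name is an assumption (check docs/configuration.md for the actual parameter), and the connect() calls are commented out because they need a reachable cluster and AWS credentials.

```python
# Shared staging and host settings, as in the Quick Start above.
common = dict(
    host="your-cluster.region.redshift.amazonaws.com",
    database="dev",
    staging_bucket="your-staging-bucket",
    staging_iam_role="arn:aws:iam::123456789:role/RedshiftS3Role",
    staging_region="us-east-1",
)

# 1. Password auth (the default): plain user/password.
password_kwargs = dict(common, user="awsuser", password="...")

# 2. IAM auth: temporary credentials via GetClusterCredentials (provisioned)
#    or GetCredentials (serverless); no password needed.
iam_kwargs = dict(common, user="awsuser", auth_type="iam")

# 3. Secrets Manager: credentials fetched from a secret ARN.
#    NOTE: `secret_arn` is an assumed keyword name, not confirmed by the docs.
secrets_kwargs = dict(
    common,
    auth_type="secrets_manager",
    secret_arn="arn:aws:secretsmanager:us-east-1:123456789:secret:redshift-creds",
)

# conn = arrowjet.connect(**iam_kwargs)  # requires a live cluster
```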

See docs/configuration.md for the full reference.


Integrations

Airflow

@task
def export_from_postgres():
    import arrowjet, psycopg2
    conn = psycopg2.connect(host=..., dbname=..., ...)
    engine = arrowjet.Engine(provider="postgresql")
    result = engine.read_bulk(conn, "SELECT * FROM events")
    # write to S3, transform, etc.
    conn.close()

See examples/airflow/ for Redshift examples with before/after comparison.

dbt

bash examples/dbt/run_with_arrowjet.sh

See examples/dbt/ for the full setup.

SQLAlchemy (Redshift)

from sqlalchemy import create_engine
engine = create_engine("redshift+arrowjet://user:pass@host:5439/dev")

How It Works

Each database has a fast bulk path that most Python users don't know about:

Database     Slow path (what most people use)    Fast path (what Arrowjet uses)
PostgreSQL   executemany(), to_sql()             COPY FROM STDIN / COPY TO STDOUT
MySQL        executemany(), to_sql()             LOAD DATA LOCAL INFILE
Redshift     INSERT, write_dataframe()           COPY / UNLOAD via S3

Arrowjet wraps the fast path in a one-line API. There is no slow path.
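To make the PostgreSQL row concrete, here is a minimal sketch of the underlying technique (not Arrowjet's internals): serialize rows to an in-memory CSV buffer, then stream the whole buffer through COPY FROM STDIN in one round trip. The table name is illustrative, and the psycopg2 call is commented out because it needs a live connection.

```python
import csv
import io

def rows_to_csv_buffer(rows):
    """Serialize rows into an in-memory CSV buffer suitable for COPY FROM STDIN."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    return buf

rows = [(1, "alice"), (2, "bob")]
buf = rows_to_csv_buffer(rows)

# With a live psycopg2 connection, the buffer streams in a single COPY
# instead of one network round trip per row:
#
# with conn.cursor() as cur:
#     cur.copy_expert(
#         "COPY target_table (id, name) FROM STDIN WITH (FORMAT csv)", buf
#     )
# conn.commit()
```

This single-stream shape is what makes the fast path fast: the per-statement parsing and round-trip overhead of executemany() disappears entirely.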


Requirements

  • Python 3.10+
  • PostgreSQL: psycopg2 or psycopg2-binary (any PostgreSQL — Aurora, RDS, self-hosted)
  • MySQL: pymysql with local_infile=True (any MySQL — Aurora, RDS, MariaDB, self-hosted)
  • Redshift: pip install arrowjet[redshift] + S3 bucket + IAM role

See docs/iam_setup.md for Redshift IAM configuration.


Roadmap

  • Redshift (COPY/UNLOAD via S3) — shipped
  • PostgreSQL (COPY protocol) — shipped
  • MySQL (LOAD DATA LOCAL INFILE) — shipped
  • Cross-database transfer (arrowjet.transfer())
  • Snowflake (COPY INTO via stages)
  • Data validation (row counts, schema checks, null detection)

License

MIT
