Arrowjet
The fastest way to move data in and out of cloud databases.
Arrowjet wraps each database's native bulk path in a simple Python API — no boilerplate, no S3 scripts, no slow INSERT loops.
pip install arrowjet # core (PostgreSQL COPY, MySQL LOAD DATA, BYOC Engine, CLI)
pip install arrowjet[redshift] # + Redshift driver (arrowjet.connect())
pip install arrowjet[full] # + Redshift + PostgreSQL + MySQL + SQLAlchemy
Supported databases:
- PostgreSQL / Aurora PostgreSQL / RDS PostgreSQL — via COPY protocol
- MySQL / Aurora MySQL / RDS MySQL / MariaDB — via LOAD DATA LOCAL INFILE
- Amazon Redshift — via COPY/UNLOAD through S3
Why Arrowjet
Standard database drivers move data row-by-row. For large datasets, this is the bottleneck — not the database.
Arrowjet uses each database's native bulk path instead. There is no slow path.
PostgreSQL — Writes (COPY FROM STDIN)
| Approach | 1M rows | vs Arrowjet |
|---|---|---|
| executemany (batch 1000) | ~16 min | 850x slower |
| Multi-row VALUES (batch 1000) | 8.4s | 7.4x slower |
| Arrowjet | 1.13s | baseline |
PostgreSQL — Reads (COPY TO STDOUT)
| Approach | 1M rows | vs Arrowjet |
|---|---|---|
| cursor.fetchall() | 1.00s | 1.5x slower |
| Arrowjet | 0.65s | baseline |
Benchmarked on RDS PostgreSQL 16.6 from an EC2 client in the same region.
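These numbers are straightforward to re-check on your own instance. A minimal timing sketch, assuming a pandas DataFrame df whose columns match a table named events_copy; the Engine calls are the same ones shown in the Quick Start below:

import time
import arrowjet
import psycopg2

conn = psycopg2.connect(host="your-host", dbname="mydb", user="user", password="...")
engine = arrowjet.Engine(provider="postgresql")

# Slow path: row-by-row INSERTs through the driver
rows = list(df.itertuples(index=False, name=None))
placeholders = ", ".join(["%s"] * len(df.columns))
start = time.perf_counter()
with conn.cursor() as cur:
    cur.executemany(f"INSERT INTO events_copy VALUES ({placeholders})", rows)
conn.commit()
print(f"executemany: {time.perf_counter() - start:.2f}s")

# Fast path: COPY FROM STDIN via Arrowjet (truncate the table first for a fair comparison)
start = time.perf_counter()
engine.write_dataframe(conn, df, "events_copy")
print(f"arrowjet: {time.perf_counter() - start:.2f}s")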
Redshift — Writes (COPY via S3)
| Approach | 1M rows | vs Arrowjet |
|---|---|---|
| write_dataframe() INSERT | 13.4 hours | 14,523x slower |
| executemany (batch 5000) | 27.1 hours | 29,296x slower |
| Manual COPY | 4.06s | 1.22x slower |
| Arrowjet | 3.33s | baseline |
Redshift — Reads (UNLOAD via S3)
| Approach | 1M rows | 10M rows |
|---|---|---|
| cursor.fetchall() | ~11s | ~105s |
| Arrowjet | ~4s | ~34s |
Quick Start — PostgreSQL
No S3 bucket, no IAM role, no staging config. Just a psycopg2 connection.
import arrowjet
import psycopg2
conn = psycopg2.connect(host="your-host", dbname="mydb", user="user", password="...")
engine = arrowjet.Engine(provider="postgresql")
# Bulk write — 850x faster than executemany
engine.write_dataframe(conn, my_dataframe, "target_table")
# Bulk read — 1.5x faster than cursor.fetchall()
result = engine.read_bulk(conn, "SELECT * FROM events")
df = result.to_pandas()
Works with any PostgreSQL: Aurora, RDS, self-hosted, Docker, Supabase, Neon.
Quick Start — MySQL
import arrowjet
import pymysql
conn = pymysql.connect(host="your-host", database="mydb", user="user",
                       password="...", local_infile=True)
engine = arrowjet.Engine(provider="mysql")
# Bulk write — LOAD DATA LOCAL INFILE (100x+ faster than INSERT)
engine.write_dataframe(conn, my_dataframe, "target_table")
# Read — cursor fetch → Arrow
result = engine.read_bulk(conn, "SELECT * FROM events")
df = result.to_pandas()
Works with any MySQL: Aurora MySQL, RDS MySQL, self-hosted, MariaDB, PlanetScale, TiDB.
Quick Start — Redshift
import arrowjet
conn = arrowjet.connect(
    host="your-cluster.region.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="...",
    staging_bucket="your-staging-bucket",
    staging_iam_role="arn:aws:iam::123456789:role/RedshiftS3Role",
    staging_region="us-east-1",
)
# Bulk read — UNLOAD → S3 → Parquet → Arrow
result = conn.read_bulk("SELECT * FROM events WHERE date > '2025-01-01'")
df = result.to_pandas()
# Bulk write — Arrow → Parquet → S3 → COPY
conn.write_dataframe(my_dataframe, "target_table")
# Safe mode — standard DBAPI for small queries
df = conn.fetch_dataframe("SELECT COUNT(*) FROM events")
Bring Your Own Connection
Already have connection management? Arrowjet works with your existing connections.
import arrowjet
# PostgreSQL — no staging config needed
pg_engine = arrowjet.Engine(provider="postgresql")
pg_engine.write_dataframe(existing_pg_conn, df, "my_table")
result = pg_engine.read_bulk(existing_pg_conn, "SELECT * FROM my_table")
# MySQL — no staging config needed
mysql_engine = arrowjet.Engine(provider="mysql")
mysql_engine.write_dataframe(existing_mysql_conn, df, "my_table")
result = mysql_engine.read_bulk(existing_mysql_conn, "SELECT * FROM my_table")
# Redshift — needs S3 staging config
rs_engine = arrowjet.Engine(
    provider="redshift",
    staging_bucket="your-bucket",
    staging_iam_role="arn:aws:iam::123:role/RedshiftS3Role",
    staging_region="us-east-1",
)
rs_engine.write_dataframe(existing_rs_conn, df, "my_table")
result = rs_engine.read_bulk(existing_rs_conn, "SELECT * FROM my_table")
Works with psycopg2, psycopg3, pymysql, redshift_connector, ADBC, or any DBAPI connection.
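For example, if your PostgreSQL connections are pooled by SQLAlchemy, you can hand the engine the underlying DBAPI connection. A minimal sketch, assuming a psycopg2-backed SQLAlchemy engine and a pandas DataFrame df; raw_connection() is SQLAlchemy's API, not Arrowjet's:

import arrowjet
from sqlalchemy import create_engine

sa_engine = create_engine("postgresql+psycopg2://user:pass@your-host/mydb")
pg_engine = arrowjet.Engine(provider="postgresql")

# Borrow the raw psycopg2 connection from SQLAlchemy's pool
raw_conn = sa_engine.raw_connection()
try:
    pg_engine.write_dataframe(raw_conn, df, "my_table")
    result = pg_engine.read_bulk(raw_conn, "SELECT * FROM my_table")
finally:
    raw_conn.close()  # returns the connection to the pool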
CLI
# PostgreSQL
arrowjet export --provider postgresql --query "SELECT * FROM users" --to ./users.parquet
arrowjet export --provider postgresql --query "SELECT * FROM users" --to ./users.csv --format csv
# Redshift
arrowjet export --query "SELECT * FROM sales" --to ./out.parquet
arrowjet export --query "SELECT * FROM sales" --to s3://bucket/sales/
arrowjet import --from s3://bucket/sales/ --to sales_table
arrowjet import --from ./data.parquet --to sales_table
# Inspect data
arrowjet preview --file ./out.parquet
arrowjet validate --table sales --row-count --schema --sample
# Configure profiles
arrowjet configure
All commands read connection details from ~/.arrowjet/config.yaml (set up with arrowjet configure).
Override per-command with --host, --password, --profile, --provider, etc.
See docs/cli_reference.md for full details.
Authentication
PostgreSQL
Password auth via psycopg2 connection parameters. IAM database auth for Aurora/RDS coming soon.
Redshift
Three methods supported:
- Password — default, standard credentials
- IAM — auth_type="iam", temporary credentials via GetClusterCredentials (provisioned) or GetCredentials (serverless); see the sketch below
- Secrets Manager — auth_type="secrets_manager", fetch credentials from a secret ARN
See docs/configuration.md for the full reference.
Integrations
Airflow
from airflow.decorators import task

@task
def export_from_postgres():
    import arrowjet, psycopg2
    conn = psycopg2.connect(host=..., dbname=..., ...)
    engine = arrowjet.Engine(provider="postgresql")
    result = engine.read_bulk(conn, "SELECT * FROM events")
    # write to S3, transform, etc.
    conn.close()
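A companion Redshift load task for the same DAG file, sketched from the connect and write_dataframe calls shown in the Quick Start; the staging values are placeholders and the Parquet handoff is only illustrative:

@task
def load_to_redshift(parquet_path: str):
    import arrowjet
    import pyarrow.parquet as pq
    conn = arrowjet.connect(
        host="your-cluster.region.redshift.amazonaws.com",
        database="dev",
        user="awsuser",
        password="...",
        staging_bucket="your-staging-bucket",
        staging_iam_role="arn:aws:iam::123456789:role/RedshiftS3Role",
        staging_region="us-east-1",
    )
    # Arrow -> Parquet -> S3 -> COPY, the same bulk path as the Quick Start
    conn.write_dataframe(pq.read_table(parquet_path).to_pandas(), "target_table")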
See examples/airflow/ for Redshift examples with a before/after comparison.
dbt
bash examples/dbt/run_with_arrowjet.sh
See examples/dbt/ for the full setup.
SQLAlchemy (Redshift)
from sqlalchemy import create_engine
engine = create_engine("redshift+arrowjet://user:pass@host:5439/dev")
How It Works
Each database has a fast bulk path that most Python users don't know about:
| Database | Slow path (what most people use) | Fast path (what arrowjet uses) |
|---|---|---|
| PostgreSQL | executemany(), to_sql() | COPY FROM STDIN / COPY TO STDOUT |
| MySQL | executemany(), to_sql() | LOAD DATA LOCAL INFILE |
| Redshift | INSERT, write_dataframe() | COPY / UNLOAD via S3 |
Arrowjet wraps the fast path in a one-line API. There is no slow path.
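For reference, this is roughly what the PostgreSQL fast path looks like when driven by hand with psycopg2's copy_expert, assuming a pandas DataFrame df; the CSV framing, NULL handling, and type mapping are the boilerplate Arrowjet takes care of:

import io
import psycopg2

conn = psycopg2.connect(host="your-host", dbname="mydb", user="user", password="...")

# Serialize the DataFrame to CSV in memory, then stream it through COPY
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
buf.seek(0)
with conn.cursor() as cur:
    cur.copy_expert("COPY target_table FROM STDIN WITH (FORMAT csv)", buf)
conn.commit()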
Requirements
- Python 3.10+
- PostgreSQL: psycopg2 or psycopg2-binary (any PostgreSQL — Aurora, RDS, self-hosted)
- MySQL: pymysql with local_infile=True (any MySQL — Aurora, RDS, MariaDB, self-hosted); see the server-side check below
- Redshift: pip install arrowjet[redshift] + S3 bucket + IAM role
See docs/iam_setup.md for Redshift IAM configuration.
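As a quick sanity check for the MySQL requirement above, the server-side local_infile variable must also be ON; a minimal probe with pymysql:

import pymysql

conn = pymysql.connect(host="your-host", database="mydb", user="user",
                       password="...", local_infile=True)
with conn.cursor() as cur:
    # LOAD DATA LOCAL INFILE requires local_infile to be enabled on the server
    cur.execute("SHOW GLOBAL VARIABLES LIKE 'local_infile'")
    print(cur.fetchone())  # e.g. ('local_infile', 'ON')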
Roadmap
- Redshift (COPY/UNLOAD via S3)
- PostgreSQL (COPY protocol)
- MySQL (LOAD DATA LOCAL INFILE)
- Cross-database transfer (arrowjet.transfer())
- Snowflake (COPY INTO via stages)
- Data validation (row counts, schema checks, null detection)
License
MIT