Load pandas DataFrames into Amazon Redshift without S3 - using the somewhat-efficient multi-row INSERT statement AWS Redshift offers

These details have not been verified by PyPI

Project links

Project description

Frameshift

Load pandas DataFrames directly into Amazon Redshift — no S3 required.

Look, we all know COPY from S3 is the right way to load data into Redshift. It's fast, it's highly parallel, it's what Amazon designed. But sometimes life doesn't cooperate:

Your VPC has no S3 access (yep. thanks security team)
You're behind an air-gapped network
Someone forgot to give you the S3 credentials
You just need to load 50K rows ONE TIME and don't want to spin up an entire S3 pipeline

Enter Frameshift: the "I just need to get this data into Redshift (and I'm in no hurry)" solution.

The Honest Truth

Frameshift is 10-20x slower than COPY. There, we said it.

Even tried being clever with 16 parallel threads, mimicking Redshift's MD5 hash distribution pattern. Turns out Redshift's leader node just laughs at your multi-threading ambitions and processes everything sequentially anyway. We settled on 4 threads which helps somewhat, but don't expect miracles.

TEST RESULTS (8 RPU Redshift Serverless

Method	Time (250K rows)	Rows/sec
Frameshift	~120-200s	1,200-2,000
COPY from S3	~10s	25,000+

Yeah. It's slow AF. But when you can't use S3, you can't use S3.

Installation

pip install frameshift

Quick Start

import pandas as pd
from frameshift import FrameShift

df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice Cooper', 'Bob Rock', 'Chuck Biscuits'],
    'royalties': [1000000, 75000000, 5550]
})

with FrameShift(
    host='cluster.region.redshift.amazonaws.com',
    database='mydb',
    user='admin',
    password='secret'
) as fs:
    result = fs.load(df, 'users')
    print(f"Loaded {result.rows_loaded} rows")

That's it. Frameshift handles:

Creating the table if it doesn't exist (because you're lazy)
Inferring column types from pandas (magical!)
Chunking data to maximally fit Redshift's 16MB Multi-Row-Insert statement limit (boring but necessary)
Running 4 parallel connections (marginally faster than 1, we tried harder)

Options You Might Actually Use

Replace vs Append

# Nuke and pave (drop table, recreate)
fs.load(df, 'users', if_exists='replace')

# Just add more data (default)
fs.load(df, 'users', if_exists='append')

Distribution Keys (if you care about performance)

fs.load(
    df,
    'events',
    distkey='user_id',      # JOINs go brrr
    sortkey='event_time',   # Range queries go zoom
)

Progress Tracking (for the impatient)

def progress(done, total, chunk):
    print(f"{done}/{total} rows ({100*done/total:.0f}%)")

fs.load(df, 'large_table', progress_callback=progress)

Advanced Configs (You Probably Don't Need This)

from frameshift import FrameShift, FrameShiftConfig

config = FrameShiftConfig(
    parallel_threads=4,     # Even 4 is pushing it - 2 threads is prob the practical max
    batch_size=5000,        # Optional: force rows-per-INSERT (default: auto-calculated)
)

fs = FrameShift(host=..., config=config)

Chunk sizing is automatic by default - Frameshift samples ~100 rows to estimate average row size, then calculates the optimal rows-per-INSERT to stay safely under 16MB. Use batch_size only if you want to override this.

More Advanced Use Cases

For those who read documentation all the way through, here's more stuff you can do.

Check for Duplicates Before Loading

# Validate before you regret
validation = fs.validate_unique_key(df, 'order_id')
if not validation.is_unique:
    print(f"Uh oh: {validation.duplicate_count} duplicates found")
    print(validation.sample_duplicates)  # Show the offenders

# Or just let load() yell at you
fs.load(df, 'orders', unique_key='order_id', validate_unique=True)

Analyze Your DISTKEY Before You Commit

Will your data be evenly distributed, or will one slice get 90% of the rows? Find out before Redshift performance tanks:

# Check a single column
analysis = fs.analyze_distribution(df, 'customer_id', slice_count=16)
print(f"Skew ratio: {analysis.skew_ratio:.2f}x")  # 1.0 = perfect, 10.0 = yikes
print(f"Good DISTKEY? {analysis.is_good_distkey()}")

# Compare candidates (because you're indecisive)
comparison = fs.compare_distkeys(df, ['user_id', 'region', 'status'])
print(comparison)  # Shows which column has the best distribution

Find Natural Keys (Let the Computer Do the Work you Dont Feel Like Doing)

Don't know which columns are unique? Let Frameshift figure it out:

natural_keys = fs.find_natural_keys(df, max_columns=3)
for columns, unique_count in natural_keys:
    print(f"{' + '.join(columns)}: {unique_count} unique")
# Output might show: order_id: 5000 unique
#                    customer_id + order_date: 5000 unique

Preview SQL Without Running It (Trust Issues?)

# Method 1: generate_sql() for the paranoid
statements = fs.generate_sql(df, 'users', include_create=True)
for stmt in statements:
    print(stmt)  # Review before committing

# Method 2: dry_run config for the extra paranoid
config = FrameShiftConfig(dry_run=True)
fs = FrameShift(host=..., config=config)
result = fs.load(df, 'users')
print(result.sql_statements)  # See what WOULD have happened

Get Schema Recommendations (Lazy Mode)

Let Frameshift analyze your data and tell you what to do:

recs = fs.get_recommendations(df, 'transactions')
print(f"Best DISTKEY: {recs['distkey']['column']}")
print(f"Best SORTKEY: {recs['sortkey']['columns']}")
print(recs['sql'])  # Full CREATE TABLE statement

Estimate Before Loading ("Measure Once, Load Twice"....wait, is that right?)

estimates = fs.estimate_load(df)
print(f"Estimated chunks: {estimates['estimated_chunks']}")
print(f"Estimated size: {estimates['estimated_total_size_bytes'] / 1024 / 1024:.1f} MB")

if estimates['estimated_chunks'] > 100:
    print("This is gonna take a while. Coffee time.")

Custom Column Types (Control Freak Edition!)

Override the inferred types when Frameshift guesses wrong:

from frameshift.types import ColumnSpec, RedshiftType

custom_columns = [
    ColumnSpec('id', RedshiftType.INTEGER, nullable=False),
    ColumnSpec('description', RedshiftType.VARCHAR, length=4096),
    ColumnSpec('price', RedshiftType.DECIMAL, precision=10, scale=2),
]

fs.load(df, 'products', column_specs=custom_columns)

Error Handling (Things Will Go Wrong)

from frameshift import FrameShiftConfig

# Skip bad chunks and keep going (YOLO mode)
config = FrameShiftConfig(on_error='skip')

# Or log errors but continue (slightly responsible YOLO)
config = FrameShiftConfig(on_error='log')

# Or abort immediately (default, for the risk-averse)
config = FrameShiftConfig(on_error='abort')

Multiple Connection Methods

# Direct params (most common)
fs = FrameShift(host='...', database='...', user='...', password='...')

# Reuse existing connection (for the efficiency-minded)
import psycopg2
conn = psycopg2.connect(...)
fs = FrameShift(connection=conn)

# SQLAlchemy connection string (for the SQLAlchemy fans)
fs = FrameShift(connection_string='redshift+psycopg2://user:pass@host:5439/db')

# Use Amazon's official driver (for compliance checkboxes)
fs = FrameShift(host='...', driver='redshift-connector')

When NOT to Use Frameshift

Seriously, if you have S3 access:

Production ETL — Use COPY
More than 100K rows — Use COPY
Scheduled jobs — Use COPY
Anything important — Use COPY

Motto: Frameshift is for when you can't use COPY. Not when you don't want to use COPY.

License

MIT (I may not have gotten accepted by these guys, but hey, happy to use their license!)

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Dec 16, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

frameshift-0.1.0.tar.gz (45.2 kB view details)

Uploaded Dec 16, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

frameshift-0.1.0-py3-none-any.whl (38.5 kB view details)

Uploaded Dec 16, 2025 Python 3

File details

Details for the file frameshift-0.1.0.tar.gz.

File metadata

Download URL: frameshift-0.1.0.tar.gz
Upload date: Dec 16, 2025
Size: 45.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for frameshift-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`2d7f44c2b9eb5f3f2c0703797282cb318baae5860d16662353c1b6a23a1da415`
MD5	`690b4c47f9c5368365e3c6649ebd6923`
BLAKE2b-256	`162494109286e9e02d419ef684ffc1236ccecf933d6581da50c89b2863ddbd5f`

See more details on using hashes here.

File details

Details for the file frameshift-0.1.0-py3-none-any.whl.

File metadata

Download URL: frameshift-0.1.0-py3-none-any.whl
Upload date: Dec 16, 2025
Size: 38.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for frameshift-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f0dd4ac6cbbd32df22af1bf7ec0a43d86484b1b79b4e88ad60b6a869292b73f0`
MD5	`967b4d2e5f677e7af93c2986c9971ab3`
BLAKE2b-256	`f74bcd5a8e29c721b0562209cd3399032c357c73d453c6232873653f6df3ae77`

See more details on using hashes here.

frameshift 0.1.0

Navigation

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Project description

Frameshift

The Honest Truth

Installation

Quick Start

Options You Might Actually Use

Replace vs Append

Distribution Keys (if you care about performance)

Progress Tracking (for the impatient)

Advanced Configs (You Probably Don't Need This)

More Advanced Use Cases

Check for Duplicates Before Loading

Analyze Your DISTKEY Before You Commit

Find Natural Keys (Let the Computer Do the Work you Dont Feel Like Doing)

Preview SQL Without Running It (Trust Issues?)

Get Schema Recommendations (Lazy Mode)

Estimate Before Loading ("Measure Once, Load Twice"....wait, is that right?)

Custom Column Types (Control Freak Edition!)

Error Handling (Things Will Go Wrong)

Multiple Connection Methods

When NOT to Use Frameshift

License

Project details

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes