datablade is a suite of functions to provide standard syntax across data engineering projects.

datablade

datablade is a small, single-machine Python toolkit for data engineers who need reliable “file → DataFrame/Parquet → SQL DDL” workflows.

It focuses on:

Reading common file formats with memory-aware heuristics
Streaming large files in chunks (without concatenating)
Normalizing DataFrame columns for downstream systems
Generating CREATE TABLE DDL across a small set of SQL dialects
Producing bulk-load commands (and executing BCP for SQL Server)

What datablade Does

datablade helps data engineers:

Load data efficiently from common file formats with automatic memory heuristics
Standardize data cleaning with consistent column naming and type inference
Generate database schemas for multiple SQL dialects from DataFrames or Parquet schemas
Handle datasets that don't fit in memory using chunked iteration and optional Polars acceleration
Work across databases with cross-dialect DDL and bulk-load command generation
Maintain data quality with built-in validation and logging

When to Use datablade

datablade is ideal for:

✅ ETL/ELT Pipelines - Building reproducible data ingestion workflows across multiple source formats

✅ Multi-Database Projects - Deploying the same schema to SQL Server, PostgreSQL, MySQL, or DuckDB

✅ Large File Processing - Streaming CSV/TSV/TXT/Parquet without concatenating

✅ Data Lake to Warehouse - Converting raw files to Parquet with optimized schemas

✅ Ad-hoc Data Analysis - Quickly exploring and preparing datasets with consistent patterns

✅ Legacy System Integration - Standardizing messy column names and data types from external sources

When datablade is not the right tool

Real-time streaming ingestion (Kafka, Spark Structured Streaming)
Distributed compute / cluster execution (Spark, Dask)
Warehouse-native transformations and modeling (dbt)
A full-featured schema migration tool (Alembic, Flyway)
Direct database connectivity/transactions (datablade generates SQL; it does not manage connections)

Installation

pip install git+https://github.com/brentwc/data-prep.git

Optional dependencies:

# For high-performance file reading with Polars
pip install "datablade[performance] @ git+https://github.com/brentwc/data-prep.git"

# For development and testing
pip install "datablade[dev] @ git+https://github.com/brentwc/data-prep.git"

# All optional dependencies
pip install "datablade[all] @ git+https://github.com/brentwc/data-prep.git"
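
After installing, it can be useful to confirm which optional packages are actually importable. The following is a minimal sketch using only the standard library; the helper names `has_module` and `optional_status` are illustrative and are not part of datablade.

```python
from importlib.util import find_spec


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return find_spec(name) is not None


# Optional packages named in this README (hypothetical check list).
OPTIONAL = ("polars", "psutil")


def optional_status() -> dict[str, bool]:
    """Map each optional package name to whether it is importable."""
    return {name: has_module(name) for name in OPTIONAL}


if __name__ == "__main__":
    for name, ok in optional_status().items():
        print(f"{name}: {'available' if ok else 'missing'}")
```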

Features

datablade provides four main modules:

📊 `datablade.dataframes`

DataFrame operations and transformations:

Clean and normalize DataFrame columns
Auto-detect and convert data types
Generate optimized Parquet schemas
Convert pandas DataFrames to PyArrow tables
Generate multi-dialect SQL DDL statements
Memory-aware file reading with automatic chunking
Polars integration for high-performance large file processing
Partitioned Parquet writing for datasets that don't fit in memory
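
To make "clean and normalize DataFrame columns" concrete, here is a rough sketch of one possible normalization rule set, written against plain strings so it runs without pandas. The functions `normalize_column_name` and `normalize_columns` are hypothetical, and the rules (lower-casing, underscores, numeric suffixes for collisions) are assumptions rather than datablade's documented behavior.

```python
import re


def normalize_column_name(name: str) -> str:
    """Lower-case, replace runs of non-alphanumerics with '_', trim underscores."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).strip()).strip("_").lower()
    if not cleaned:
        cleaned = "column"  # fallback for names with no usable characters
    if cleaned[0].isdigit():
        cleaned = f"_{cleaned}"  # many SQL dialects dislike a leading digit
    return cleaned


def normalize_columns(names):
    """Normalize names; repeated results get a numeric suffix (_2, _3, ...)."""
    seen = {}
    result = []
    for raw in names:
        base = normalize_column_name(raw)
        count = seen.get(base, 0)
        seen[base] = count + 1
        result.append(base if count == 0 else f"{base}_{count + 1}")
    return result


print(normalize_columns(["Order ID", "order-id", " Total ($)", "2024 Sales"]))
```

With a DataFrame, the same idea could be applied as `df.columns = normalize_columns(df.columns)`.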

🌐 `datablade.io`

Input/output operations for external data:

Fetch JSON data from URLs
Download and extract ZIP files
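
The extraction half of "download and extract ZIP files" can be sketched with the standard library alone. This is not datablade's implementation; `extract_zip_bytes` is a hypothetical helper, and the download step is left out so the example stays runnable offline.

```python
import io
import zipfile
from pathlib import Path


def extract_zip_bytes(payload: bytes, dest: Path) -> list[Path]:
    """Extract an in-memory ZIP archive into dest, rejecting unsafe member paths."""
    dest = dest.resolve()
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        for member in archive.infolist():
            target = (dest / member.filename).resolve()
            # Refuse entries that would land outside dest (e.g. "../evil.txt").
            if not target.is_relative_to(dest):
                raise ValueError(f"unsafe path in archive: {member.filename}")
            archive.extract(member, dest)
            if not member.is_dir():
                extracted.append(target)
    return extracted
```

In practice the payload might come from an HTTP response body, for example `requests.get(url).content`.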

🛠️ `datablade.utils`

General utility functions:

SQL name quoting
Path standardization
List flattening
Configurable logging with Python logging module
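
As an illustration of what SQL name quoting involves, the sketch below wraps an identifier in the delimiter each dialect uses and doubles any embedded closing delimiter. `quote_identifier` and its string dialect names are assumptions made for this example; the signature of datablade's own `sql_quotename` is not shown in this README.

```python
def quote_identifier(name: str, dialect: str = "sqlserver") -> str:
    """Quote an identifier, escaping the closing delimiter by doubling it."""
    if not name:
        raise ValueError("identifier must be non-empty")
    dialect = dialect.lower()
    if dialect == "sqlserver":
        return "[" + name.replace("]", "]]") + "]"
    if dialect == "mysql":
        return "`" + name.replace("`", "``") + "`"
    if dialect in ("postgres", "duckdb"):
        return '"' + name.replace('"', '""') + '"'
    raise ValueError(f"unsupported dialect: {dialect}")


print(quote_identifier("order]id"))              # SQL Server brackets
print(quote_identifier("my table", "postgres"))  # ANSI double quotes
```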

🗄️ `datablade.sql`

Multi-dialect SQL utilities:

Multi-dialect support: SQL Server, PostgreSQL, MySQL, DuckDB
Dialect-aware identifier quoting
CREATE TABLE generation for all dialects (from pandas DataFrames)
CREATE TABLE generation from Parquet schemas (schema-only, via PyArrow)
Bulk loading helpers:
- SQL Server: executes bcp via subprocess
- PostgreSQL/MySQL/DuckDB: returns command strings you can run in your environment
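
The difference between "executes bcp" and "returns command strings" can be sketched as follows. Both builders are hypothetical and only assemble arguments; they assume a comma-delimited file, Windows trusted authentication for `bcp` (`-T`), and the `psql` client's `\copy` meta-command for PostgreSQL. They do not reflect datablade's actual function names or defaults.

```python
def build_bcp_args(table, data_file, server, database, delimiter=",", has_header=True):
    """Build an argument list for the SQL Server bcp utility (import direction)."""
    args = [
        "bcp", table, "in", data_file,
        "-S", server,        # server name
        "-d", database,      # database name
        "-T",                # trusted (integrated) authentication
        "-c",                # character data mode
        f"-t{delimiter}",    # field terminator
    ]
    if has_header:
        args += ["-F", "2"]  # start at row 2 to skip the header line
    return args


def build_psql_copy(table, data_file, has_header=True):
    """Build a psql \\copy command string for a CSV file."""
    header = "true" if has_header else "false"
    path = data_file.replace("'", "''")  # escape single quotes in the path
    return f"\\copy {table} FROM '{path}' WITH (FORMAT csv, HEADER {header})"


print(build_bcp_args("dbo.events", "events.csv", "localhost", "analytics"))
print(build_psql_copy("events", "events.csv"))
```

An argument list like the first one could be passed to `subprocess.run(args, check=True)`, while the second string is meant to be run from a `psql` session.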

Quick Start

import logging

import pandas as pd

from datablade import configure_logging, read_file_smart
from datablade.dataframes import clean_dataframe_columns, pandas_to_parquet_table
from datablade.io import get_json
from datablade.sql import Dialect, generate_create_table, generate_create_table_from_parquet
from datablade.utils import sql_quotename

# Configure logging
configure_logging(level=logging.INFO, log_file="datablade.log")

# Read a file into a single DataFrame (materializes)
df = read_file_smart('large_dataset.csv', verbose=True)

# Clean DataFrame
df = clean_dataframe_columns(df, verbose=True)

# Convert to Parquet
table = pandas_to_parquet_table(df, convert=True)

# Generate SQL DDL for multiple dialects
sql_sqlserver = generate_create_table(df, table='my_table', dialect=Dialect.SQLSERVER)
sql_postgres = generate_create_table(df, table='my_table', dialect=Dialect.POSTGRES)

# Generate SQL DDL directly from an existing Parquet schema (no data materialization)
# Note: nested Parquet types (struct/list/map/union) are dropped with a warning.
ddl_from_parquet = generate_create_table_from_parquet(
    "events.parquet",
    table="events",
    dialect=Dialect.POSTGRES,
)

# Fetch JSON data
data = get_json('https://api.example.com/data.json')
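
To show roughly what column/type-oriented DDL generation involves, here is a small sketch that maps pandas-style dtype names to SQL types for two dialects. The `TYPE_MAP` table and the `create_table_sql` function are illustrative assumptions; datablade's real type mapping may differ.

```python
# Illustrative dtype-to-SQL mapping; not datablade's actual table.
TYPE_MAP = {
    "sqlserver": {
        "int64": "BIGINT", "float64": "FLOAT", "bool": "BIT",
        "datetime64[ns]": "DATETIME2", "object": "NVARCHAR(MAX)",
    },
    "postgres": {
        "int64": "BIGINT", "float64": "DOUBLE PRECISION", "bool": "BOOLEAN",
        "datetime64[ns]": "TIMESTAMP", "object": "TEXT",
    },
}
QUOTES = {"sqlserver": ("[", "]"), "postgres": ('"', '"')}


def create_table_sql(columns: dict[str, str], table: str, dialect: str) -> str:
    """Render CREATE TABLE from {column name: dtype name} for one dialect."""
    open_q, close_q = QUOTES[dialect]
    types = TYPE_MAP[dialect]

    def q(name: str) -> str:
        # Double the closing delimiter so it cannot end the identifier early.
        return f"{open_q}{name.replace(close_q, close_q * 2)}{close_q}"

    lines = [f"    {q(col)} {types[dtype]}" for col, dtype in columns.items()]
    return f"CREATE TABLE {q(table)} (\n" + ",\n".join(lines) + "\n);"


print(create_table_sql({"id": "int64", "name": "object"}, "my_table", "postgres"))
```

With pandas, the `columns` argument could be built as `{c: str(t) for c, t in df.dtypes.items()}`.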

Memory-Aware File Reading

from datablade.dataframes import read_file_chunked, read_file_iter, read_file_to_parquets, stream_to_parquets

def process(chunk):
    """Placeholder for your own per-chunk logic."""
    print(len(chunk))

# Read large files in chunks
for chunk in read_file_chunked('huge_file.csv', memory_fraction=0.5):
    process(chunk)

# Stream without ever concatenating/materializing
for chunk in read_file_iter('huge_file.csv', memory_fraction=0.3, verbose=True):
    process(chunk)

# Parquet is also supported for streaming (single .parquet files)
for chunk in read_file_iter('huge_file.parquet', memory_fraction=0.3, verbose=True):
    process(chunk)

# Partition large files to multiple Parquets
files = read_file_to_parquets(
    'large_file.csv',
    output_dir='partitioned/',
    convert_types=True,
    verbose=True
)

# Stream to Parquet partitions without materializing
files = stream_to_parquets(
    'large_file.csv',
    output_dir='partitioned_streamed/',
    rows_per_file=200_000,
    convert_types=True,
    verbose=True,
)
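
The idea of streaming without concatenating can be illustrated with a generator that holds only one chunk in memory at a time. This sketch uses the standard `csv` module rather than pandas or Polars, and `iter_csv_chunks` is a hypothetical name, not a datablade function.

```python
import csv
from itertools import islice
from typing import Iterator


def iter_csv_chunks(path: str, rows_per_chunk: int = 50_000) -> Iterator[list[dict]]:
    """Yield lists of row dicts; earlier chunks can be freed once processed."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while True:
            # islice pulls at most rows_per_chunk rows from the open reader.
            chunk = list(islice(reader, rows_per_chunk))
            if not chunk:
                return
            yield chunk
```

Because each chunk is yielded and then discarded by the caller, peak memory depends on `rows_per_chunk` rather than on the file size.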

Blade (Optional Facade)

The canonical API is module-level functions (for example, datablade.dataframes.read_file_iter).

If you prefer an object-style entrypoint with shared defaults, you can use the optional Blade facade:

from datablade import Blade
from datablade.sql import Dialect

blade = Blade(memory_fraction=0.3, verbose=True, convert_types=True)

# process() stands in for your own per-chunk function
for chunk in blade.iter("huge.csv"):
    process(chunk)

files = blade.stream_to_parquets("huge.csv", output_dir="partitioned/")

# Generate DDL (CREATE TABLE)
# df is a pandas DataFrame loaded earlier (see Quick Start)
ddl = blade.create_table_sql(
    df,
    table="my_table",
    dialect=Dialect.POSTGRES,
)

# Generate DDL from an existing Parquet file (schema-only)
ddl2 = blade.create_table_sql_from_parquet(
    "events.parquet",
    table="events",
    dialect=Dialect.POSTGRES,
)
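
The facade idea, an object that stores shared defaults and forwards them to module-level functions, can be sketched in a few lines. Everything here is hypothetical: `read_iter` is a stand-in function and `Facade` is not datablade's `Blade` class.

```python
from dataclasses import asdict, dataclass


def read_iter(path, memory_fraction=0.5, verbose=False):
    """Stand-in for a module-level function; returns the arguments it received."""
    return {"path": path, "memory_fraction": memory_fraction, "verbose": verbose}


@dataclass(frozen=True)
class Facade:
    """Holds shared defaults and forwards them to module-level functions."""

    memory_fraction: float = 0.5
    verbose: bool = False

    def iter(self, path, **overrides):
        # Per-call keyword arguments take precedence over the stored defaults.
        options = {**asdict(self), **overrides}
        return read_iter(path, **options)


facade = Facade(memory_fraction=0.3, verbose=True)
print(facade.iter("huge.csv"))
print(facade.iter("huge.csv", verbose=False))
```

Keeping the module-level functions canonical means the facade adds no behavior of its own, only convenience.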

Documentation

Docs Home - Documentation landing page
Usage Guide - File reading (including streaming), SQL, IO, logging
Testing Guide - How to run tests locally
Test Suite - Testing documentation and coverage

Testing

Run the test suite:

# Install with test dependencies
pip install -e ".[test]"

# Run all tests
pytest

# Run with coverage report
pytest --cov=datablade --cov-report=html

See tests/README.md for detailed testing documentation.

Backward Compatibility

All functions are available through the legacy datablade.core module for backward compatibility:

# Legacy imports (still supported)
from datablade.core.frames import clean_dataframe_columns
from datablade.core.json import get

Requirements

Core dependencies:

pandas
pyarrow
numpy
openpyxl
requests

Optional dependencies:

polars (for high-performance file reading)
psutil (for memory-aware operations)
pytest (for testing)

Design choices and limitations

Single-machine focus: datablade is designed for laptop/VM/server execution, not clusters.
Streaming vs materializing:
- Use read_file_iter() to process arbitrarily large files chunk-by-chunk.
- read_file_smart() returns a single DataFrame and may still be memory-intensive.
Parquet support:
- Streaming reads support single .parquet files.
- Parquet “dataset directories” (Hive partitions / directory-of-parquets) are not a primary target API.
Parquet → SQL DDL:
- Uses the Parquet schema (PyArrow) without scanning data.
- Complex/nested columns (struct/list/map/union) are dropped and logged as warnings.
DDL scope: CREATE TABLE generation is column/type oriented (no indexes/constraints).
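
The streaming guidance under Design choices and limitations relies on choosing a chunk size that fits in memory. One plausible way a `memory_fraction` setting could translate into a row count is sketched below. The formula and the `rows_per_chunk` helper are assumptions for illustration; datablade's actual heuristic is not documented here.

```python
def rows_per_chunk(
    available_bytes: int,
    bytes_per_row: int,
    memory_fraction: float = 0.5,
    min_rows: int = 1_000,
) -> int:
    """Estimate how many rows fit in a fraction of the available memory."""
    if not 0 < memory_fraction <= 1:
        raise ValueError("memory_fraction must be in (0, 1]")
    if bytes_per_row <= 0:
        raise ValueError("bytes_per_row must be positive")
    budget = int(available_bytes * memory_fraction)
    # Never go below min_rows, so tiny budgets still make progress.
    return max(min_rows, budget // bytes_per_row)


# 8 GiB available, about 512 bytes per row, half of memory allowed.
print(rows_per_chunk(8 * 1024**3, 512, memory_fraction=0.5))
```

The available-memory figure could come from `psutil.virtual_memory().available` when psutil is installed.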

License

MIT

Release history

0.0.8 - Mar 16, 2026
0.0.7 - Mar 16, 2026
0.0.6 - Feb 5, 2026
0.0.5 - Dec 30, 2025 (this version)
0.0.0 - Dec 20, 2024

Download files

Source Distribution

datablade-0.0.5.tar.gz (45.5 kB), uploaded Dec 30, 2025, Source

Built Distribution

datablade-0.0.5-py3-none-any.whl (35.0 kB), uploaded Dec 30, 2025, Python 3

File details

Details for the file datablade-0.0.5.tar.gz.

File metadata

Download URL: datablade-0.0.5.tar.gz
Upload date: Dec 30, 2025
Size: 45.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for datablade-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`76fb1f46c0371f6a1ff8281f26fd808afa8bb8723cddc48736aa6969f8ae997b`
MD5	`7d62603df3e567d5c06a70601f9a4435`
BLAKE2b-256	`2daeb57af897dcd546f616b0c3c5c67258debd58c0acd43c2a225f2bc25129be`

Provenance

The following attestation bundles were made for datablade-0.0.5.tar.gz:

Publisher: publish.yml on brentwc/data-prep

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datablade-0.0.5.tar.gz
- Subject digest: 76fb1f46c0371f6a1ff8281f26fd808afa8bb8723cddc48736aa6969f8ae997b
- Sigstore transparency entry: 782418575
- Sigstore integration time: Dec 30, 2025
Source repository:
- Permalink: brentwc/data-prep@29444649e1598dba852fbaa2b8fd96722bfd4ba9
- Branch / Tag: refs/heads/main
- Owner: https://github.com/brentwc
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@29444649e1598dba852fbaa2b8fd96722bfd4ba9
- Trigger Event: workflow_dispatch

File details

Details for the file datablade-0.0.5-py3-none-any.whl.

File metadata

Download URL: datablade-0.0.5-py3-none-any.whl
Upload date: Dec 30, 2025
Size: 35.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for datablade-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`df4ca1db3e8769b3c7325eda8246324f0feaffb0645f9aea516b95c34e916f28`
MD5	`a0410c3c671f15aaf254a51f878d79c6`
BLAKE2b-256	`3aa6ab42e837326d1adc9a2cc33d963b62caf7c64d1accd7c84fd49ffd61b199`

Provenance

The following attestation bundles were made for datablade-0.0.5-py3-none-any.whl:

Publisher: publish.yml on brentwc/data-prep

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datablade-0.0.5-py3-none-any.whl
- Subject digest: df4ca1db3e8769b3c7325eda8246324f0feaffb0645f9aea516b95c34e916f28
- Sigstore transparency entry: 782418579
- Sigstore integration time: Dec 30, 2025
Source repository:
- Permalink: brentwc/data-prep@29444649e1598dba852fbaa2b8fd96722bfd4ba9
- Branch / Tag: refs/heads/main
- Owner: https://github.com/brentwc
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@29444649e1598dba852fbaa2b8fd96722bfd4ba9
- Trigger Event: workflow_dispatch
