datablade
datablade is a suite of functions that provides standard syntax across data engineering projects.
It is a small, single-machine Python toolkit for data engineers who need reliable “file → DataFrame/Parquet → SQL DDL” workflows.
It focuses on:
- Reading common file formats with memory-aware heuristics
- Streaming large files in chunks (without concatenating)
- Normalizing DataFrame columns for downstream systems
- Generating CREATE TABLE DDL across a small set of SQL dialects
- Producing bulk-load commands (and executing bcp for SQL Server)
What datablade Does
datablade helps data engineers:
- Load data efficiently from common file formats with automatic memory heuristics
- Standardize data cleaning with consistent column naming and type inference
- Apply first-class schema rules with full or partial column overrides before Parquet output
- Generate database schemas for multiple SQL dialects from DataFrames or Parquet schemas
- Handle datasets that don't fit in memory using chunked iteration and optional Polars acceleration
- Work across databases with cross-dialect DDL and bulk-load command generation
- Maintain data quality with built-in validation and logging
When to Use datablade
datablade is ideal for:
✅ ETL/ELT Pipelines - Building reproducible data ingestion workflows across multiple source formats
✅ Multi-Database Projects - Deploying the same schema to SQL Server, PostgreSQL, MySQL, or DuckDB
✅ Large File Processing - Streaming CSV/TSV/TXT/Parquet without concatenating
✅ Data Lake to Warehouse - Converting raw files to Parquet with optimized schemas
✅ Ad-hoc Data Analysis - Quickly exploring and preparing datasets with consistent patterns
✅ Legacy System Integration - Standardizing messy column names and data types from external sources
When datablade is not the right tool
- Real-time streaming ingestion (Kafka, Spark Structured Streaming)
- Distributed compute / cluster execution (Spark, Dask)
- Warehouse-native transformations and modeling (dbt)
- A full-featured schema migration tool (Alembic, Flyway)
- Direct database connectivity/transactions (datablade generates SQL; it does not manage connections)
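That last bullet is worth illustrating: datablade emits SQL text, and executing it is up to you and your driver. A minimal sketch with DuckDB (assuming a Dialect.DUCKDB member, consistent with the dialect list on this page; the duckdb package itself is not a datablade dependency):

import duckdb
import pandas as pd
from datablade.sql import Dialect, generate_create_table

df = pd.DataFrame({"customer_id": [1, 2], "amount": [9.99, 12.50]})

# datablade only produces the DDL string...
ddl = generate_create_table(df, table="orders", dialect=Dialect.DUCKDB)

# ...running it is your connection's job.
con = duckdb.connect("warehouse.duckdb")
con.execute(ddl)
con.close()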
Installation
pip install datablade
Optional dependencies:
# For high-performance file reading with Polars
pip install "datablade[performance]"
# For testing
pip install "datablade[test]"
# For development (includes testing + lint/format tooling)
pip install "datablade[dev]"
# All optional dependencies
pip install "datablade[all]"
Features
datablade provides four main modules:
📊 datablade.dataframes
DataFrame operations and transformations:
- Clean and normalize DataFrame columns
- Auto-detect and convert data types
- Apply SchemaConfig rules to cleaned column names before Parquet conversion
- Generate optimized Parquet schemas
- Convert pandas DataFrames to PyArrow tables
- Generate multi-dialect SQL DDL statements
- Memory-aware file reading with automatic chunking
- Polars integration for high-performance large file processing
- Partitioned Parquet writing for datasets that don't fit in memory
🌐 datablade.io
Input/output operations for external data:
- Fetch JSON data from URLs
- Download and extract ZIP files
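A minimal sketch of the io module. get_json also appears in the Quick Start; the ZIP helper's name is not spelled out on this page, so the commented call uses a hypothetical name (see python -m datablade.docs --show USAGE for the real signature):

from datablade.io import get_json

# Fetch a JSON endpoint into Python objects.
data = get_json("https://api.example.com/data.json")

# Hypothetical name for the ZIP download/extract helper:
# files = download_zip("https://example.com/archive.zip", extract_to="raw/")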
🛠️ datablade.utils
General utility functions:
- SQL name quoting
- Path standardization
- List flattening
- Configurable logging with Python logging module
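A short sketch of the utilities named elsewhere on this page (sql_quotename and configure_logging both appear in the Quick Start; the path and list helpers are covered in the bundled USAGE doc):

import logging

from datablade import configure_logging
from datablade.utils import sql_quotename

# Route datablade's log output to a file at INFO level.
configure_logging(level=logging.INFO, log_file="datablade.log")

# Quote an identifier for safe embedding in generated SQL.
print(sql_quotename("order details"))  # exact quoting style is library-defined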
🗄️ datablade.sql
Multi-dialect SQL utilities:
- Multi-dialect support: SQL Server, PostgreSQL, MySQL, DuckDB
- Dialect-aware identifier quoting
- CREATE TABLE generation for all dialects (from pandas DataFrames)
- CREATE TABLE generation from Parquet schemas (schema-only, via PyArrow)
- Optional schema_spec overrides for column types, nullability, and string sizing
- Bulk loading helpers:
  - SQL Server: executes bcp via subprocess
  - PostgreSQL/MySQL/DuckDB: returns command strings you can run in your environment
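The Quick Start below shows DDL generation for two dialects; assuming Dialect.MYSQL and Dialect.DUCKDB members to match the dialect list above, the same call fans out to all four (the bulk-load helper names are documented in the bundled USAGE doc):

import pandas as pd
from datablade.sql import Dialect, generate_create_table

df = pd.DataFrame({"id": [1], "name": ["a"], "amount": [1.5]})

# Assumed MYSQL/DUCKDB enum members, matching the supported-dialect list.
for dialect in (Dialect.SQLSERVER, Dialect.POSTGRES, Dialect.MYSQL, Dialect.DUCKDB):
    print(generate_create_table(df, table="my_table", dialect=dialect))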
Quick Start
from datablade import SchemaConfig, configure_logging, read_file_smart
from datablade.dataframes import clean_dataframe_columns, pandas_to_parquet_table
from datablade.io import get_json
from datablade.utils import sql_quotename
from datablade.sql import Dialect, generate_create_table, generate_create_table_from_parquet
# Configure logging
import logging
configure_logging(level=logging.INFO, log_file="datablade.log")
# Read a file into a single DataFrame (materializes)
# Use dtype="string" when you want the most lossless ingest path.
df = read_file_smart("large_dataset.csv", verbose=True, dtype="string")
# Clean DataFrame
df = clean_dataframe_columns(df, verbose=True)
# Apply full or partial schema rules before Parquet conversion
schema_config = SchemaConfig(
numeric_policy="float64",
columns={
"customer_id": "Int64",
"amount": "Float64",
"event_ts": "datetime64[ns, UTC]",
},
)
# Convert to Parquet
table = pandas_to_parquet_table(df, convert=True, schema_config=schema_config)
# Generate SQL DDL for multiple dialects
sql_sqlserver = generate_create_table(df, table='my_table', dialect=Dialect.SQLSERVER)
sql_postgres = generate_create_table(df, table='my_table', dialect=Dialect.POSTGRES)
# Generate SQL DDL directly from an existing Parquet schema (no data materialization)
# Note: nested Parquet types (struct/list/map/union) are dropped with a warning.
ddl_from_parquet = generate_create_table_from_parquet(
"events.parquet",
table="events",
dialect=Dialect.POSTGRES,
)
# Fetch JSON data
data = get_json('https://api.example.com/data.json')
SchemaConfig rules match cleaned column names. Keep numeric_policy="infer" for
the default inference behavior, use "float64" to normalize integer, float, and
numeric-looking text columns without an explicit rule to nullable Float64, or
use "string" to skip numeric inference for such columns. Use explicit
columns={...} rules when a specific column must always land on an exact dtype.
For truly lossless CSV ingestion, load text at read time with dtype="string"
as shown above.
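A minimal sketch of the "string" policy with partial overrides, complementing the "float64" example in the Quick Start:

from datablade import SchemaConfig

# Pin two columns to exact dtypes and skip numeric inference for
# everything else: un-pinned columns stay as strings.
schema_config = SchemaConfig(
    numeric_policy="string",
    columns={"customer_id": "Int64", "event_ts": "datetime64[ns, UTC]"},
)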
Most file path parameters accept str or pathlib.Path. To treat case mismatches
as errors on case-insensitive filesystems, use configure_paths(path_strict=True).
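For example (assuming configure_paths is a top-level export, like configure_logging):

from datablade import configure_paths

# Raise on casing mismatches even where the filesystem would not
# (Windows and default macOS volumes are case-insensitive).
configure_paths(path_strict=True)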
Memory-Aware File Reading
See the file format support matrix in the bundled USAGE doc:
python -m datablade.docs --show USAGE
from datablade.dataframes import (
excel_to_parquets,
read_file_chunked,
read_file_iter,
read_file_to_parquets,
stream_to_parquets,
)
# Read large files in chunks
for chunk in read_file_chunked('huge_file.csv', memory_fraction=0.5):
process(chunk)
# Stream without ever concatenating/materializing
for chunk in read_file_iter('huge_file.csv', memory_fraction=0.3, verbose=True):
process(chunk)
# Parquet is also supported for streaming (single .parquet files)
for chunk in read_file_iter('huge_file.parquet', memory_fraction=0.3, verbose=True):
process(chunk)
# Excel streaming is available with openpyxl installed (read-only mode)
for chunk in read_file_iter('large.xlsx', chunksize=25_000, verbose=True):
process(chunk)
# Partition large files to multiple Parquets
files = read_file_to_parquets(
'large_file.csv',
output_dir='partitioned/',
convert_types=True,
verbose=True
)
# Stream to Parquet partitions without materializing
files = stream_to_parquets(
'large_file.csv',
output_dir='partitioned_streamed/',
rows_per_file=200_000,
convert_types=True,
verbose=True,
)
# Excel streaming to Parquet partitions
files = excel_to_parquets(
'large.xlsx',
output_dir='partitioned_excel/',
rows_per_file=200_000,
convert_types=True,
verbose=True,
)
Blade (Optional Facade)
The canonical API is module-level functions (for example, datablade.dataframes.read_file_iter).
If you prefer an object-style entrypoint with shared defaults, you can use the optional Blade facade:
from datablade import Blade
from datablade.sql import Dialect
blade = Blade(memory_fraction=0.3, verbose=True, convert_types=True)
for chunk in blade.iter("huge.csv"):
process(chunk)
files = blade.stream_to_parquets("huge.csv", output_dir="partitioned/")
# Generate DDL (CREATE TABLE)
ddl = blade.create_table_sql(
df,
table="my_table",
dialect=Dialect.POSTGRES,
)
# Generate DDL from an existing Parquet file (schema-only)
ddl2 = blade.create_table_sql_from_parquet(
"events.parquet",
table="events",
dialect=Dialect.POSTGRES,
)
Documentation
Docs are bundled with the installed package:
python -m datablade.docs --list
python -m datablade.docs --show USAGE
python -m datablade.docs --write-dir .\datablade-docs
After writing docs to disk, open the markdown files locally:
- README (docs landing page)
- USAGE (file reading, streaming, SQL, IO, logging)
- TESTING (how to run tests locally)
- ARCHITECTURE (pipeline overview)
- OBJECT_REGISTRY (registry reference)
Testing
Run the test suite:
# Install with test dependencies
pip install -e ".[test]"
# Run all tests
pytest
# Run with coverage report
pytest --cov=datablade --cov-report=html
For detailed testing documentation, use the bundled TESTING doc:
python -m datablade.docs --show TESTING
Backward Compatibility
All functions are available through the legacy datablade.core module for backward compatibility:
# Legacy imports (still supported)
from datablade.core.frames import clean_dataframe_columns
from datablade.core.json import get
Requirements
Core dependencies:
- pandas
- pyarrow
- numpy
- openpyxl
- requests
Optional dependencies:
- polars (for high-performance file reading)
- psutil (for memory-aware operations)
- pytest (for testing)
Design choices and limitations
- Single-machine focus: datablade is designed for laptop/VM/server execution, not clusters.
- Streaming vs materializing:
  - Use read_file_iter() to process arbitrarily large files chunk-by-chunk.
  - read_file_smart() returns a single DataFrame and may still be memory-intensive.
- Chunk concatenation: the large-file pandas fallback in read_file_smart() can temporarily spike memory usage during concat. Use read_file_iter() or return_type="iterator" to avoid concatenation.
- Polars materialization: when returning a pandas DataFrame, Polars still collects into memory; use return_type="polars" or "polars_lazy" to keep Polars frames.
- Parquet support:
  - Streaming reads support single .parquet files.
  - Parquet “dataset directories” (Hive partitions / directory-of-parquets) are not a primary target API.
- Parquet → SQL DDL:
  - Uses the Parquet schema (PyArrow) without scanning data.
  - Complex/nested columns (struct/list/map/union) are dropped and logged as warnings.
- DDL scope: CREATE TABLE generation is column/type oriented (no indexes/constraints).
- SQL Server bulk load: the SQL Server helpers use the bcp CLI and require it to be installed and available on PATH. When using -U/-P, credentials are passed via process args (logs are redacted); prefer -T or -G where possible.
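A short sketch of matching the return type to your memory budget (return_type values as named above; the Polars variants need the [performance] extra):

from datablade import read_file_smart
from datablade.dataframes import read_file_iter

# Materialize everything; fine for files that fit comfortably in memory.
df = read_file_smart("medium.csv")

# Avoid the concat spike by iterating chunks instead.
for chunk in read_file_iter("huge.csv", memory_fraction=0.3):
    process(chunk)

# Keep the data in Polars and defer collection entirely.
lazy_frame = read_file_smart("huge.csv", return_type="polars_lazy")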
License
MIT