Csv/cdf Read and dUMP - Sync CSV, Parquet, and CDF science files into a PostgreSQL database
Welcome to Crump
Examines and syncs CSV, Parquet, and CDF files into PostgreSQL or SQLite databases in batches using easy-to-edit configuration files.
Overview
crump is a command-line tool and Python library for easily syncing CSV, Parquet, and CDF files to PostgreSQL or SQLite databases, and for extracting data from CDF files. It provides a declarative, configuration-based approach to data synchronization with automatic schema management.
Key Features
Data File Support
- CSV Support: Read and sync standard CSV files
- Native CDF Processing: Built-in support for Common Data Format (CDF) science files
- Automatic Extraction: Extracts CDF variables to CSV, Parquet, or directly to database
- Array Variable Handling: Automatically expands multi-dimensional array variables
- Apache Parquet Support: Built-in support for Apache Parquet files, which can be synced directly to the database
- Extract to Parquet: Convert CDF files to Parquet format with the --parquet flag
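As a rough illustration of what expanding an array variable means, the sketch below flattens a variable that holds one vector per record into one scalar column per element. The function names and the "name_index" column naming are assumptions made for this example only; crump's actual column names may differ.

```python
import csv
import io


def expand_array_variable(name, records):
    """Turn per-record vectors into per-element columns.

    `records` is a list of equal-length sequences, one per record.
    Column names such as "b_field_1" are a hypothetical convention.
    """
    if not records:
        return [], []
    width = len(records[0])
    if any(len(r) != width for r in records):
        raise ValueError("all records must have the same number of elements")
    header = [f"{name}_{i + 1}" for i in range(width)]
    rows = [list(r) for r in records]
    return header, rows


def to_csv_text(header, rows):
    """Render the expanded columns as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Two records of a 3-element vector become three scalar columns.
header, rows = expand_array_variable("b_field", [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)])
print(to_csv_text(header, rows))
```

Each vector element ends up in its own column, which is the shape a CSV file or a database table needs.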
Data Synchronization
- Configuration-Based: Examines your CSV files with the prepare command, and defines sync jobs in YAML with sensible column mappings
- Column Mapping: Sync all columns, rename them, or only sync a subset
- Automatic Table Creation: Creates target tables if they don't exist
- Schema Evolution: Automatically adds new columns as needed and never deletes existing columns. Optionally keeps a history of data changes in a history table.
- Index Management: Suggests and creates database indexes based on column types
- Dual Interface: Use as a CLI tool or import as a Python library
- Filename-Based Extraction: Extract values from filenames (dates, versions, etc.) and store in database columns
- Automatic Cleanup: Delete stale records based on extracted filename values
- Compound Primary Keys: Support for multi-column primary keys
- Dry-Run Mode: Preview all changes without modifying the database
- Idempotent Operations: Safe to run multiple times, uses upsert
- Rich Output: Beautiful terminal output with the Rich library
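To make the schema evolution and idempotent upsert ideas above concrete, here is a minimal sketch using only Python's standard sqlite3 module. It is not crump's implementation: the helper names ensure_columns and upsert_rows are hypothetical, and the SQL is one common way to get this behavior in SQLite (3.24 or newer for ON CONFLICT).

```python
import sqlite3


def ensure_columns(conn, table, columns):
    """Add any columns missing from `table`; existing columns are never dropped."""
    existing = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    for name, sql_type in columns.items():
        if name not in existing:
            conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{name}" {sql_type}')


def upsert_rows(conn, table, key, rows):
    """Insert rows, updating the non-key columns when the key already exists."""
    for row in rows:
        cols = list(row)
        col_list = ", ".join(f'"{c}"' for c in cols)
        placeholders = ", ".join("?" for _ in cols)
        updates = ", ".join(f'"{c}" = excluded."{c}"' for c in cols if c != key)
        action = f"DO UPDATE SET {updates}" if updates else "DO NOTHING"
        conn.execute(
            f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders}) '
            f'ON CONFLICT("{key}") {action}',
            [row[c] for c in cols],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
batch = [{"id": 1, "name": "Ada", "email": "ada@example.com"}]

ensure_columns(conn, "users", {"email": "TEXT"})  # new column appears in the file
upsert_rows(conn, "users", "id", batch)
upsert_rows(conn, "users", "id", batch)  # running again leaves one row

print(conn.execute("SELECT id, name, email FROM users").fetchall())
```

Running the same batch twice leaves a single row, which is what makes repeated syncs safe.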
Quick Example
uv pip install crump # or pip install crump if you prefer
# Create a configuration file
crump prepare users.csv --config crump_config.yml --job users_sync
# Look at the mapping it generated for you in crump_config.yml and edit as needed.
# Crump has mapped your columns and suggested keys and indexes
# Get ready to sync - your database must be available
export DATABASE_URL="sqlite:///test.db"
# Or for Postgres
# export DATABASE_URL="postgresql://user:pass@localhost:5432/mydb"
# Preview changes first (requires --db-url or DATABASE_URL)
crump sync users.csv --config crump_config.yml --job users_sync --dry-run
# Sync the file to database
crump sync users.csv --config crump_config.yml --job users_sync
# Later that day, v2 of the file arrives
# Sync the new file: old records from v1 are removed automatically, and updates are applied to rows that match on primary key
crump sync users_v2.csv --config crump_config.yml --job users_sync
Example Configuration
jobs:
  daily_sales:
    target_table: sales
    id_mapping:
      sale_id: id
    filename_to_column:
      template: "sales_[date].csv"
      columns:
        date:
          db_column: sync_date
          type: date
          use_to_delete_old_rows: true
    columns:
      product_id: product_id
      amount: amount
This configuration:
- Syncs sales_YYYY-MM-DD.csv files to the sales table
- Extracts the date from the filename and stores it in the sync_date column
- Automatically deletes stale records for the same date after sync
- Maps CSV columns to database columns
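The filename handling described above can be sketched in plain Python. This is an illustration under stated assumptions, not crump's code: template_to_regex, extract_values, and stale_ids are hypothetical helpers, and the stale-row rule shown (rows stored for the same date that the new file no longer contains) is one reading of the behavior described in this README.

```python
import re
from datetime import date


def template_to_regex(template):
    """Convert a template such as "sales_[date].csv" into a compiled regex."""
    # Splitting on a capture group alternates literal text and placeholder names.
    parts = re.split(r"\[(\w+)\]", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)  # literal text
        else:
            pattern += f"(?P<{part}>.+?)"  # named placeholder
    return re.compile(f"^{pattern}$")


def extract_values(template, filename):
    """Return the placeholder values found in `filename`."""
    match = template_to_regex(template).match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not match template {template!r}")
    return match.groupdict()


def stale_ids(existing_rows, synced_ids, sync_date):
    """IDs stored for `sync_date` that the newly synced file no longer contains."""
    return [
        row_id
        for row_id, row_date in existing_rows
        if row_date == sync_date and row_id not in synced_ids
    ]


values = extract_values("sales_[date].csv", "sales_2024-01-15.csv")
sync_date = date.fromisoformat(values["date"])

# (id, sync_date) pairs already in the table; only ids 1 and 2 are in the new file.
existing = [(1, date(2024, 1, 15)), (3, date(2024, 1, 15)), (9, date(2024, 1, 14))]
print(sync_date, stale_ids(existing, {1, 2}, sync_date))
```

Rows from other dates are left alone; only rows tagged with the extracted date are candidates for deletion.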
Documentation
- Installation Guide - Install crump
- Quick Start - Get started in 5 minutes
- Configuration - YAML configuration reference
- CLI Reference - Command-line documentation
- Features - Detailed feature documentation
- API Reference - Python API documentation
- Development - Contributing guide
Programmatic Usage
from pathlib import Path
from crump import sync_csv_to_db, CrumpConfig
# Load configuration
config = CrumpConfig.from_yaml(Path("crump_config.yml"))
job = config.get_job("my_job")
# Sync CSV to database (PostgreSQL or SQLite)
rows_synced = sync_csv_to_db(
    csv_path=Path("data.csv"),
    job=job,
    db_connection_string="postgresql://localhost/mydb",
)
print(f"Synced {rows_synced} rows")
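When calling the library from a script, it can be convenient to fall back to the same DATABASE_URL environment variable that the CLI examples use instead of hard-coding the connection string. The helper below is a hypothetical sketch, not part of crump; the precedence it applies (an explicit value first, then the environment) is an assumption.

```python
import os


def resolve_db_url(explicit=None, env=None):
    """Return the explicit URL if given, otherwise DATABASE_URL from the environment."""
    env = os.environ if env is None else env
    url = explicit or env.get("DATABASE_URL")
    if not url:
        raise RuntimeError("Set DATABASE_URL or pass a connection string explicitly")
    return url


# The `env` argument lets the lookup be exercised without touching os.environ.
print(resolve_db_url(env={"DATABASE_URL": "sqlite:///test.db"}))
```

The returned string could then be passed as db_connection_string in the call shown above.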
Development
# Clone repository
git clone https://github.com/alastairtree/crump.git
cd crump
# Install with development dependencies
uv sync --all-extras
# Run tests
uv run pytest -v
# Generate documentation locally
./generate-docs.sh
See the Development Guide for detailed instructions.
Contributing
Contributions are welcome! Please see the Contributing Guide for details.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
Acknowledgments