# Welcome To Datagrunt

Read CSV files and convert them to other file formats easily.
Datagrunt is a Python library designed to simplify the way you work with CSV files. It provides a streamlined approach to reading, processing, and transforming your data into various formats, making data manipulation efficient and intuitive.
## Why Datagrunt?
Born out of real-world frustration, Datagrunt eliminates the need for repetitive coding when handling CSV files. Whether you're a data analyst, data engineer, or data scientist, Datagrunt empowers you to focus on insights, not tedious data wrangling.
## What Datagrunt Is Not
Datagrunt is not an extension of, or a replacement for, DuckDB, Polars, or PyArrow, nor is it a comprehensive data processing solution. Instead, it is designed to simplify the way you work with CSV files and to solve the pain point of inferring delimiters when a file's structure is unknown. Datagrunt provides an easy way to convert CSV files to dataframes and export them to various formats, and one of its core value propositions is simplicity and ease of use.
## Key Features
- Intelligent Delimiter Inference: Datagrunt automatically detects and applies the correct delimiter for your CSV files.
- Multiple Processing Engines: Choose from three powerful engines - DuckDB, Polars, and PyArrow - to handle your data processing needs.
- Flexible Data Transformation: Easily convert your processed CSV data into various formats including CSV, Excel, JSON, JSONL, and Parquet.
- AI-Powered Schema Analysis: Use Google's Gemini models to automatically generate detailed schema reports for your CSV files, including data types, column classifications, and data quality checks.
- Pythonic API: Enjoy a clean and intuitive API that integrates seamlessly into your existing Python workflows.
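Delimiter inference is exactly the kind of chore Datagrunt automates away. As a rough illustration of the underlying idea (this is *not* Datagrunt's actual implementation), Python's standard-library `csv.Sniffer` can guess a delimiter from a sample of the file:

```python
import csv
import io

# A pipe-delimited file whose structure we pretend not to know
raw = "city|vin|model_year\nSeattle|5YJ3E1EA1J|2018\nTacoma|1N4AZ0CP5D|2013\n"

# Sniff the dialect from a sample of the text, restricted to common delimiters
dialect = csv.Sniffer().sniff(raw, delimiters=',;|\t')
print(dialect.delimiter)  # '|'

# Read the rows using the inferred dialect
rows = list(csv.reader(io.StringIO(raw), dialect))
print(rows[0])  # ['city', 'vin', 'model_year']
```

Datagrunt performs this kind of detection for you automatically when a `CSVReader` is created, so you never have to specify the delimiter yourself.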
## Powertools Under The Hood
| Tool | Description |
|---|---|
| DuckDB | Fast in-process analytical database with excellent SQL support |
| Polars | Multi-threaded DataFrame library written in Rust, optimized for performance |
| PyArrow | Python bindings for Apache Arrow with efficient columnar data processing |
| Google Gemini | A powerful family of generative AI models for schema analysis |
## Installation

You can get started with Datagrunt in seconds using either UV (recommended) or pip.

With UV:

```shell
uv pip install datagrunt
```

With pip:

```shell
pip install datagrunt
```
## Quick Start

### Reading CSV Files with Multiple Engine Options

```python
from datagrunt import CSVReader

# Load your CSV file with different engines
csv_file = 'electric_vehicle_population_data.csv'

# Choose your engine: 'polars' (default), 'duckdb', or 'pyarrow'
reader_polars = CSVReader(csv_file, engine='polars')    # Default - fast DataFrame ops
reader_duckdb = CSVReader(csv_file, engine='duckdb')    # Best for SQL queries
reader_pyarrow = CSVReader(csv_file, engine='pyarrow')  # Arrow ecosystem integration

# Get a sample of the data
reader_duckdb.get_sample()
```
### DuckDB Integration for Performant SQL Queries

```python
from datagrunt import CSVReader

# Set up the DuckDB engine for SQL capabilities
dg = CSVReader('electric_vehicle_population_data.csv', engine='duckdb')

# Construct your SQL query using the auto-generated table name
query = f"""
WITH core AS (
    SELECT
        City AS city,
        "VIN (1-10)" AS vin
    FROM {dg.db_table}
)
SELECT
    city,
    COUNT(vin) AS vehicle_count
FROM core
GROUP BY 1
ORDER BY 2 DESC
"""

# Execute the query and get the results as a Polars DataFrame
df = dg.query_data(query).pl()
print(df)
```
### Exporting Data to Multiple Formats

```python
from datagrunt import CSVWriter

# Create a writer with your preferred engine
writer = CSVWriter('input.csv', engine='duckdb')  # Default for exports

# Export to various formats
writer.write_csv('output.csv')          # Clean CSV export
writer.write_excel('output.xlsx')       # Excel workbook
writer.write_json('output.json')        # JSON format
writer.write_parquet('output.parquet')  # Parquet for analytics

# Use the PyArrow engine for optimized Parquet exports
writer_arrow = CSVWriter('input.csv', engine='pyarrow')
writer_arrow.write_parquet('optimized.parquet')  # Native Arrow Parquet
```
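Among the supported formats is JSONL (newline-delimited JSON), which stores one JSON object per line. As a plain-stdlib sketch of what such an export conceptually boils down to (Datagrunt's writers handle this for you, along with type handling and delimiter detection):

```python
import csv
import io
import json

# A small in-memory CSV standing in for a real input file
csv_text = "city,vehicle_count\nSeattle,42\nTacoma,7\n"

# Convert each row dict to one JSON object per line (JSONL)
lines = [json.dumps(row) for row in csv.DictReader(io.StringIO(csv_text))]
jsonl = "\n".join(lines)
print(jsonl)
# {"city": "Seattle", "vehicle_count": "42"}
# {"city": "Tacoma", "vehicle_count": "7"}
```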
### AI-Powered Schema Analysis

```python
import os

from datagrunt import CSVSchemaReportAIGenerated

# Generate detailed schema reports with AI
api_key = os.environ.get("GEMINI_API_KEY")

schema_analyzer = CSVSchemaReportAIGenerated(
    filepath='your_data.csv',
    engine='google',
    api_key=api_key
)

# Get a comprehensive schema analysis
report = schema_analyzer.generate_csv_schema_report(
    model='gemini-2.5-flash',
    return_json=True
)
print(report)  # Detailed JSON schema with data types, classifications, and more
```
## Engine Comparison
| Feature | Polars | DuckDB | PyArrow |
|---|---|---|---|
| Best for | DataFrame operations | SQL queries & analytics | Arrow ecosystem integration |
| Performance | Fast in-memory processing | Excellent for large datasets | Optimized columnar operations |
| Default for | CSVReader | CSVWriter | - |
| Export Quality | Good | Excellent (especially JSON) | Native Parquet support |
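One way to internalize the table is as a task-to-engine lookup. The helper below is purely illustrative (the task names and function are made up for this sketch and are not part of Datagrunt's API); the string it returns is what you would pass as the `engine` argument to `CSVReader` or `CSVWriter`:

```python
# Hypothetical helper (not part of Datagrunt) encoding the comparison table above
ENGINE_FOR_TASK = {
    'dataframe_ops': 'polars',       # fast in-memory DataFrame operations
    'sql_queries': 'duckdb',         # SQL queries and analytics on large datasets
    'arrow_integration': 'pyarrow',  # Arrow ecosystem / native Parquet support
}

def pick_engine(task: str) -> str:
    """Return an engine name for a task, defaulting to 'polars' (CSVReader's default)."""
    return ENGINE_FOR_TASK.get(task, 'polars')

print(pick_engine('sql_queries'))  # duckdb
print(pick_engine('unknown'))      # polars
```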
## Primary Classes

- `CSVReader`: Read and process CSV files with intelligent delimiter detection
- `CSVWriter`: Export CSV data to multiple formats (CSV, Excel, JSON, Parquet)
- `CSVSchemaReportAIGenerated`: Generate AI-powered schema analysis reports
## Full Documentation
For complete documentation, detailed examples, and advanced usage patterns, see: 📖 Complete Documentation
## License

This project is licensed under the MIT License.
## Acknowledgements
A HUGE thank you to the open source community and the creators of DuckDB, Polars, and PyArrow for their fantastic libraries that power Datagrunt.
## Source Repository