Read CSV files and convert to other file formats easily


Welcome To Datagrunt

Datagrunt is a Python library designed to simplify the way you work with CSV and PDF files. It provides a streamlined approach to reading, processing, and transforming your data into various formats, making data manipulation efficient and intuitive.

Why Datagrunt?

Born out of real-world frustration, Datagrunt eliminates the need for repetitive coding when handling CSV and PDF files. Whether you're a data analyst, data engineer, or data scientist, Datagrunt empowers you to focus on insights, not tedious data wrangling.

What Datagrunt Is Not

Datagrunt is not an extension of or a replacement for DuckDB, Polars, or PyArrow, nor is it a comprehensive data processing solution. Instead, it's designed to simplify the way you work with CSV and PDF files — solving the pain point of inferring delimiters when a CSV structure is unknown, and turning PDFs into structured, queryable data. Datagrunt provides an easy way to convert CSV files to dataframes and export them to various formats, and to extract text, tables, and images from PDFs. One of Datagrunt's value propositions is its relative simplicity and ease of use.

Key Features

Intelligent Delimiter Inference: Datagrunt automatically detects and applies the correct delimiter for your CSV files.
Path Object Support: Full support for both string paths and pathlib.Path objects for modern, cross-platform file handling.
Multiple Processing Engines: Choose from three powerful engines - DuckDB, Polars, and PyArrow - to handle your data processing needs.
Flexible Data Transformation: Easily convert your processed CSV data into various formats including CSV, Excel, JSON, JSONL, and Parquet.
PDF Parsing & OCR: Extract text, tables, and images from PDF files as dicts, DataFrames, or JSON, with optional Tesseract OCR for scanned pages. Powered by the permissively-licensed PDFium engine by default, with PyMuPDF available as an alternative.
AI-Powered Schema Analysis: Use Google's Gemini models to automatically generate detailed schema reports for your CSV files, including data types, column classifications, and data quality checks.
Pythonic API: Enjoy a clean and intuitive API that integrates seamlessly into your existing Python workflows.
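
To illustrate what delimiter inference involves, here is a minimal sketch built on the standard library's csv.Sniffer. It shows the general technique only and is not Datagrunt's implementation; the helper name guess_delimiter and the candidate list are assumptions made for this example.

```python
import csv


def guess_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter of a CSV text sample (hypothetical helper)."""
    # csv.Sniffer compares character frequencies across lines and raises
    # csv.Error when it cannot find a consistent delimiter.
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# A small semicolon-delimited sample (made-up data).
sample = "make;model;year\nTesla;Model 3;2021\nNissan;Leaf;2019\n"
delimiter = guess_delimiter(sample)

# Parse the sample with the inferred delimiter.
rows = list(csv.reader(sample.splitlines(), delimiter=delimiter))
print(delimiter, rows[1])
```

Sniffing works from a text sample, so in practice one would typically read only the first few kilobytes of a file before guessing.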

Powertools Under The Hood

Tool	Description
DuckDB	Fast in-process analytical database with excellent SQL support
Polars	Multi-threaded DataFrame library written in Rust, optimized for performance
PyArrow	Python bindings for Apache Arrow with efficient columnar data processing
Google Gemini	A powerful family of generative AI models for schema analysis
PDFium	Default PDF engine (via `pypdfium2`) — permissively licensed (BSD-3 / Apache-2.0); fast text + image extraction, with a structured mode at parity with PyMuPDF
pdfplumber	Table detection and extraction (MIT), shared by both PDF engines
PyMuPDF	Alternative PDF engine for text, tables, and images (AGPL-3.0 / commercial)
Tesseract	OCR for scanned/image-only pages (optional)

Installation

We recommend using uv as the default package manager.

To install Datagrunt using uv:

uv pip install datagrunt

PDF parsing is an optional extra — install it with uv pip install "datagrunt[pdf]". See PDF parsing below for details and OCR setup.

Quick Start

Reading CSV Files with Multiple Engine Options

from datagrunt import CSVReader
from pathlib import Path

# Load your CSV file with different engines
# Accepts both string paths and Path objects
csv_file = 'electric_vehicle_population_data.csv'
csv_path = Path('electric_vehicle_population_data.csv')

# Choose your engine: 'polars' (default), 'duckdb', or 'pyarrow'
reader_polars = CSVReader(csv_file, engine='polars')    # String path - fast DataFrame ops
reader_duckdb = CSVReader(csv_path, engine='duckdb')    # Path object - best for SQL queries
reader_pyarrow = CSVReader(csv_file, engine='pyarrow')  # Arrow ecosystem integration

# Get a sample of the data
reader_duckdb.get_sample()

DuckDB Integration for Performant SQL Queries

from datagrunt import CSVReader

# Set up DuckDB engine for SQL capabilities
dg = CSVReader('electric_vehicle_population_data.csv', engine='duckdb')

# Construct your SQL query using the auto-generated table name
query = f"""
WITH core AS (
    SELECT
        City AS city,
        "VIN (1-10)" AS vin
    FROM {dg.db_table}
)
SELECT
    city,
    COUNT(vin) AS vehicle_count
FROM core
GROUP BY 1
ORDER BY 2 DESC
"""

# Execute the query and get results as a Polars DataFrame
df = dg.query_data(query).pl()
print(df)
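
To see what this query computes without the real dataset, the sketch below runs the same CTE and aggregation over three made-up rows. It uses the standard library's sqlite3 as a stand-in for DuckDB, so it is not Datagrunt code; the table name ev and the sample rows are assumptions for illustration.

```python
import sqlite3

# Made-up rows standing in for the electric vehicle dataset.
rows = [("Seattle", "VIN001"), ("Seattle", "VIN002"), ("Tacoma", "VIN003")]

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE ev (City TEXT, "VIN (1-10)" TEXT)')
con.executemany("INSERT INTO ev VALUES (?, ?)", rows)

# Same shape as the query above: rename columns in a CTE, then count per city.
query = """
WITH core AS (
    SELECT
        City AS city,
        "VIN (1-10)" AS vin
    FROM ev
)
SELECT
    city,
    COUNT(vin) AS vehicle_count
FROM core
GROUP BY 1
ORDER BY 2 DESC
"""

result = con.execute(query).fetchall()
print(result)
con.close()
```

The column name "VIN (1-10)" needs double quotes because it contains spaces and parentheses; the CTE gives it a simpler alias for the rest of the query.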

Exporting Data to Multiple Formats

from datagrunt import CSVWriter
from pathlib import Path

# Create writer with your preferred engine (accepts both strings and Path objects)
input_file = Path('input.csv')
writer = CSVWriter(input_file, engine='duckdb')  # Default for exports

# Export to various formats
writer.write_csv('output.csv')          # Clean CSV export
writer.write_excel('output.xlsx')       # Excel workbook
writer.write_json('output.json')        # JSON format
writer.write_parquet('output.parquet')  # Parquet for analytics

# Use PyArrow engine for optimized Parquet exports
writer_arrow = CSVWriter('input.csv', engine='pyarrow')  # String path also works
writer_arrow.write_parquet('optimized.parquet')  # Native Arrow Parquet
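
The feature list mentions both JSON and JSONL output. As a reminder of how the two layouts differ, here is a small sketch using only the standard library; it does not reproduce Datagrunt's writers, and the sample data is made up.

```python
import csv
import io
import json

# Made-up CSV content parsed into a list of dict records.
csv_text = "city,vehicle_count\nSeattle,2\nTacoma,1\n"
records = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: a single document holding an array of records.
json_text = json.dumps(records)

# JSONL: one JSON object per line, which suits streaming and appending.
jsonl_text = "\n".join(json.dumps(record) for record in records)

print(json_text)
print(jsonl_text)
```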

AI-Powered Schema Analysis

from datagrunt import CSVSchemaReportAIGenerated
from pathlib import Path
import os

# Generate detailed schema reports with AI (accepts both strings and Path objects)
api_key = os.environ.get("GEMINI_API_KEY")
data_file = Path('your_data.csv')

schema_analyzer = CSVSchemaReportAIGenerated(
    filepath=data_file,  # Path object works seamlessly
    engine='google',
    api_key=api_key
)

# Get comprehensive schema analysis
report = schema_analyzer.generate_csv_schema_report(
    model='gemini-2.5-flash',
    return_json=True
)

print(report)  # Detailed JSON schema with data types, classifications, and more
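
Note that os.environ.get returns None when the variable is unset, so a missing key would only surface later as an API error. One way to fail early is sketched below; require_env is a hypothetical helper for this example, not part of Datagrunt.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return value


# Usage (commented out so the sketch runs without a real key):
# api_key = require_env("GEMINI_API_KEY")
```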

PDF parsing

PDF support is an optional extra:

uv pip install "datagrunt[pdf]"

OCR of scanned pages additionally requires the Tesseract system binary. Native-text PDFs, tables, and embedded images work without it.

macOS: for example, brew install tesseract
Debian/Ubuntu: for example, apt-get install tesseract-ocr
Windows: Tesseract runs natively (no WSL needed) via the UB-Mannheim installer or a package manager (winget install UB-Mannheim.TesseractOCR, choco install tesseract, or scoop install tesseract). After installing, either add the Tesseract directory to your PATH or point pytesseract at it with pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe".
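
A quick way to check whether the Tesseract binary is reachable before relying on OCR is sketched below. It uses only the standard library; find_tesseract is a hypothetical helper, and the fallback path is simply the Windows example path quoted above, which may differ on your machine.

```python
import shutil
from pathlib import Path
from typing import Optional

# Example Windows install path quoted in the text above (an assumption).
WINDOWS_EXAMPLE_PATH = Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")


def find_tesseract() -> Optional[str]:
    """Return a path to the tesseract binary, or None if it is not found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    if WINDOWS_EXAMPLE_PATH.is_file():
        return str(WINDOWS_EXAMPLE_PATH)
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found; OCR of scanned pages will not be available.")
else:
    print(f"Using Tesseract at {tesseract_path}")
    # With pytesseract installed, it can be pointed at the binary:
    # import pytesseract
    # pytesseract.pytesseract.tesseract_cmd = tesseract_path
```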

from datagrunt import PDFReader, PDFWriter

# Parse a PDF into the unified document structure (PDFium engine by default).
reader = PDFReader("report.pdf")
document = reader.to_dicts()           # {"document": {"pages": [...]}}
df = reader.to_dataframe()             # one row per extracted element

# Write JSON and extract embedded images to disk.
writer = PDFWriter("report.pdf")
writer.write_json("report.json", image_output_dir="report_images")
writer.extract_images(output_dir="report_images")

Choosing a PDF engine

PDFReader and PDFWriter accept an engine argument:

from datagrunt import PDFReader

# Default: PDFium, permissively licensed (BSD-3 / Apache-2.0).
reader = PDFReader("report.pdf")                      # engine="pdfium"

# Lean, fast mode: text + positioned text objects + images, no table detection.
reader = PDFReader("report.pdf", native=True)

# Alternative engine: PyMuPDF (AGPL-3.0 / commercial).
reader = PDFReader("report.pdf", engine="pymupdf")

Both engines emit the same unified element schema by default, so output is interchangeable. PDFium is the default because it is permissively licensed (unlike PyMuPDF, which is AGPL-3.0 / commercial) and is faster on text-heavy documents. Table detection (via pdfplumber) and OCR work identically on either engine.

native=True (PDFium only) switches to a lean schema — full page text, positioned text objects, and images, with no table detection — which is dramatically faster (~20–80×) when you don't need structured tables.
Image-only / scanned pages fall back to Tesseract OCR automatically on both engines (requires the Tesseract binary; see above).

Parallel Processing & Concurrency

By default, PDFReader and PDFWriter run sequentially (workers=1). You can enable parallel processing on multi-core systems by passing a workers count greater than 1:

from datagrunt import PDFReader

if __name__ == '__main__':
    # Run with 8 processes to parse pages concurrently
    reader = PDFReader("report.pdf", workers=8)
    document = reader.to_dicts()

Why `if __name__ == '__main__':` is required

Because PDFium is not thread-safe within a single process, datagrunt uses a process pool (ProcessPoolExecutor with the spawn start context on macOS and Windows) to parse pages concurrently.

Under Python's spawn start context, child processes import the main module to initialize. If you call PDFReader or PDFWriter with workers > 1 outside an if __name__ == '__main__': block, each child process re-runs that call on import and tries to spawn its own process pool, leading to a crash or an infinite recursion loop.

For single-page documents, datagrunt automatically bypasses the process pool and executes sequentially to avoid process spawning overhead.

Distributed Runtimes Fallback

When running inside managed distributed environments (e.g. Apache Spark, Apache Beam, Apache Flink, or Celery), nested process spawning is restricted or causes container sandbox permission errors. datagrunt automatically detects these environments (by checking variables like SPARK_ENV_LOADED, BEAM_WORKER_ID, etc.) and falls back to sequential, parent-process execution to ensure robust, conflict-free operation.
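
The fallback rules described in this section can be summarized in a few lines. The sketch below is an illustration under stated assumptions, not Datagrunt's actual logic: effective_workers is a hypothetical helper, only the two marker variables named above are checked, and capping the pool at the page count is a choice made for this example.

```python
from typing import Mapping

# Marker variables named in the text; a real detector may check more.
DISTRIBUTED_ENV_MARKERS = ("SPARK_ENV_LOADED", "BEAM_WORKER_ID")


def effective_workers(requested: int, page_count: int, environ: Mapping[str, str]) -> int:
    """Decide how many worker processes to use (hypothetical helper)."""
    if requested <= 1 or page_count <= 1:
        return 1  # sequential by default, and for single-page documents
    if any(marker in environ for marker in DISTRIBUTED_ENV_MARKERS):
        return 1  # managed runtime detected: stay in the parent process
    # Assumption for this sketch: never start more workers than pages.
    return min(requested, page_count)


print(effective_workers(8, 20, {}))
print(effective_workers(8, 20, {"SPARK_ENV_LOADED": "1"}))
```

Passing the environment in as a mapping, rather than reading os.environ inside the function, keeps the decision easy to test.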

When images are written to disk, byte-identical duplicates (common with repeated icons or backgrounds) are collapsed to a single file and all references are repointed to it. Pass dedupe=False / dedupe_images=False to keep every copy.
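
Byte-identical deduplication of this kind is commonly done by hashing the image bytes. The following sketch shows the idea with SHA-256; it is not taken from Datagrunt's source, and the function name, file names, and byte strings are all made up for illustration.

```python
import hashlib


def dedupe_images(images: dict[str, bytes]) -> dict[str, str]:
    """Map each image name to the name of the first byte-identical image."""
    first_seen: dict[str, str] = {}  # content digest -> canonical name
    references: dict[str, str] = {}  # image name -> canonical name
    for name, data in images.items():
        digest = hashlib.sha256(data).hexdigest()
        # setdefault keeps the first name seen for each distinct content.
        references[name] = first_seen.setdefault(digest, name)
    return references


# Placeholder byte strings standing in for real image data.
images = {
    "p1_logo.png": b"logo-bytes",
    "p2_logo.png": b"logo-bytes",
    "p2_chart.png": b"chart-bytes",
}
references = dedupe_images(images)
files_to_write = sorted(set(references.values()))
print(references)
print(files_to_write)
```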

On graphically dense PDFs, line-based table detection can pick up decorative boxes and rule lines as 1×N or N×1 "tables". Pass drop_layout_tables=True to the reader (to_dicts, to_dataframe, to_arrow_table) or writer (write_json, write_json_newline_delimited) to discard those and keep only tables with at least two rows and two columns. It is off by default.
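
The rule behind drop_layout_tables, keeping only tables with at least two rows and two columns, can be sketched as a simple filter. The representation of a table as a list of rows and the helper name is_layout_table are assumptions for this example, not Datagrunt's internal types.

```python
def is_layout_table(table: list[list[str]]) -> bool:
    """True for 1xN or Nx1 grids, i.e. fewer than two rows or two columns."""
    row_count = len(table)
    column_count = max((len(row) for row in table), default=0)
    return row_count < 2 or column_count < 2


tables = [
    [["Quarter", "Revenue"], ["Q1", "10"], ["Q2", "12"]],  # genuine 3x2 table
    [["decorative box"]],                                   # 1x1
    [["a", "b", "c"]],                                      # 1x3, e.g. a rule line
    [["x"], ["y"]],                                         # 2x1
]

kept = [table for table in tables if not is_layout_table(table)]
print(len(kept))
```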

Engine Comparison

Feature	Polars	DuckDB	PyArrow
Best for	DataFrame operations	SQL queries & analytics	Arrow ecosystem integration
Performance	Fast in-memory processing	Excellent for large datasets	Optimized columnar operations
Default for	CSVReader	CSVWriter	-
Export Quality	Good	Excellent (especially JSON)	Native Parquet support

The engines above apply to CSV processing. PDF parsing uses the PDFium engine by default (permissively licensed), with PyMuPDF available via engine="pymupdf" — see PDF parsing.

Primary Classes

CSVReader: Read and process CSV files with intelligent delimiter detection
CSVWriter: Export CSV data to multiple formats (CSV, Excel, JSON, Parquet)
CSVSchemaReportAIGenerated: Generate AI-powered schema analysis reports
PDFReader: Parse PDF files into text, tables, and images as dicts, Polars DataFrames, or PyArrow tables
PDFWriter: Write parsed PDF output to JSON or JSONL and extract embedded images to disk

Full Documentation

For complete documentation, detailed examples, and advanced usage patterns, see: 📖 Complete Documentation

License

This project is licensed under the MIT License.

Acknowledgements

A HUGE thank you to the open source community and the creators of DuckDB, Polars, and PyArrow for their fantastic libraries that power Datagrunt.

Source Repository

https://github.com/pmgraham/datagrunt


Maintainers

pmgraham


This version: 3.1.10 (Jun 10, 2026)
