
poor man's data lake


PyDala2


📖 Overview

PyDala2 is a high-performance Python library for managing Parquet datasets, built on Apache Arrow. It combines advanced metadata handling with:

  • Smart dataset management with metadata optimization
  • Multi-format support (Parquet, CSV, JSON)
  • Multi-backend integration (Polars, PyArrow, DuckDB, Pandas)
  • Advanced querying with predicate pushdown
  • Schema management with automatic validation
  • Performance optimization with caching and partitioning
  • Catalog system for centralized dataset management

✨ Key Features

  • 🚀 High Performance: Built on Apache Arrow with optimized memory usage and processing speed
  • 📊 Smart Dataset Management: Efficient Parquet handling with metadata optimization and caching
  • 🔄 Multi-backend Support: Seamlessly switch between Polars, PyArrow, DuckDB, and Pandas
  • 🔍 Advanced Querying: SQL-like filtering with predicate pushdown for maximum efficiency
  • 📋 Schema Management: Automatic validation, evolution, and tracking of data schemas
  • ⚡ Performance Optimization: Built-in caching, compression, and intelligent partitioning
  • 🛡️ Type Safety: Comprehensive validation and error handling throughout the library
  • 🏗️ Catalog System: Centralized dataset management across namespaces
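On partitioned datasets, predicate pushdown means filters on partition columns are applied while scanning, so non-matching directories are never read at all. A toy stdlib sketch of that pruning idea on a Hive-style layout (illustrative only, not PyDala2's internals):

```python
# Hive-style partition directories encode the column value in the path.
partitions = ["category=A", "category=B", "category=C"]

def prune(partitions, column, allowed):
    """Keep only partitions whose encoded value satisfies the filter."""
    kept = []
    for p in partitions:
        key, _, value = p.partition("=")
        # Partitions on other columns can't be pruned by this filter.
        if key != column or value in allowed:
            kept.append(p)
    return kept

print(prune(partitions, "category", {"A", "B"}))  # ['category=A', 'category=B']
```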

🚀 Quick Start

Installation

# Install PyDala2
pip install pydala2

# Install with all optional dependencies
pip install pydala2[all]

# Install with specific backends
pip install pydala2[polars,duckdb]
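After installing with a subset of extras, you can confirm which optional backends are importable in the current environment with a quick stdlib check (this probe is generic Python, not a PyDala2 API):

```python
# Stdlib-only probe; safe to run even if none of the backends are installed.
import importlib.util

def available_backends(names=("polars", "duckdb", "pyarrow", "pandas")):
    """Return the subset of backend module names that can be imported."""
    return [n for n in names if importlib.util.find_spec(n) is not None]

print(available_backends())
```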

Basic Usage

from pydala import ParquetDataset
import pandas as pd

# Create a dataset
dataset = ParquetDataset("data/my_dataset")

# Write data
data = pd.DataFrame({
    'id': range(100),
    'category': ['A', 'B', 'C'] * 33 + ['A'],
    'value': [i * 2 for i in range(100)]
})
dataset.write_to_dataset(
    data=data,
    partition_cols=['category']
)

# Read with filtering - automatic backend selection
result = dataset.filter("category IN ('A', 'B') AND value > 50")

# Export to different formats
df_polars = result.table.to_polars()  # or use shortcut: result.t.pl
df_pandas = result.table.df           # or result.t.df
duckdb_rel = result.table.ddb         # or result.t.ddb
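To make the filter above concrete, here is the same predicate evaluated in plain Python over the same 100 rows, just to show which rows `"category IN ('A', 'B') AND value > 50"` selects (PyDala2 pushes this filter into the scan instead of materializing everything):

```python
# Same data as the DataFrame above, rebuilt with stdlib types.
categories = ["A", "B", "C"] * 33 + ["A"]
rows = [{"id": i, "category": categories[i], "value": i * 2} for i in range(100)]

# Equivalent of: category IN ('A', 'B') AND value > 50
matching = [r for r in rows if r["category"] in ("A", "B") and r["value"] > 50]
print(len(matching))  # → 49
```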

Using Different Backends

# PyDala2 provides automatic backend selection
# Just access data in your preferred format:
import polars as pl

# Polars LazyFrame (recommended for performance)
lazy_df = dataset.table.pl  # or dataset.t.pl
result = (
    lazy_df
    .filter(pl.col("value") > 100)
    .group_by("category")
    .agg(pl.mean("value"))
    .collect()
)

# DuckDB (for SQL queries)
result = dataset.ddb_con.sql("""
    SELECT category, AVG(value) as avg_value
    FROM dataset
    GROUP BY category
""").to_arrow()

# PyArrow Table (for columnar operations)
table = dataset.table.arrow  # or dataset.t.arrow

# Pandas DataFrame (for compatibility)
df_pandas = dataset.table.df  # or dataset.t.df

# Direct export methods
df_polars = dataset.table.to_polars(lazy=False)
table = dataset.table.to_arrow()
df_pandas = dataset.table.to_pandas()
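The `.t` seen throughout is a short alias for `.table`. The delegation pattern behind such shortcut accessors can be sketched with a plain property (a hypothetical minimal class, not PyDala2's actual implementation):

```python
class Dataset:
    """Minimal sketch of the shortcut-accessor pattern (hypothetical)."""

    def __init__(self, table):
        self.table = table

    @property
    def t(self):
        # `.t` simply delegates to `.table`, so both names stay in sync.
        return self.table

ds = Dataset(table={"rows": 3})
assert ds.t is ds.table
```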

Catalog Management

from pydala import Catalog

# Create catalog from YAML configuration
catalog = Catalog("catalog.yaml")

# YAML configuration example:
# tables:
#   sales_2023:
#     path: "/data/sales/2023"
#     filesystem: "local"
#   customers:
#     path: "/data/customers"
#     filesystem: "local"

# Query across datasets using automatic table loading
result = catalog.query("""
    SELECT
        s.*,
        c.customer_name,
        c.segment
    FROM sales_2023 s
    JOIN customers c ON s.customer_id = c.id
    WHERE s.date >= '2023-01-01'
""")

# Or access datasets directly
sales_dataset = catalog.get_dataset("sales_2023")
filtered_sales = sales_dataset.filter("amount > 1000")
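At its core, the catalog maps logical table names to storage locations, which is what lets `catalog.query` resolve `sales_2023` and `customers` by name. A stdlib sketch of that lookup over the YAML structure shown above (the dict mirrors the example config; the helper is hypothetical, not PyDala2's API):

```python
# In-memory mirror of the YAML catalog configuration from the example above.
config = {
    "tables": {
        "sales_2023": {"path": "/data/sales/2023", "filesystem": "local"},
        "customers": {"path": "/data/customers", "filesystem": "local"},
    }
}

def get_path(config, name):
    """Look up the storage path registered for a logical table name."""
    return config["tables"][name]["path"]

print(get_path(config, "sales_2023"))  # /data/sales/2023
```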

📚 Documentation

Comprehensive documentation is available at pydala2.readthedocs.io:

  • Getting Started
  • User Guide
  • API Reference
  • Advanced Topics

🤝 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

📝 License

MIT License

Project details

Latest release: pydala2 0.22.3, available on PyPI as a source distribution (pydala2-0.22.3.tar.gz, 412.3 kB) and a pure-Python wheel (pydala2-0.22.3-py3-none-any.whl, 69.4 kB).