poor man's data lake

PyDala2


Overview 📖

PyDala2 is a high-performance Python library for managing Parquet datasets with advanced metadata capabilities. Built on Apache Arrow, it provides:

  • Smart dataset management with metadata optimization
  • Multi-format support (Parquet, CSV, JSON)
  • Multi-backend integration (Polars, PyArrow, DuckDB, Pandas)
  • Advanced querying with predicate pushdown
  • Schema management with automatic validation
  • Performance optimization with caching and partitioning
  • Catalog system for centralized dataset management

✨ Key Features

  • 🚀 High Performance: Built on Apache Arrow with optimized memory usage and processing speed
  • 📊 Smart Dataset Management: Efficient Parquet handling with metadata optimization and caching
  • 🔄 Multi-backend Support: Seamlessly switch between Polars, PyArrow, DuckDB, and Pandas
  • 🔍 Advanced Querying: SQL-like filtering with predicate pushdown for maximum efficiency
  • 📋 Schema Management: Automatic validation, evolution, and tracking of data schemas
  • ⚡ Performance Optimization: Built-in caching, compression, and intelligent partitioning
  • 🛡️ Type Safety: Comprehensive validation and error handling throughout the library
  • 🏗️ Catalog System: Centralized dataset management across namespaces
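Predicate pushdown means filters are evaluated against metadata before any data pages are read, so fragments that cannot match are skipped entirely. Parquet footers store per-column min/max statistics per row group, which makes this possible. The idea can be sketched with the stdlib (the fragment names and statistics below are illustrative, not PyDala2 internals):

```python
# Sketch: prune data fragments whose min/max statistics cannot satisfy a filter,
# the core idea behind predicate pushdown over Parquet row-group statistics.
from dataclasses import dataclass

@dataclass
class FragmentStats:
    """Illustrative per-fragment column statistics (like a Parquet row group)."""
    path: str
    value_min: int
    value_max: int

def may_match(stats: FragmentStats, lower_bound: int) -> bool:
    """Can this fragment contain any row with value > lower_bound?"""
    return stats.value_max > lower_bound

fragments = [
    FragmentStats("category=A/part-0.parquet", 0, 40),
    FragmentStats("category=B/part-0.parquet", 30, 120),
    FragmentStats("category=C/part-0.parquet", 90, 198),
]

# Filter "value > 50": only fragments whose max exceeds 50 need to be read.
to_read = [f.path for f in fragments if may_match(f, 50)]
print(to_read)  # the category=A fragment is pruned without reading its data
```

Only the filter's bounds are compared against the statistics; the pruned fragment's data is never deserialized.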

🚀 Quick Start

Installation

# Install PyDala2
pip install pydala2

# Install with all optional dependencies
pip install pydala2[all]

# Install with specific backends
pip install pydala2[polars,duckdb]

Basic Usage

from pydala import ParquetDataset
import pandas as pd

# Create a dataset
dataset = ParquetDataset("data/my_dataset")

# Write data
data = pd.DataFrame({
    'id': range(100),
    'category': ['A', 'B', 'C'] * 33 + ['A'],
    'value': [i * 2 for i in range(100)]
})
dataset.write_to_dataset(
    data=data,
    partition_cols=['category']
)
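With `partition_cols=['category']`, the rows are written into hive-style `key=value` directories, the standard Arrow/Parquet convention (file names inside each partition are writer-generated; the layout below is illustrative). The resulting directory names can be sketched with the stdlib:

```python
# Sketch: the hive-style partition directories produced by partitioning
# the example DataFrame on 'category'. One directory per distinct value.
categories = ['A', 'B', 'C'] * 33 + ['A']  # same column as in the example above

partitions = sorted({f"data/my_dataset/category={c}" for c in categories})
print(partitions)
# ['data/my_dataset/category=A', 'data/my_dataset/category=B', 'data/my_dataset/category=C']
```

Because the partition value is encoded in the path, a filter on `category` can skip whole directories without opening any files.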

# Read with filtering - automatic backend selection
result = dataset.filter("category IN ('A', 'B') AND value > 50")

# Export to different formats
df_polars = result.table.to_polars()  # or use shortcut: result.t.pl
df_pandas = result.table.df           # or result.t.df
duckdb_rel = result.table.ddb         # or result.t.ddb

Using Different Backends

# PyDala2 provides automatic backend selection.
# Just access data in your preferred format:
import polars as pl

# Polars LazyFrame (recommended for performance)
lazy_df = dataset.table.pl  # or dataset.t.pl
result = (
    lazy_df
    .filter(pl.col("value") > 100)
    .group_by("category")
    .agg(pl.mean("value"))
    .collect()
)

# DuckDB (for SQL queries)
result = dataset.ddb_con.sql("""
    SELECT category, AVG(value) as avg_value
    FROM dataset
    GROUP BY category
""").to_arrow()

# PyArrow Table (for columnar operations)
table = dataset.table.arrow  # or dataset.t.arrow

# Pandas DataFrame (for compatibility)
df_pandas = dataset.table.df  # or dataset.t.df

# Direct export methods
df_polars = dataset.table.to_polars(lazy=False)
table = dataset.table.to_arrow()
df_pandas = dataset.table.to_pandas()

Catalog Management

from pydala import Catalog

# Create catalog from YAML configuration
catalog = Catalog("catalog.yaml")

# YAML configuration example:
# tables:
#   sales_2023:
#     path: "/data/sales/2023"
#     filesystem: "local"
#   customers:
#     path: "/data/customers"
#     filesystem: "local"

# Query across datasets using automatic table loading
result = catalog.query("""
    SELECT
        s.*,
        c.customer_name,
        c.segment
    FROM sales_2023 s
    JOIN customers c ON s.customer_id = c.id
    WHERE s.date >= '2023-01-01'
""")

# Or access datasets directly
sales_dataset = catalog.get_dataset("sales_2023")
filtered_sales = sales_dataset.filter("amount > 1000")
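Conceptually, `catalog.query` makes each configured dataset visible to the SQL engine as a table and runs the query across them (PyDala2 uses DuckDB over Parquet). The join semantics can be sketched with stdlib sqlite3 and toy data:

```python
# Sketch: the cross-dataset join from the catalog example, reproduced with
# stdlib sqlite3 on in-memory toy tables to show the query semantics.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_2023 (customer_id INTEGER, amount REAL, date TEXT)")
con.execute("CREATE TABLE customers (id INTEGER, customer_name TEXT, segment TEXT)")
con.executemany("INSERT INTO sales_2023 VALUES (?, ?, ?)",
                [(1, 1500.0, "2023-03-01"), (2, 200.0, "2022-12-31")])
con.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Acme", "enterprise"), (2, "Bob", "consumer")])

rows = con.execute("""
    SELECT s.amount, c.customer_name, c.segment
    FROM sales_2023 s
    JOIN customers c ON s.customer_id = c.id
    WHERE s.date >= '2023-01-01'
""").fetchall()
print(rows)  # the 2022 sale is filtered out by the date predicate
```

The catalog's job is the table registration step; the SQL itself is ordinary joins and filters.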

📚 Documentation

Comprehensive documentation is available at pydala2.readthedocs.io:

  • Getting Started
  • User Guide
  • API Reference
  • Advanced Topics

🤝 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

📝 License

MIT License


Download files

Download the file for your platform.

Source Distribution

pydala2-0.22.1.tar.gz (411.9 kB)

Uploaded Source

Built Distribution

pydala2-0.22.1-py3-none-any.whl (69.0 kB)

Uploaded Python 3

File details

Details for the file pydala2-0.22.1.tar.gz.

File metadata

  • Download URL: pydala2-0.22.1.tar.gz
  • Size: 411.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.9

File hashes

Hashes for pydala2-0.22.1.tar.gz:

  • SHA256: 3ab77a5a3b62f80fbef269464b0a4b9dddd3e344b55a85752d2e57e3d98a5945
  • MD5: 44135baf97eae3e3f2da85230751d9a3
  • BLAKE2b-256: d0a976d1d6dc8266edf1ad8ef3383b8cd0c79122557c03845e7acf35db44d4bf
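To verify a downloaded archive against a published digest, stream it through hashlib (stdlib) and compare hex strings; the small temp file below stands in for the real download:

```python
# Sketch: verify a file's SHA-256 digest, as you would against the
# published hash of a downloaded release archive.
import hashlib
import os
import tempfile

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo with a small temp file standing in for the downloaded archive.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    tmp = f.name
digest = sha256_of(tmp)
os.remove(tmp)
print(digest)  # 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
```

For a real check, run `sha256_of` on the downloaded `.tar.gz` or `.whl` and compare it with the SHA256 value listed above.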

File details

Details for the file pydala2-0.22.1-py3-none-any.whl.

File metadata

  • Download URL: pydala2-0.22.1-py3-none-any.whl
  • Size: 69.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.9

File hashes

Hashes for pydala2-0.22.1-py3-none-any.whl:

  • SHA256: 838e260e9d03e52706afa5832d8e75babc05e29dac1dd882fced29e16d88c79e
  • MD5: cb953652b8d05c3ea5fc7939da8871df
  • BLAKE2b-256: 95ee38076fc49b44e188cc6f8a748fce1de24c33fdcfcc81952ae171701295d6
