poor man's data lake
PyDala2
Overview 📖
PyDala2 is a high-performance Python library for managing Parquet datasets with advanced metadata capabilities. Built on Apache Arrow, it provides:
- Smart dataset management with metadata optimization
- Multi-format support (Parquet, CSV, JSON)
- Multi-backend integration (Polars, PyArrow, DuckDB, Pandas)
- Advanced querying with predicate pushdown
- Schema management with automatic validation
- Performance optimization with caching and partitioning
- Catalog system for centralized dataset management
✨ Key Features
- 🚀 High Performance: Built on Apache Arrow with optimized memory usage and processing speed
- 📊 Smart Dataset Management: Efficient Parquet handling with metadata optimization and caching
- 🔄 Multi-backend Support: Seamlessly switch between Polars, PyArrow, DuckDB, and Pandas
- 🔍 Advanced Querying: SQL-like filtering with predicate pushdown for maximum efficiency
- 📋 Schema Management: Automatic validation, evolution, and tracking of data schemas
- ⚡ Performance Optimization: Built-in caching, compression, and intelligent partitioning
- 🛡️ Type Safety: Comprehensive validation and error handling throughout the library
- 🏗️ Catalog System: Centralized dataset management across namespaces
🚀 Quick Start
Installation
```bash
# Install PyDala2
pip install pydala2

# Install with all optional dependencies (quoted so the brackets
# are not interpreted by shells such as zsh)
pip install "pydala2[all]"

# Install with specific backends
pip install "pydala2[polars,duckdb]"
```
Basic Usage
```python
from pydala import ParquetDataset
import pandas as pd

# Create a dataset
dataset = ParquetDataset("data/my_dataset")

# Write data
data = pd.DataFrame({
    'id': range(100),
    'category': ['A', 'B', 'C'] * 33 + ['A'],
    'value': [i * 2 for i in range(100)]
})
dataset.write_to_dataset(
    data=data,
    partition_cols=['category']
)

# Read with filtering - automatic backend selection
result = dataset.filter("category IN ('A', 'B') AND value > 50")

# Export to different formats
df_polars = result.table.to_polars()  # or use the shortcut: result.t.pl
df_pandas = result.table.df           # or result.t.df
duckdb_rel = result.table.ddb         # or result.t.ddb
```
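Writing with `partition_cols=['category']` produces a hive-style directory layout, where each distinct value of the partition column becomes a `category=<value>` directory holding that partition's files. The dependency-free sketch below illustrates only the naming convention (the placeholder `.txt` files stand in for the Parquet files a real writer would produce; this is not PyDala2 code):

```python
import os
import tempfile

# Rows to be partitioned by the "category" column
rows = [
    {"id": 1, "category": "A", "value": 2},
    {"id": 2, "category": "B", "value": 4},
    {"id": 3, "category": "A", "value": 6},
]

root = tempfile.mkdtemp()
for row in rows:
    # Hive-style convention: one directory per partition value
    part_dir = os.path.join(root, f"category={row['category']}")
    os.makedirs(part_dir, exist_ok=True)
    # A real dataset writer would append Parquet files here
    with open(os.path.join(part_dir, "part-0.txt"), "a") as f:
        f.write(f"{row['id']},{row['value']}\n")

print(sorted(os.listdir(root)))  # ['category=A', 'category=B']
```

Because the partition value is encoded in the path, a filter such as `category IN ('A', 'B')` can skip whole directories without opening any data files.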
Using Different Backends
```python
import polars as pl

# PyDala2 provides automatic backend selection -
# just access the data in your preferred format:

# Polars LazyFrame (recommended for performance)
lazy_df = dataset.table.pl  # or dataset.t.pl
result = (
    lazy_df
    .filter(pl.col("value") > 100)
    .group_by("category")
    .agg(pl.mean("value"))
    .collect()
)

# DuckDB (for SQL queries)
result = dataset.ddb_con.sql("""
    SELECT category, AVG(value) AS avg_value
    FROM dataset
    GROUP BY category
""").to_arrow()

# PyArrow Table (for columnar operations)
table = dataset.table.arrow  # or dataset.t.arrow

# Pandas DataFrame (for compatibility)
df_pandas = dataset.table.df  # or dataset.t.df

# Direct export methods
df_polars = dataset.table.to_polars(lazy=False)
table = dataset.table.to_arrow()
df_pandas = dataset.table.to_pandas()
```
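The LazyFrame path is recommended because lazy pipelines defer all work until `.collect()`, which lets the engine fuse operations and push filters down before any data is scanned. The same idea in a dependency-free sketch using Python generators (illustrative only, not the PyDala2 or Polars implementation):

```python
from collections import defaultdict

rows = [{"category": c, "value": v}
        for c, v in [("A", 50), ("A", 150), ("B", 200), ("B", 80)]]

# Build a lazy pipeline: nothing runs yet, each stage just wraps the last
lazy = (r for r in rows)                      # scan (deferred)
lazy = (r for r in lazy if r["value"] > 100)  # filter (deferred)

# "collect": only now are rows scanned, filtered, and aggregated
groups = defaultdict(list)
for r in lazy:
    groups[r["category"]].append(r["value"])
agg = {k: sum(v) / len(v) for k, v in groups.items()}
print(agg)  # {'A': 150.0, 'B': 200.0}
```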
Catalog Management
```python
from pydala import Catalog

# Create a catalog from a YAML configuration
catalog = Catalog("catalog.yaml")
```

An example `catalog.yaml`:

```yaml
tables:
  sales_2023:
    path: "/data/sales/2023"
    filesystem: "local"
  customers:
    path: "/data/customers"
    filesystem: "local"
```

```python
# Query across datasets using automatic table loading
result = catalog.query("""
    SELECT
        s.*,
        c.customer_name,
        c.segment
    FROM sales_2023 s
    JOIN customers c ON s.customer_id = c.id
    WHERE s.date >= '2023-01-01'
""")

# Or access datasets directly
sales_dataset = catalog.get_dataset("sales_2023")
filtered_sales = sales_dataset.filter("amount > 1000")
```
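Conceptually, a catalog is a mapping from table names to dataset locations: when a query references `sales_2023`, the name is resolved through the configuration before any data is touched. A minimal stdlib sketch of that lookup (the `resolve` helper is hypothetical, not part of the PyDala2 API):

```python
# Catalog configuration as it would be parsed from catalog.yaml
config = {
    "tables": {
        "sales_2023": {"path": "/data/sales/2023", "filesystem": "local"},
        "customers": {"path": "/data/customers", "filesystem": "local"},
    }
}

def resolve(catalog: dict, name: str) -> str:
    """Return the storage path registered for a table name."""
    try:
        return catalog["tables"][name]["path"]
    except KeyError:
        raise KeyError(f"unknown table: {name}") from None

print(resolve(config, "sales_2023"))  # /data/sales/2023
```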
📚 Documentation
Comprehensive documentation is available at pydala2.readthedocs.io:
- Getting Started
- User Guide
- API Reference
  - Core Classes
  - Dataset Classes
  - Table Operations
  - Metadata Management
  - Catalog System
  - Filesystem
  - Utilities
- Advanced Topics
🤝 Contributing
Contributions are welcome! Please see our Contributing Guide for details.
📝 License
Project details
File details

Details for the file pydala2-0.21.5.tar.gz.

File metadata

- Download URL: pydala2-0.21.5.tar.gz
- Upload date:
- Size: 411.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.9

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `7da2bf912440a610a1dea2297d7e7402cbb4c4be32157f130e867540de898355` |
| MD5 | `2a0174842a45069bb7d6c9d4e9688176` |
| BLAKE2b-256 | `6645254a364efc91c7cbf0192dbfd09b23a5d0117df5e1a6f7760c1f4aec64f7` |
File details

Details for the file pydala2-0.21.5-py3-none-any.whl.

File metadata

- Download URL: pydala2-0.21.5-py3-none-any.whl
- Upload date:
- Size: 68.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.9

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `800cc51699ec5995053574198bc4f586a74bd49a11447e756033254c89bbaae1` |
| MD5 | `5d8bd957a3819c8fd95d02def68ebffb` |
| BLAKE2b-256 | `3ed79abb8c68e867e01af0c0ebb890d846f47f8d3d679ba314c5f70d6a2e4000` |