Poor man's data lake
PyDala2
Overview 📖
PyDala2 is a high-performance Python library for managing Parquet datasets and their metadata. Built on Apache Arrow, it offers an efficient, user-friendly interface for large-scale data operations.
✨ Key Features
- 📦 Smart Dataset Management: Efficient Parquet handling with metadata optimization
- 🔄 Robust Caching: Built-in caching for faster data access
- 🔌 Seamless Integration: Works with Polars, PyArrow, and DuckDB
- 🔍 Advanced Querying: SQL-like filtering with predicate pushdown
- 🛠️ Schema Management: Automatic validation and tracking
🚀 Quick Start
Installation
pip install pydala2
📊 Creating a Dataset
from pydala.dataset import ParquetDataset

dataset = ParquetDataset(
    path="path/to/dataset",
    partitioning="hive",            # Hive-style partitioning
    timestamp_column="timestamp",   # For time-based operations
    cached=True,                    # Enable performance caching
)
💾 Writing Data
from datetime import date, timedelta

import polars as pl

# Create sample time-series data: 1000 consecutive days
start = date(2022, 1, 1)
df = pl.DataFrame({
    "timestamp": pl.date_range(start, start + timedelta(days=999), "1d", eager=True),
    "value": range(1000),
})

# Write with smart partitioning and compression
dataset.write_to_dataset(
    # data can be a Polars or pandas DataFrame, an Arrow Table, Dataset,
    # or RecordBatch, or a DuckDB result
    data=df,
    mode="overwrite",                # Options: "overwrite", "append", "delta"
    row_group_size=250_000,          # Optimize chunk size
    compression="zstd",              # High-performance compression
    partition_by=["year", "month"],  # Auto-partition by time
    unique=True,                     # Ensure data uniqueness
)
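To illustrate what Hive-style partitioning by year and month means on disk, here is a minimal standard-library sketch. It is not PyDala2's own code: the helper `hive_partition_path` is hypothetical, and whether PyDala2 zero-pads month values or uses exactly these directory names is an assumption.

```python
from datetime import date
from pathlib import PurePosixPath


def hive_partition_path(base: str, day: date) -> PurePosixPath:
    """Return a Hive-style directory for `day`, partitioned by year and month.

    Hypothetical helper for illustration only.
    """
    # Hive-style layouts encode each partition column as a key=value directory.
    return PurePosixPath(base) / f"year={day.year}" / f"month={day.month}"


print(hive_partition_path("path/to/dataset", date(2023, 1, 15)))
# path/to/dataset/year=2023/month=1
```

Because the partition values are part of the path, a reader can skip entire directories when a filter constrains `year` or `month`.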
📥 Reading & Converting Data
dataset.load(update_metadata=True)
# Flexible data format conversion
pt = dataset.t # PyDala Table
df_polars = pt.to_polars() # Convert to Polars
df_pandas = pt.to_pandas() # Convert to Pandas
df_arrow = pt.to_arrow() # Convert to Arrow
rel_ddb = pt.to_ddb() # Convert to a DuckDB relation
# and many more...
🔍 Smart Querying
# Efficient filtered reads with predicate pushdown
pt_filtered = dataset.filter("timestamp > '2023-01-01'")
# Chaining operations
df_filtered = (
    dataset
    .filter("column_name > 100")
    .pl.with_columns(
        # column_name is numeric here, so cast it before using string functions
        pl.col("column_name").cast(pl.Utf8).str.slice(0, 5).alias("new_column_name")
    )
    .to_pandas()
)
# Fast metadata-only scans
pt_scanned = dataset.scan("column_name > 100")
# Access matching files
matching_files = dataset.scan_files
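The idea behind a metadata-only scan can be sketched without reading any row data: each Parquet file records per-column minimum and maximum values, and a file can be skipped when those statistics rule out a match. The sketch below is a simplified illustration under that assumption. `FileStats` and `files_matching_gt` are hypothetical names, not part of the PyDala2 API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one column."""
    path: str
    min_value: int
    max_value: int


def files_matching_gt(stats: list[FileStats], threshold: int) -> list[str]:
    """Return the files that may contain rows with value > threshold."""
    # A file can be skipped only if its maximum is not above the threshold.
    return [s.path for s in stats if s.max_value > threshold]


stats = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 50, 150),
    FileStats("part-2.parquet", 200, 300),
]
print(files_matching_gt(stats, 100))
# ['part-1.parquet', 'part-2.parquet']
```

Note that `part-1.parquet` is kept even though some of its rows fail the predicate: statistics can only exclude files, so the remaining files still need a row-level filter.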
🔄 Metadata Management
# Incremental metadata update
dataset.load(update_metadata=True) # Update for new files
# Full metadata refresh
dataset.load(reload_metadata=True) # Reload all metadata
# Repair schema/metadata
dataset.repair_schema()
⚡ Performance Optimization Tools
# Optimize storage types
dataset.opt_dtypes() # Automatic type optimization
# Smart file management
dataset.compact_by_rows(max_rows=100_000) # Combine small files
dataset.repartition(partitioning_columns=["date"]) # Optimize partitions
dataset.compact_by_timeperiod(interval="1d") # Time-based optimization
dataset.compact_partitions() # Partition structure optimization
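Row-based compaction can be pictured as grouping small files into batches that are each rewritten as one larger file. The following is a minimal greedy sketch of that planning step, not PyDala2's actual algorithm. `plan_compaction` and the file names are hypothetical.

```python
def plan_compaction(row_counts: dict[str, int], max_rows: int) -> list[list[str]]:
    """Group files, in insertion order, into batches of roughly `max_rows` rows.

    A single file with more than `max_rows` rows forms its own batch.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    current_rows = 0
    for path, rows in row_counts.items():
        # Close the current batch when adding this file would exceed the limit.
        if current and current_rows + rows > max_rows:
            batches.append(current)
            current, current_rows = [], 0
        current.append(path)
        current_rows += rows
    if current:
        batches.append(current)
    return batches


files = {"a.parquet": 40_000, "b.parquet": 50_000, "c.parquet": 30_000, "d.parquet": 100_000}
print(plan_compaction(files, max_rows=100_000))
# [['a.parquet', 'b.parquet'], ['c.parquet'], ['d.parquet']]
```

A greedy pass like this keeps files in their original order, which matters when the data is sorted by time. It does not try to find the tightest possible packing.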
⚠️ Important Notes
- Type optimization involves a full dataset rewrite.
- Choose a compaction strategy based on your access patterns.
- Regular metadata updates ensure optimal query performance.
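Type optimization requires a rewrite because changing a column's storage type changes every file that contains that column. As a rough illustration of the kind of decision involved, the sketch below picks the narrowest signed integer type that can hold a set of observed values. How `opt_dtypes()` actually chooses types is not documented here, so treat `smallest_int_type` as a hypothetical stand-in.

```python
# Signed integer types with their inclusive value ranges, narrowest first.
INT_TYPES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]


def smallest_int_type(values) -> str:
    """Return the narrowest signed integer type that fits all `values`.

    `values` must be a non-empty iterable of ints.
    """
    lo, hi = min(values), max(values)
    for name, type_min, type_max in INT_TYPES:
        if type_min <= lo and hi <= type_max:
            return name
    raise OverflowError("values do not fit in a 64-bit signed integer")


print(smallest_int_type(range(1000)))
# int16
```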
📚 Documentation
For advanced usage and complete API documentation, visit our docs.
🤝 Contributing
Contributions welcome! See our contribution guidelines.
📝 License
File details
Details for the file pydala2-0.9.5.1.tar.gz.
File metadata
- Download URL: pydala2-0.9.5.1.tar.gz
- Upload date:
- Size: 156.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ccc5a5d5303d9822d15824aa4617515818287224aba485db559c8eb545f50fd3 |
| MD5 | adf96f990c3c28177a498911cf00228c |
| BLAKE2b-256 | b21c7ca7462e674c5dcdd03989cff06b5f08fdc791ca8f71419d4303f85326f1 |
File details
Details for the file pydala2-0.9.5.1-py3-none-any.whl.
File metadata
- Download URL: pydala2-0.9.5.1-py3-none-any.whl
- Upload date:
- Size: 57.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d294f044f4382c4d33b8ae8355f3fcd6097b8e1db471f81d32e550bd75aa7363 |
| MD5 | e87fa318aac09733deb3193df92ab1fd |
| BLAKE2b-256 | 63735a623b4a5a3ba732e9373b29b7ca24db6dd275c22cee9a210f0b1af764bf |
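A downloaded file can be checked against the published SHA256 digest using only the Python standard library. This is a small sketch: the helper names `sha256_of_file` and `matches_published_hash` are illustrative, and the local path to the downloaded file is up to you.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files are never fully loaded into memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("pydala2-0.9.5.1.tar.gz", "ccc5a5d5...")`, with the full digest from the corresponding table, should return `True` for an intact download of the source distribution.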