Poor man's data lake

PyDala2

Overview 📖

PyDala2 is a Python library for managing Parquet datasets together with their metadata. Built on Apache Arrow, it provides a single interface for writing, querying, and maintaining large partitioned datasets.

✨ Key Features

  • 📦 Smart Dataset Management: Efficient Parquet handling with metadata optimization
  • 🔄 Built-in Caching: Optional caching (cached=True) for faster repeated data access
  • 🔌 Seamless Integration: Works with Polars, PyArrow, and DuckDB
  • 🔍 Advanced Querying: SQL-like filtering with predicate pushdown
  • 🛠️ Schema Management: Automatic validation and tracking

🚀 Quick Start

Installation

pip install pydala2

📊 Creating a Dataset

from pydala.dataset import ParquetDataset

dataset = ParquetDataset(
    path="path/to/dataset",
    partitioning="hive",         # Hive-style partitioning
    timestamp_column="timestamp", # For time-based operations
    cached=True                  # Enable performance caching
)
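
To make the `partitioning="hive"` option more concrete, here is a minimal standard-library sketch of the Hive-style directory convention, where each partition column becomes a `key=value` path segment. The helper names `hive_partition_path` and `parse_hive_path` are hypothetical and are not part of PyDala2.

```python
from pathlib import PurePosixPath


def hive_partition_path(root: str, partitions: dict) -> str:
    """Build a Hive-style directory path such as root/year=2023/month=1."""
    parts = [f"{key}={value}" for key, value in partitions.items()]
    return str(PurePosixPath(root, *parts))


def parse_hive_path(path: str) -> dict:
    """Recover partition values (as strings) from key=value path segments."""
    result = {}
    for segment in PurePosixPath(path).parts:
        if "=" in segment:
            key, _, value = segment.partition("=")
            result[key] = value
    return result


path = hive_partition_path("path/to/dataset", {"year": 2023, "month": 1})
print(path)                   # path/to/dataset/year=2023/month=1
print(parse_hive_path(path))  # {'year': '2023', 'month': '1'}
```

Because the partition values live in the directory names, a reader can skip whole directories without opening any file in them.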

💾 Writing Data

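from datetime import date, timedelta

import polars as pl

# Create sample time-series data: 1000 daily timestamps with matching values
start = date(2023, 1, 1)
df = pl.DataFrame({
    "timestamp": pl.date_range(start, start + timedelta(days=999), "1d", eager=True),
    "value": range(1000)
})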

# Write with smart partitioning and compression
dataset.write_to_dataset(
    data=df,                    # Polars/pandas DataFrame, Arrow Table/Dataset/RecordBatch, or DuckDB result
    mode="overwrite",           # Options: "overwrite", "append", "delta"
    row_group_size=250_000,     # Optimize chunk size
    compression="zstd",         # High-performance compression
    partition_by=["year", "month"], # Auto-partition by time
    unique=True                 # Ensure data uniqueness
)
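
The README does not spell out what `unique=True` and `mode="delta"` do internally. One plausible reading, sketched below with plain tuples standing in for rows, is that `unique` drops repeated rows and `"delta"` writes only rows that are not already stored. Both functions are hypothetical illustrations of that assumption, not PyDala2 code.

```python
def deduplicate(rows):
    """Drop repeated rows while keeping first-seen order."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique_rows.append(row)
    return unique_rows


def delta_rows(existing, incoming):
    """Return the de-duplicated incoming rows that are not already stored."""
    stored = set(existing)
    return [row for row in deduplicate(incoming) if row not in stored]


existing = [("2023-01-01", 0), ("2023-01-02", 1)]
incoming = [("2023-01-02", 1), ("2023-01-03", 2), ("2023-01-03", 2)]
print(delta_rows(existing, incoming))  # [('2023-01-03', 2)]
```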

📥 Reading & Converting Data

dataset.load(update_metadata=True)

# Flexible data format conversion
pt = dataset.t                  # PyDala Table
df_polars = pt.to_polars()      # Convert to Polars
df_pandas = pt.to_pandas()      # Convert to Pandas
df_arrow = pt.to_arrow()        # Convert to Arrow
rel_ddb = pt.to_ddb()           # Convert to DuckDB relation

# and many more... 

🔍 Smart Querying

# Efficient filtered reads with predicate pushdown
pt_filtered = dataset.filter("timestamp > '2023-01-01'")

# Chaining operations
df_filtered = (
    dataset
    .filter("column_name > 100")
    .pl.with_columns(
        # Cast to string first: column_name is numeric in the filter above
        pl.col("column_name").cast(pl.Utf8).str.slice(0, 5).alias("new_column_name")
    )
    .to_pandas()
)

# Fast metadata-only scans
pt_scanned = dataset.scan("column_name > 100")

# Access matching files
matching_files = dataset.scan_files
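
The idea behind a metadata-only scan can be sketched without any Parquet library: if each file's minimum and maximum value for a column are known, files that cannot contain a match are skipped before any data is read. The `FileStats` class and `files_matching_greater_than` function below are hypothetical stand-ins for whatever statistics PyDala2 actually keeps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file min/max statistics for one column (illustrative only)."""
    path: str
    min_value: int
    max_value: int


def files_matching_greater_than(files, threshold):
    """Keep files whose value range can contain rows with value > threshold."""
    return [f.path for f in files if f.max_value > threshold]


catalog = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 50, 150),
]
print(files_matching_greater_than(catalog, 100))
# ['part-1.parquet', 'part-2.parquet']
```

Pruning of this kind is conservative: a surviving file may still contain non-matching rows, so the row-level filter has to run on whatever is read afterwards.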

🔄 Metadata Management

# Incremental metadata update
dataset.load(update_metadata=True)   # Update for new files

# Full metadata refresh
dataset.load(reload_metadata=True)   # Reload all metadata

# Repair schema/metadata
dataset.repair_schema()
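
The difference between an incremental update and a full reload can be illustrated with a small cache keyed by file path. This is a sketch under the assumption that metadata is collected per file; `update_metadata` and `fake_read_footer` are hypothetical names, not PyDala2 functions.

```python
def update_metadata(cache, files_on_disk, read_footer, reload=False):
    """Return (new_cache, files_read) without mutating the input cache.

    reload=False reads footers only for files missing from the cache;
    reload=True discards the cache and reads every file again.
    """
    if reload:
        new_cache = {}
    else:
        # Keep cached entries only for files that still exist
        new_cache = {p: m for p, m in cache.items() if p in files_on_disk}
    files_read = []
    for path in files_on_disk:
        if path not in new_cache:
            new_cache[path] = read_footer(path)
            files_read.append(path)
    return new_cache, files_read


def fake_read_footer(path):
    """Stand-in for reading a Parquet footer; returns made-up metadata."""
    return {"num_rows": 100}


cache = {"a.parquet": {"num_rows": 100}}
files = ["a.parquet", "b.parquet"]
_, read_incremental = update_metadata(cache, files, fake_read_footer)
_, read_full = update_metadata(cache, files, fake_read_footer, reload=True)
print(read_incremental)  # ['b.parquet']
print(read_full)         # ['a.parquet', 'b.parquet']
```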

⚡ Performance Optimization Tools

# Optimize storage types
dataset.opt_dtypes()              # Automatic type optimization

# Smart file management
dataset.compact_by_rows(max_rows=100_000)  # Combine small files
dataset.repartition(partitioning_columns=["date"])  # Optimize partitions
dataset.compact_by_timeperiod(interval="1d")  # Time-based optimization
dataset.compact_partitions()  # Partition structure optimization
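
As a rough illustration of row-based compaction, the sketch below groups consecutive small files so that each group stays at or below a row limit. It is a simple greedy plan written for this example; the name `plan_compaction` is hypothetical and the strategy PyDala2 actually uses may differ.

```python
def plan_compaction(file_rows, max_rows):
    """Greedily group consecutive files so each group holds at most max_rows rows.

    A single file larger than max_rows ends up in a group of its own.
    """
    groups, current, current_rows = [], [], 0
    for path, rows in file_rows:
        if current and current_rows + rows > max_rows:
            groups.append(current)
            current, current_rows = [], 0
        current.append(path)
        current_rows += rows
    if current:
        groups.append(current)
    return groups


files = [
    ("a.parquet", 40_000),
    ("b.parquet", 50_000),
    ("c.parquet", 30_000),
    ("d.parquet", 120_000),
]
print(plan_compaction(files, max_rows=100_000))
# [['a.parquet', 'b.parquet'], ['c.parquet'], ['d.parquet']]
```

Each multi-file group would then be rewritten as one file, which reduces the number of footers a reader has to open.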

⚠️ Important Notes

  • Type optimization involves a full dataset rewrite.
  • Choose a compaction strategy based on your access patterns.
  • Regular metadata updates ensure optimal query performance.
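
To show why type optimization changes the stored data, here is a minimal sketch of one such decision: picking the narrowest signed integer type that can hold a column's values. The function `smallest_int_type` is a hypothetical illustration and says nothing about the rules `opt_dtypes()` actually applies.

```python
# Signed integer types from narrowest to widest, with their value ranges
INT_TYPES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]


def smallest_int_type(values):
    """Return the name of the narrowest signed integer type that fits all values."""
    lo, hi = min(values), max(values)
    for name, type_min, type_max in INT_TYPES:
        if type_min <= lo and hi <= type_max:
            return name
    raise OverflowError("values do not fit in int64")


print(smallest_int_type(range(1000)))          # int16
print(smallest_int_type([0, 100]))             # int8
print(smallest_int_type([-1, 3_000_000_000]))  # int64
```

Changing a column from a wider to a narrower type alters how every value is encoded on disk, which is why such an optimization implies rewriting the files.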

📚 Documentation

For advanced usage and complete API documentation, visit our docs.

🤝 Contributing

Contributions welcome! See our contribution guidelines.

📝 License

MIT License
