Poor man's data lake

Project description

PyDala2

Overview 📖

PyDala2 is a high-performance Python library for managing Parquet datasets with rich metadata support. Built on Apache Arrow, it provides an efficient, user-friendly interface for large-scale data operations.

✨ Key Features

📦 Smart Dataset Management: Efficient Parquet handling with metadata optimization
🔄 Robust Caching: Built-in support for faster data access
🔌 Seamless Integration: Works with Polars, PyArrow, and DuckDB
🔍 Advanced Querying: SQL-like filtering with predicate pushdown
🛠️ Schema Management: Automatic validation and tracking

🚀 Quick Start

Installation

pip install pydala2

📊 Creating a Dataset

from pydala.dataset import ParquetDataset

dataset = ParquetDataset(
    path="path/to/dataset",
    partitioning="hive",         # Hive-style partitioning
    timestamp_column="timestamp", # For time-based operations
    cached=True                  # Enable performance caching
)
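Hive-style partitioning stores partition values in directory names as `key=value` segments. The snippet below is a minimal standard-library sketch of that naming convention only; it is not PyDala2 code, `parse_hive_path` is a hypothetical helper, and the example path is made up for illustration.

```python
from pathlib import PurePosixPath


def parse_hive_path(path: str) -> dict[str, str]:
    """Extract key=value partition segments from a hive-style file path."""
    partitions = {}
    # Only directory components can be partition segments; skip the file name.
    for part in PurePosixPath(path).parts[:-1]:
        key, sep, value = part.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# Hypothetical file inside a dataset partitioned by year and month
print(parse_hive_path("path/to/dataset/year=2023/month=1/part-0.parquet"))
```

Partition values come back as strings here; a real reader would typically cast them to the types recorded in the dataset schema.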

💾 Writing Data

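from datetime import date, timedelta

import polars as pl

# Create sample time-series data: 1,000 consecutive days (start date chosen arbitrarily)
df = pl.DataFrame({
    "timestamp": [date(2023, 1, 1) + timedelta(days=i) for i in range(1000)],
    "value": range(1000)
})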

# Write with smart partitioning and compression
dataset.write_to_dataset(
    data=df,                    # Polars/pandas DataFrame, Arrow Table/Dataset/RecordBatch, or DuckDB result
    mode="overwrite",           # Options: "overwrite", "append", "delta"
    row_group_size=250_000,     # Optimize chunk size
    compression="zstd",         # High-performance compression
    partition_by=["year", "month"], # Auto-partition by time
    unique=True                 # Ensure data uniqueness
)
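To make `partition_by=["year", "month"]` more concrete, here is a rough standard-library sketch of time-based partitioning: each row is assigned to a `year=.../month=...` directory derived from its timestamp. This illustrates the general idea under that assumption and is not how PyDala2 implements it; `group_by_year_month` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import date, timedelta


def group_by_year_month(rows):
    """Group (timestamp, value) rows under hive-style year/month directories."""
    groups = defaultdict(list)
    for ts, value in rows:
        groups[f"year={ts.year}/month={ts.month}"].append((ts, value))
    return dict(groups)


# 1,000 consecutive days starting at an arbitrary date
rows = [(date(2023, 1, 1) + timedelta(days=i), i) for i in range(1000)]
groups = group_by_year_month(rows)

for directory, members in list(groups.items())[:3]:
    print(directory, len(members))
```

Each key corresponds to one partition directory, so daily data spanning several years produces one directory per calendar month.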

📥 Reading & Converting Data

dataset.load(update_metadata=True)

# Flexible data format conversion
pt = dataset.t                  # PyDala Table
df_polars = pt.to_polars()      # Convert to Polars
df_pandas = pt.to_pandas()      # Convert to Pandas
df_arrow = pt.to_arrow()        # Convert to Arrow
rel_ddb = pt.to_ddb()           # Convert to a DuckDB relation

# and many more...

🔍 Smart Querying

# Efficient filtered reads with predicate pushdown
pt_filtered = dataset.filter("timestamp > '2023-01-01'")

# Chaining operations
df_filtered = (
    dataset
    .filter("column_name > 100")
    .pl.with_columns(
        pl.col("column_name").str.slice(0, 5).alias("new_column_name")
    )
    .to_pandas()
)

# Fast metadata-only scans
pt_scanned = dataset.scan("column_name > 100")

# Access matching files
matching_files = dataset.scan_files
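A metadata-only scan can decide which files are worth reading by looking at per-file column statistics instead of the data itself. The sketch below shows that pruning idea with plain Python, assuming each file records a min and max for the filtered column; `FileStats` and `files_matching_greater_than` are hypothetical names and do not reflect PyDala2's internals.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file min/max statistics for one column (hypothetical structure)."""
    path: str
    min_value: int
    max_value: int


def files_matching_greater_than(stats, threshold):
    """Return paths of files that may contain rows with value > threshold."""
    # A file can be skipped only when even its largest value fails the predicate.
    return [s.path for s in stats if s.max_value > threshold]


stats = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 50, 150),
    FileStats("part-2.parquet", 200, 300),
]
print(files_matching_greater_than(stats, 100))
# → ['part-1.parquet', 'part-2.parquet']
```

Files that survive pruning may still contain non-matching rows (as `part-1.parquet` does here), so the row-level filter is still applied when they are read.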

🔄 Metadata Management

# Incremental metadata update
dataset.load(update_metadata=True)   # Update for new files

# Full metadata refresh
dataset.load(reload_metadata=True)   # Reload all metadata

# Repair schema/metadata
dataset.repair_schema()
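The difference between an incremental update and a full reload can be pictured as a set comparison between the files already tracked and the files currently on disk. This is a simplified sketch of that idea, not PyDala2's actual logic; `plan_metadata_update` is a hypothetical helper, and dropping entries for files that have disappeared is an assumption of this example.

```python
def plan_metadata_update(known_files, files_on_disk, reload=False):
    """Return (files whose metadata must be read, tracked files to drop)."""
    known, current = set(known_files), set(files_on_disk)
    # A full reload re-reads everything; an incremental update reads only new files.
    to_read = current if reload else current - known
    # Assumption: entries for files that no longer exist are dropped either way.
    to_drop = known - current
    return sorted(to_read), sorted(to_drop)


known = ["a.parquet", "b.parquet"]
on_disk = ["b.parquet", "c.parquet"]

print(plan_metadata_update(known, on_disk))
# → (['c.parquet'], ['a.parquet'])
print(plan_metadata_update(known, on_disk, reload=True))
# → (['b.parquet', 'c.parquet'], ['a.parquet'])
```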

⚡ Performance Optimization Tools

# Optimize storage types
dataset.opt_dtypes()              # Automatic type optimization

# Smart file management
dataset.compact_by_rows(max_rows=100_000)  # Combine small files
dataset.repartition(partitioning_columns=["date"])  # Optimize partitions
dataset.compact_by_timeperiod(interval="1d")  # Time-based optimization
dataset.compact_partitions()  # Partition structure optimization
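Row-based compaction boils down to deciding which small files to merge so that each output file stays near a target row count. The following greedy grouping is one simple way to plan that, shown as a sketch under stated assumptions; it is not taken from PyDala2, and `plan_compaction` and the file names are hypothetical.

```python
def plan_compaction(file_rows, max_rows):
    """Greedily group consecutive files so each group stays within max_rows.

    A single file that already exceeds max_rows ends up in a group of its own.
    """
    groups, current, current_rows = [], [], 0
    for path, n_rows in file_rows:
        # Close the current group if adding this file would overflow it.
        if current and current_rows + n_rows > max_rows:
            groups.append(current)
            current, current_rows = [], 0
        current.append(path)
        current_rows += n_rows
    if current:
        groups.append(current)
    return groups


files = [
    ("a.parquet", 40_000),
    ("b.parquet", 50_000),
    ("c.parquet", 30_000),
    ("d.parquet", 120_000),
]
print(plan_compaction(files, max_rows=100_000))
# → [['a.parquet', 'b.parquet'], ['c.parquet'], ['d.parquet']]
```

Keeping files in their original order preserves any existing sort order within a partition, at the cost of slightly less tightly packed groups than a bin-packing approach would give.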

⚠️ Important Notes

Type optimization involves a full dataset rewrite.
Choose a compaction strategy based on your access patterns.
Regular metadata updates ensure optimal query performance.

📚 Documentation

There is a comprehensive tutorial available to help you get started with PyDala2, covering all features and functionalities in detail.

Note: The tutorial was generated with Code2Tutorial.

🤝 Contributing

Contributions welcome! See our contribution guidelines.

📝 License

MIT License

Project details

Release history

0.22.5

Oct 21, 2025

0.22.4

Oct 20, 2025

0.22.3

Oct 15, 2025

0.22.2

Oct 14, 2025

0.22.1

Oct 14, 2025

0.22.0

Oct 14, 2025

0.21.5

Sep 24, 2025

0.21.4

Sep 24, 2025

0.21.3

Sep 15, 2025

0.21.2

Sep 15, 2025

0.21.1

Sep 15, 2025

This version

0.21.0

Sep 15, 2025

0.20.0

Sep 15, 2025

0.9.9

Jul 29, 2025

0.9.8

Jul 15, 2025

0.9.7.7

May 16, 2025

0.9.7.6

May 15, 2025

0.9.7.4

Mar 26, 2025

0.9.7.3

Mar 18, 2025

0.9.7.2

Mar 18, 2025

0.9.7.1

Mar 18, 2025

0.9.7

Feb 20, 2025

0.9.6

Jan 29, 2025

0.9.5.1

Jan 20, 2025

0.9.5

Jan 20, 2025

0.9.4.5

Dec 18, 2024

0.9.4.4

Dec 18, 2024

0.9.4.3

Dec 18, 2024

0.9.4.2

Dec 18, 2024

0.9.4.1

Dec 18, 2024

0.9.3.18

Dec 12, 2024

0.9.3.17

Dec 11, 2024

0.9.3.16

Dec 11, 2024

0.9.3.15

Dec 11, 2024

0.9.3.14

Dec 10, 2024

0.9.3.13

Dec 10, 2024

0.9.3.12

Dec 9, 2024

0.9.3.11

Dec 9, 2024

0.9.3.10

Dec 9, 2024

0.9.3.9

Dec 9, 2024

0.9.3.8

Dec 9, 2024

0.9.3.7

Dec 9, 2024

0.9.3.6

Dec 9, 2024

0.9.3.5

Dec 9, 2024

0.9.3.4

Dec 9, 2024

0.9.3.3

Dec 9, 2024

0.9.3.2

Dec 6, 2024

0.9.3.1

Dec 5, 2024

0.9.3

Dec 5, 2024

0.9.2.3

Dec 5, 2024

0.9.2.2

Dec 5, 2024

0.9.2.1

Dec 5, 2024

0.9.2

Dec 5, 2024

0.9.1.11

Dec 5, 2024

0.9.1.10

Dec 5, 2024

0.9.1.9

Dec 5, 2024

0.9.1.8

Dec 5, 2024

0.9.1.7

Dec 5, 2024

0.9.1.6

Dec 5, 2024

0.9.1.5

Dec 4, 2024

0.9.1.4

Dec 4, 2024

0.9.1.3

Dec 4, 2024

0.9.1.2

Dec 4, 2024

0.9.1.1

Dec 4, 2024

0.9.1

Dec 4, 2024

0.9.0

Nov 22, 2024

0.8.8.3

Oct 23, 2024

0.8.8.2

Oct 23, 2024

0.8.8.1

Oct 23, 2024

0.8.8

Oct 23, 2024

0.8.7.10

Oct 21, 2024

0.8.7.9

Oct 21, 2024

0.8.7.8

Oct 21, 2024

0.8.7.7

Oct 18, 2024

0.8.7.6

Oct 18, 2024

0.8.7.5

Oct 18, 2024

0.8.7.4

Oct 18, 2024

0.8.7.3

Oct 18, 2024

0.8.7.2

Oct 18, 2024

0.8.7.1

Oct 10, 2024

0.8.7

Aug 21, 2024

0.8.6

Aug 20, 2024

0.8.3.7

Aug 14, 2024

0.8.3.6

Aug 14, 2024

0.8.3.5

Aug 14, 2024

0.8.3.4

Aug 14, 2024

0.8.3.3

Aug 13, 2024

0.8.3.2

Aug 13, 2024

0.8.3.1

Aug 13, 2024

Download files

Source Distribution

pydala2-0.21.0.tar.gz (305.2 kB)

Uploaded Sep 15, 2025 Source

Built Distribution

pydala2-0.21.0-py3-none-any.whl (58.3 kB)

Uploaded Sep 15, 2025 Python 3

File details

Details for the file pydala2-0.21.0.tar.gz.

File metadata

Download URL: pydala2-0.21.0.tar.gz
Upload date: Sep 15, 2025
Size: 305.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.9

File hashes

Hashes for pydala2-0.21.0.tar.gz
Algorithm	Hash digest
SHA256	`006fd0fdf0aa2e9c8a8fbb64252c69032c7f125b1f7c2771a391fafe6a5aef74`
MD5	`10067c762a83c419c5fd29a92ee7b6c3`
BLAKE2b-256	`d5ae2aa09cf98b2a23d47c479c1845bdbe2fae8552d068fa53f8a5d60d738714`

File details

Details for the file pydala2-0.21.0-py3-none-any.whl.

File metadata

Download URL: pydala2-0.21.0-py3-none-any.whl
Upload date: Sep 15, 2025
Size: 58.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.9

File hashes

Hashes for pydala2-0.21.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ad99f8a18495f54332454b58912c0a2594691baa6e0c6c0189173e76bc7672d7`
MD5	`1ab3690f390b1ec25e8eef6376aa3d31`
BLAKE2b-256	`f4b3a23e7e75a82b4481676a8ef55e8f75f50c71eec8c25297e6eaf5b0963c66`

pydala2 0.21.0
