Skip to main content

A Pandas-like data manipulation framework with automatic SQL generation

Project description

DataStore

CI/CD codecov PyPI version Python versions License

⚠️ EXPERIMENTAL: This project is currently in experimental stage. APIs may change without notice. Not recommended for production use yet.

A Pandas-like data manipulation framework powered by chDB (ClickHouse) with automatic SQL generation and execution capabilities. Query files, databases, and cloud storage with a unified interface.

Features

  • Fluent API: Pandas-like interface for data manipulation
  • Immutable Operations: Thread-safe method chaining
  • Unified Interface: Query files, databases, and cloud storage with the same API
  • 20+ Data Sources: Local files, S3, Azure, GCS, HDFS, MySQL, PostgreSQL, MongoDB, Redis, SQLite, ClickHouse, and more
  • Data Lake Support: Iceberg, Delta Lake, Hudi table formats
  • Format Auto-Detection: Automatically detect file formats from extensions
  • SQL Generation: Automatic conversion to optimized SQL queries
  • Type-Safe: Comprehensive type hints and validation
  • Extensible: Easy to add custom functions and data sources

Quick Start

Installation

pip install chdb-ds

Basic Usage

from datastore import DataStore

# Query local files
ds = DataStore.from_file("data.parquet")
result = ds.select("*").filter(ds.age > 18).execute()

# Query S3
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
result = ds.select("name", "age").limit(10).execute()

# Query MySQL
ds = DataStore.from_mysql(
    host="localhost:3306",
    database="mydb",
    table="users",
    user="root",
    password="pass"
)
result = ds.select("*").filter(ds.active == True).execute()

# Build complex queries with method chaining
query = (ds
    .select("name", "age", "city")
    .filter(ds.age > 18)
    .filter(ds.city == "NYC")
    .sort("name")
    .limit(10))

# Generate SQL
print(query.to_sql())
# Output: SELECT "name", "age", "city" FROM mysql(...) 
#         WHERE ("age" > 18 AND "city" = 'NYC') 
#         ORDER BY "name" ASC LIMIT 10

# Execute query
result = query.execute()

Working with Expressions

from datastore import Field, Sum, Count

# Arithmetic operations
ds.select(
    ds.price * 1.1,  # 10% price increase
    (ds.revenue - ds.cost).as_("profit")
)

# Aggregate functions
ds.groupby("category").select(
    Field("category"),
    Sum(Field("amount"), alias="total"),
    Count("*", alias="count")
)

Conditions

# Simple conditions
ds.filter(ds.age > 18)
ds.filter(ds.status == "active")

# Complex conditions
ds.filter(
    ((ds.age > 18) & (ds.age < 65)) | 
    (ds.status == "premium")
)

# Negation
ds.filter(~(ds.deleted == True))

Supported Data Sources

DataStore provides factory methods for easy data source creation:

Local Files

# Automatically detect format from extension
ds = DataStore.from_file("data.parquet")
ds = DataStore.from_file("data.csv", format="CSV")
ds = DataStore.from_file("data.json", format="JSONEachRow")

Cloud Storage

# Amazon S3
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
ds = DataStore.from_s3("s3://bucket/*.csv", 
                       access_key_id="KEY",
                       secret_access_key="SECRET")

# Azure Blob Storage
ds = DataStore.from_azure(
    connection_string="DefaultEndpointsProtocol=https;...",
    container="mycontainer",
    path="data/*.parquet"
)

# Google Cloud Storage
ds = DataStore.from_gcs("gs://bucket/data.parquet",
                        hmac_key="KEY",
                        hmac_secret="SECRET")

# HDFS
ds = DataStore.from_hdfs("hdfs://namenode:9000/data/*.parquet")

Databases

# MySQL
ds = DataStore.from_mysql("localhost:3306", "mydb", "users",
                          user="root", password="pass")

# PostgreSQL
ds = DataStore.from_postgresql("localhost:5432", "mydb", "users",
                               user="postgres", password="pass")

# ClickHouse (remote)
ds = DataStore.from_clickhouse("localhost:9000", "default", "events")

# MongoDB (read-only)
ds = DataStore.from_mongodb("localhost:27017", "mydb", "users",
                            user="admin", password="pass")

# SQLite (read-only)
ds = DataStore.from_sqlite("/path/to/database.db", "users")

# Redis
ds = DataStore.from_redis("localhost:6379", 
                          key="key",
                          structure="key String, value String")

Data Lakes

# Apache Iceberg (read-only)
ds = DataStore.from_iceberg("s3://warehouse/my_table",
                            access_key_id="KEY",
                            secret_access_key="SECRET")

# Delta Lake (read-only)
ds = DataStore.from_delta("s3://bucket/delta_table",
                          access_key_id="KEY",
                          secret_access_key="SECRET")

# Apache Hudi (read-only)
ds = DataStore.from_hudi("s3://bucket/hudi_table",
                         access_key_id="KEY",
                         secret_access_key="SECRET")

Data Generation

# Generate number sequences
ds = DataStore.from_numbers(100)  # 0 to 99
ds = DataStore.from_numbers(10, start=10)  # 10 to 19
ds = DataStore.from_numbers(10, start=0, step=2)  # Even numbers

# Generate random data for testing
ds = DataStore.from_random(
    structure="id UInt32, name String, value Float64",
    random_seed=42,
    max_string_length=20
)

URL/HTTP

ds = DataStore.from_url("https://example.com/data.json",
                        format="JSONEachRow")

Multi-Source Queries

# Join data from different sources
csv_data = DataStore.from_file("sales.csv", format="CSV")
mysql_data = DataStore.from_mysql("localhost:3306", "mydb", "customers",
                                  user="root", password="pass")

result = (mysql_data
    .join(csv_data, left_on="id", right_on="customer_id")
    .select("name", "product", "revenue")
    .filter(csv_data.date >= '2024-01-01')
    .execute())

Format Settings

Optimize performance with format-specific settings:

# CSV settings
ds = DataStore.from_file("data.csv", format="CSV")
ds = ds.with_format_settings(
    format_csv_delimiter=',',
    input_format_csv_skip_first_lines=1,
    input_format_csv_trim_whitespaces=1
)

# Parquet optimization
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
ds = ds.with_format_settings(
    input_format_parquet_filter_push_down=1,
    input_format_parquet_bloom_filter_push_down=1
)

# JSON settings
ds = DataStore.from_file("data.json", format="JSONEachRow")
ds = ds.with_format_settings(
    input_format_json_validate_types_from_metadata=1,
    input_format_json_ignore_unnecessary_fields=1
)

Design Philosophy

DataStore is inspired by pypika's excellent query builder design but focuses on:

  1. High-level API: Pandas-like interface for data scientists
  2. Query Execution: Built-in execution capabilities (not just SQL generation)
  3. Data Source Abstraction: Unified interface across different backends
  4. Modern Python: Type hints, dataclasses, and Python 3.7+ features

Key Design Patterns

1. Immutability via @immutable Decorator

from datastore.utils import immutable

class DataStore:
    @immutable
    def select(self, *fields):
        self._select_fields.extend(fields)
        # Decorator handles copying and returning new instance

2. Operator Overloading

# Natural Python syntax
ds.age > 18          # BinaryCondition('>', Field('age'), Literal(18))
ds.price * 1.1       # ArithmeticExpression('*', Field('price'), Literal(1.1))
(cond1) & (cond2)    # CompoundCondition('AND', cond1, cond2)

3. Smart Value Wrapping

Expression.wrap(42)        # Literal(42)
Expression.wrap("hello")   # Literal("hello")
Expression.wrap(None)      # Literal(None)
Expression.wrap(Field('x'))# Field('x') (unchanged)

Development

Running Tests

# Run all tests
python -m pytest datastore/tests/

# Run specific test file
python -m pytest datastore/tests/test_expressions.py

# Run with coverage
python -m pytest --cov=datastore datastore/tests/

# Generate HTML coverage report
python -m pytest --cov=datastore --cov-report=html datastore/tests/
# Open htmlcov/index.html in browser to view detailed coverage

Running Individual Test Modules

# Test expressions
python -m unittest datastore.tests.test_expressions

# Test conditions
python -m unittest datastore.tests.test_conditions

# Test functions
python -m unittest datastore.tests.test_functions

# Test core DataStore
python -m unittest datastore.tests.test_datastore_core

Roadmap

  • Core expression system
  • Condition system
  • Function system
  • Basic DataStore operations
  • Immutability support
  • ClickHouse table functions and formats support
  • DataFrame operations (drop, assign, fillna, etc.)
  • Query executors
  • Multiple backend support
  • Mock data support
  • Schema management(infer or set manually)
  • ClickHouse functions support
  • Connection managers
  • Image, Video, Audio data support
  • PyTorch DataLoader integration

Examples

For more comprehensive examples, see:

  • examples/examples_table_functions.py - Complete examples for all data sources including:
    • Local files (CSV, Parquet, JSON, ORC, Avro and 80+ formats)
    • Cloud storage (S3, Azure, GCS, HDFS, HTTP and 20+ protocols)
    • Databases (MySQL, PostgreSQL, MongoDB, Redis, SQLite, ClickHouse)
    • Data lakes (Iceberg, Delta Lake, Hudi)
    • Data generation (numbers, random data)
    • Multi-source joins
    • Format-specific optimization settings

License

Apache License 2.0

Credits

Built with and inspired by:

  • chDB - Embedded ClickHouse engine for Python
  • ClickHouse - Fast open-source OLAP database
  • Pandas - DataFrame API design
  • PyPika - Query builder patterns
  • SQLAlchemy - ORM and query builder concepts

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chdb_ds-0.0.0.tar.gz (88.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

chdb_ds-0.0.0-py3-none-any.whl (41.8 kB view details)

Uploaded Python 3

File details

Details for the file chdb_ds-0.0.0.tar.gz.

File metadata

  • Download URL: chdb_ds-0.0.0.tar.gz
  • Upload date:
  • Size: 88.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chdb_ds-0.0.0.tar.gz
Algorithm Hash digest
SHA256 02b1946d3814a4539a3de1c666315fa896eda8dad4c7fca007933e82fa9611ea
MD5 21fe45d80d53e42d1c7f4d7176925fb4
BLAKE2b-256 a4b3ad0d4d34fc77d65d7d72ccab571cfcf0c4a1cc33e6dc480d77c1d689be00

See more details on using hashes here.

Provenance

The following attestation bundles were made for chdb_ds-0.0.0.tar.gz:

Publisher: publish.yml on auxten/chdb-ds

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file chdb_ds-0.0.0-py3-none-any.whl.

File metadata

  • Download URL: chdb_ds-0.0.0-py3-none-any.whl
  • Upload date:
  • Size: 41.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chdb_ds-0.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b71aff9f41542e56797b4446f2183e475e2f30e7a3e5ca1eaf181656dec70d38
MD5 04e01188bf7ce1bd1bc9e4b12d9c9784
BLAKE2b-256 63201ecc7ec1a97ddd844297582f0d7001feca0c97daaf552bb686f53ae65584

See more details on using hashes here.

Provenance

The following attestation bundles were made for chdb_ds-0.0.0-py3-none-any.whl:

Publisher: publish.yml on auxten/chdb-ds

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page