Skip to main content

A Pandas-like data manipulation framework with automatic SQL generation

Project description

DataStore

CI/CD codecov PyPI version Python versions License

⚠️ EXPERIMENTAL: This project is currently in experimental stage. APIs may change without notice. Not recommended for production use yet.

A Pandas-like data manipulation framework powered by chDB (ClickHouse) with automatic SQL generation and execution capabilities. Query files, databases, and cloud storage with a unified interface.

Features

  • Fluent API: Pandas-like interface for data manipulation
  • Immutable Operations: Thread-safe method chaining
  • Unified Interface: Query files, databases, and cloud storage with the same API
  • 20+ Data Sources: Local files, S3, Azure, GCS, HDFS, MySQL, PostgreSQL, MongoDB, Redis, SQLite, ClickHouse, and more
  • Data Lake Support: Iceberg, Delta Lake, Hudi table formats
  • Format Auto-Detection: Automatically detect file formats from extensions
  • SQL Generation: Automatic conversion to optimized SQL queries
  • Type-Safe: Comprehensive type hints and validation
  • Extensible: Easy to add custom functions and data sources

Quick Start

Installation

pip install chdb-ds

Basic Usage

from datastore import DataStore

# Query local files
ds = DataStore.from_file("data.parquet")
result = ds.select("*").filter(ds.age > 18).execute()

# Query S3
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
result = ds.select("name", "age").limit(10).execute()

# Query MySQL
ds = DataStore.from_mysql(
    host="localhost:3306",
    database="mydb",
    table="users",
    user="root",
    password="pass"
)
result = ds.select("*").filter(ds.active == True).execute()

# Build complex queries with method chaining
query = (ds
    .select("name", "age", "city")
    .filter(ds.age > 18)
    .filter(ds.city == "NYC")
    .sort("name")
    .limit(10))

# Generate SQL
print(query.to_sql())
# Output: SELECT "name", "age", "city" FROM mysql(...) 
#         WHERE ("age" > 18 AND "city" = 'NYC') 
#         ORDER BY "name" ASC LIMIT 10

# Execute query
result = query.execute()

Working with Expressions

from datastore import Field, Sum, Count

# Arithmetic operations
ds.select(
    ds.price * 1.1,  # 10% price increase
    (ds.revenue - ds.cost).as_("profit")
)

# Aggregate functions
ds.groupby("category").select(
    Field("category"),
    Sum(Field("amount"), alias="total"),
    Count("*", alias="count")
)

Conditions

# Simple conditions
ds.filter(ds.age > 18)
ds.filter(ds.status == "active")

# Complex conditions
ds.filter(
    ((ds.age > 18) & (ds.age < 65)) | 
    (ds.status == "premium")
)

# Negation
ds.filter(~(ds.deleted == True))

Supported Data Sources

DataStore provides factory methods for easy data source creation:

Local Files

# Automatically detect format from extension
ds = DataStore.from_file("data.parquet")
ds = DataStore.from_file("data.csv", format="CSV")
ds = DataStore.from_file("data.json", format="JSONEachRow")

Cloud Storage

# Amazon S3
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
ds = DataStore.from_s3("s3://bucket/*.csv", 
                       access_key_id="KEY",
                       secret_access_key="SECRET")

# Azure Blob Storage
ds = DataStore.from_azure(
    connection_string="DefaultEndpointsProtocol=https;...",
    container="mycontainer",
    path="data/*.parquet"
)

# Google Cloud Storage
ds = DataStore.from_gcs("gs://bucket/data.parquet",
                        hmac_key="KEY",
                        hmac_secret="SECRET")

# HDFS
ds = DataStore.from_hdfs("hdfs://namenode:9000/data/*.parquet")

Databases

# MySQL
ds = DataStore.from_mysql("localhost:3306", "mydb", "users",
                          user="root", password="pass")

# PostgreSQL
ds = DataStore.from_postgresql("localhost:5432", "mydb", "users",
                               user="postgres", password="pass")

# ClickHouse (remote)
ds = DataStore.from_clickhouse("localhost:9000", "default", "events")

# MongoDB (read-only)
ds = DataStore.from_mongodb("localhost:27017", "mydb", "users",
                            user="admin", password="pass")

# SQLite (read-only)
ds = DataStore.from_sqlite("/path/to/database.db", "users")

# Redis
ds = DataStore.from_redis("localhost:6379", 
                          key="key",
                          structure="key String, value String")

Data Lakes

# Apache Iceberg (read-only)
ds = DataStore.from_iceberg("s3://warehouse/my_table",
                            access_key_id="KEY",
                            secret_access_key="SECRET")

# Delta Lake (read-only)
ds = DataStore.from_delta("s3://bucket/delta_table",
                          access_key_id="KEY",
                          secret_access_key="SECRET")

# Apache Hudi (read-only)
ds = DataStore.from_hudi("s3://bucket/hudi_table",
                         access_key_id="KEY",
                         secret_access_key="SECRET")

Data Generation

# Generate number sequences
ds = DataStore.from_numbers(100)  # 0 to 99
ds = DataStore.from_numbers(10, start=10)  # 10 to 19
ds = DataStore.from_numbers(10, start=0, step=2)  # Even numbers

# Generate random data for testing
ds = DataStore.from_random(
    structure="id UInt32, name String, value Float64",
    random_seed=42,
    max_string_length=20
)

URL/HTTP

ds = DataStore.from_url("https://example.com/data.json",
                        format="JSONEachRow")

Multi-Source Queries

# Join data from different sources
csv_data = DataStore.from_file("sales.csv", format="CSV")
mysql_data = DataStore.from_mysql("localhost:3306", "mydb", "customers",
                                  user="root", password="pass")

result = (mysql_data
    .join(csv_data, left_on="id", right_on="customer_id")
    .select("name", "product", "revenue")
    .filter(csv_data.date >= '2024-01-01')
    .execute())

Format Settings

Optimize performance with format-specific settings:

# CSV settings
ds = DataStore.from_file("data.csv", format="CSV")
ds = ds.with_format_settings(
    format_csv_delimiter=',',
    input_format_csv_skip_first_lines=1,
    input_format_csv_trim_whitespaces=1
)

# Parquet optimization
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
ds = ds.with_format_settings(
    input_format_parquet_filter_push_down=1,
    input_format_parquet_bloom_filter_push_down=1
)

# JSON settings
ds = DataStore.from_file("data.json", format="JSONEachRow")
ds = ds.with_format_settings(
    input_format_json_validate_types_from_metadata=1,
    input_format_json_ignore_unnecessary_fields=1
)

Design Philosophy

DataStore is inspired by pypika's excellent query builder design but focuses on:

  1. High-level API: Pandas-like interface for data scientists
  2. Query Execution: Built-in execution capabilities (not just SQL generation)
  3. Data Source Abstraction: Unified interface across different backends
  4. Modern Python: Type hints, dataclasses, and Python 3.7+ features

Key Design Patterns

1. Immutability via @immutable Decorator

from datastore.utils import immutable

class DataStore:
    @immutable
    def select(self, *fields):
        self._select_fields.extend(fields)
        # Decorator handles copying and returning new instance

2. Operator Overloading

# Natural Python syntax
ds.age > 18          # BinaryCondition('>', Field('age'), Literal(18))
ds.price * 1.1       # ArithmeticExpression('*', Field('price'), Literal(1.1))
(cond1) & (cond2)    # CompoundCondition('AND', cond1, cond2)

3. Smart Value Wrapping

Expression.wrap(42)        # Literal(42)
Expression.wrap("hello")   # Literal("hello")
Expression.wrap(None)      # Literal(None)
Expression.wrap(Field('x'))# Field('x') (unchanged)

Development

Running Tests

# Run all tests
python -m pytest datastore/tests/

# Run specific test file
python -m pytest datastore/tests/test_expressions.py

# Run with coverage
python -m pytest --cov=datastore datastore/tests/

# Generate HTML coverage report
python -m pytest --cov=datastore --cov-report=html datastore/tests/
# Open htmlcov/index.html in browser to view detailed coverage

Running Individual Test Modules

# Test expressions
python -m unittest datastore.tests.test_expressions

# Test conditions
python -m unittest datastore.tests.test_conditions

# Test functions
python -m unittest datastore.tests.test_functions

# Test core DataStore
python -m unittest datastore.tests.test_datastore_core

Roadmap

  • Core expression system
  • Condition system
  • Function system
  • Basic DataStore operations
  • Immutability support
  • ClickHouse table functions and formats support
  • DataFrame operations (drop, assign, fillna, etc.)
  • Query executors
  • Multiple backend support
  • Mock data support
  • Schema management(infer or set manually)
  • ClickHouse functions support
  • Connection managers
  • Image, Video, Audio data support
  • PyTorch DataLoader integration

Examples

For more comprehensive examples, see:

  • examples/examples_table_functions.py - Complete examples for all data sources including:
    • Local files (CSV, Parquet, JSON, ORC, Avro and 80+ formats)
    • Cloud storage (S3, Azure, GCS, HDFS, HTTP and 20+ protocols)
    • Databases (MySQL, PostgreSQL, MongoDB, Redis, SQLite, ClickHouse)
    • Data lakes (Iceberg, Delta Lake, Hudi)
    • Data generation (numbers, random data)
    • Multi-source joins
    • Format-specific optimization settings

License

Apache License 2.0

Credits

Built with and inspired by:

  • chDB - Embedded ClickHouse engine for Python
  • ClickHouse - Fast open-source OLAP database
  • Pandas - DataFrame API design
  • PyPika - Query builder patterns
  • SQLAlchemy - ORM and query builder concepts

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chdb_ds-0.0.2.tar.gz (88.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

chdb_ds-0.0.2-py3-none-any.whl (42.0 kB view details)

Uploaded Python 3

File details

Details for the file chdb_ds-0.0.2.tar.gz.

File metadata

  • Download URL: chdb_ds-0.0.2.tar.gz
  • Upload date:
  • Size: 88.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chdb_ds-0.0.2.tar.gz
Algorithm Hash digest
SHA256 a7856f8e4814ff101682e5c8a5d96e5486a938dfbe7923f11301071c42df4a29
MD5 e9f467b6704cd02cdf3f606c8141da2e
BLAKE2b-256 2d3969e0e1f45aa8480d882b62a0104fea0f0a05890900879e93c3def6ef7dbf

See more details on using hashes here.

Provenance

The following attestation bundles were made for chdb_ds-0.0.2.tar.gz:

Publisher: publish.yml on auxten/chdb-ds

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file chdb_ds-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: chdb_ds-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 42.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chdb_ds-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 202582a9372b390b09327364eeef6d86aaeb966ca805573ad55445adbddea926
MD5 f9935bdf460a572e89d08dda440c841a
BLAKE2b-256 db15a3090596133bc3a70918534298bdd2e3b821f29d2b14bdbb540c471a76c3

See more details on using hashes here.

Provenance

The following attestation bundles were made for chdb_ds-0.0.2-py3-none-any.whl:

Publisher: publish.yml on auxten/chdb-ds

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page