A Pandas-like data manipulation framework with automatic SQL generation

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

These details have not been verified by PyPI

Project description

DataStore

⚠️ EXPERIMENTAL: This project is currently in experimental stage. APIs may change without notice. Not recommended for production use yet.

A Pandas-like data manipulation framework powered by chDB (ClickHouse) with automatic SQL generation and execution capabilities. Query files, databases, and cloud storage with a unified interface.

Features

Fluent API: Pandas-like interface for data manipulation
Immutable Operations: Thread-safe method chaining
Unified Interface: Query files, databases, and cloud storage with the same API
20+ Data Sources: Local files, S3, Azure, GCS, HDFS, MySQL, PostgreSQL, MongoDB, Redis, SQLite, ClickHouse, and more
Data Lake Support: Iceberg, Delta Lake, Hudi table formats
Format Auto-Detection: Automatically detect file formats from extensions
SQL Generation: Automatic conversion to optimized SQL queries
Type-Safe: Comprehensive type hints and validation
Extensible: Easy to add custom functions and data sources

Quick Start

Installation

pip install chdb-ds

Basic Usage

from datastore import DataStore

# Query local files
ds = DataStore.from_file("data.parquet")
result = ds.select("*").filter(ds.age > 18).execute()

# Query S3
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
result = ds.select("name", "age").limit(10).execute()

# Query MySQL
ds = DataStore.from_mysql(
    host="localhost:3306",
    database="mydb",
    table="users",
    user="root",
    password="pass"
)
result = ds.select("*").filter(ds.active == True).execute()

# Build complex queries with method chaining
query = (ds
    .select("name", "age", "city")
    .filter(ds.age > 18)
    .filter(ds.city == "NYC")
    .sort("name")
    .limit(10))

# Generate SQL
print(query.to_sql())
# Output: SELECT "name", "age", "city" FROM mysql(...) 
#         WHERE ("age" > 18 AND "city" = 'NYC') 
#         ORDER BY "name" ASC LIMIT 10

# Execute query
result = query.execute()

Working with Expressions

from datastore import Field, Sum, Count

# Arithmetic operations
ds.select(
    ds.price * 1.1,  # 10% price increase
    (ds.revenue - ds.cost).as_("profit")
)

# Aggregate functions
ds.groupby("category").select(
    Field("category"),
    Sum(Field("amount"), alias="total"),
    Count("*", alias="count")
)

Conditions

# Simple conditions
ds.filter(ds.age > 18)
ds.filter(ds.status == "active")

# Complex conditions
ds.filter(
    ((ds.age > 18) & (ds.age < 65)) | 
    (ds.status == "premium")
)

# Negation
ds.filter(~(ds.deleted == True))

Supported Data Sources

DataStore provides factory methods for easy data source creation:

Local Files

# Automatically detect format from extension
ds = DataStore.from_file("data.parquet")
ds = DataStore.from_file("data.csv", format="CSV")
ds = DataStore.from_file("data.json", format="JSONEachRow")

Cloud Storage

# Amazon S3
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
ds = DataStore.from_s3("s3://bucket/*.csv", 
                       access_key_id="KEY",
                       secret_access_key="SECRET")

# Azure Blob Storage
ds = DataStore.from_azure(
    connection_string="DefaultEndpointsProtocol=https;...",
    container="mycontainer",
    path="data/*.parquet"
)

# Google Cloud Storage
ds = DataStore.from_gcs("gs://bucket/data.parquet",
                        hmac_key="KEY",
                        hmac_secret="SECRET")

# HDFS
ds = DataStore.from_hdfs("hdfs://namenode:9000/data/*.parquet")

Databases

# MySQL
ds = DataStore.from_mysql("localhost:3306", "mydb", "users",
                          user="root", password="pass")

# PostgreSQL
ds = DataStore.from_postgresql("localhost:5432", "mydb", "users",
                               user="postgres", password="pass")

# ClickHouse (remote)
ds = DataStore.from_clickhouse("localhost:9000", "default", "events")

# MongoDB (read-only)
ds = DataStore.from_mongodb("localhost:27017", "mydb", "users",
                            user="admin", password="pass")

# SQLite (read-only)
ds = DataStore.from_sqlite("/path/to/database.db", "users")

# Redis
ds = DataStore.from_redis("localhost:6379", 
                          key="key",
                          structure="key String, value String")

Data Lakes

# Apache Iceberg (read-only)
ds = DataStore.from_iceberg("s3://warehouse/my_table",
                            access_key_id="KEY",
                            secret_access_key="SECRET")

# Delta Lake (read-only)
ds = DataStore.from_delta("s3://bucket/delta_table",
                          access_key_id="KEY",
                          secret_access_key="SECRET")

# Apache Hudi (read-only)
ds = DataStore.from_hudi("s3://bucket/hudi_table",
                         access_key_id="KEY",
                         secret_access_key="SECRET")

Data Generation

# Generate number sequences
ds = DataStore.from_numbers(100)  # 0 to 99
ds = DataStore.from_numbers(10, start=10)  # 10 to 19
ds = DataStore.from_numbers(10, start=0, step=2)  # Even numbers

# Generate random data for testing
ds = DataStore.from_random(
    structure="id UInt32, name String, value Float64",
    random_seed=42,
    max_string_length=20
)

URL/HTTP

ds = DataStore.from_url("https://example.com/data.json",
                        format="JSONEachRow")

Multi-Source Queries

# Join data from different sources
csv_data = DataStore.from_file("sales.csv", format="CSV")
mysql_data = DataStore.from_mysql("localhost:3306", "mydb", "customers",
                                  user="root", password="pass")

result = (mysql_data
    .join(csv_data, left_on="id", right_on="customer_id")
    .select("name", "product", "revenue")
    .filter(csv_data.date >= '2024-01-01')
    .execute())

Format Settings

Optimize performance with format-specific settings:

# CSV settings
ds = DataStore.from_file("data.csv", format="CSV")
ds = ds.with_format_settings(
    format_csv_delimiter=',',
    input_format_csv_skip_first_lines=1,
    input_format_csv_trim_whitespaces=1
)

# Parquet optimization
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
ds = ds.with_format_settings(
    input_format_parquet_filter_push_down=1,
    input_format_parquet_bloom_filter_push_down=1
)

# JSON settings
ds = DataStore.from_file("data.json", format="JSONEachRow")
ds = ds.with_format_settings(
    input_format_json_validate_types_from_metadata=1,
    input_format_json_ignore_unnecessary_fields=1
)

Design Philosophy

DataStore is inspired by pypika's excellent query builder design but focuses on:

High-level API: Pandas-like interface for data scientists
Query Execution: Built-in execution capabilities (not just SQL generation)
Data Source Abstraction: Unified interface across different backends
Modern Python: Type hints, dataclasses, and Python 3.7+ features

Key Design Patterns

1. Immutability via @immutable Decorator

from datastore.utils import immutable

class DataStore:
    @immutable
    def select(self, *fields):
        self._select_fields.extend(fields)
        # Decorator handles copying and returning new instance

2. Operator Overloading

# Natural Python syntax
ds.age > 18          # BinaryCondition('>', Field('age'), Literal(18))
ds.price * 1.1       # ArithmeticExpression('*', Field('price'), Literal(1.1))
(cond1) & (cond2)    # CompoundCondition('AND', cond1, cond2)

3. Smart Value Wrapping

Expression.wrap(42)        # Literal(42)
Expression.wrap("hello")   # Literal("hello")
Expression.wrap(None)      # Literal(None)
Expression.wrap(Field('x'))# Field('x') (unchanged)

Development

Running Tests

# Run all tests
python -m pytest datastore/tests/

# Run specific test file
python -m pytest datastore/tests/test_expressions.py

# Run with coverage
python -m pytest --cov=datastore datastore/tests/

# Generate HTML coverage report
python -m pytest --cov=datastore --cov-report=html datastore/tests/
# Open htmlcov/index.html in browser to view detailed coverage

Running Individual Test Modules

# Test expressions
python -m unittest datastore.tests.test_expressions

# Test conditions
python -m unittest datastore.tests.test_conditions

# Test functions
python -m unittest datastore.tests.test_functions

# Test core DataStore
python -m unittest datastore.tests.test_datastore_core

Roadmap

Core expression system
Condition system
Function system
Basic DataStore operations
Immutability support
ClickHouse table functions and formats support
DataFrame operations (drop, assign, fillna, etc.)
Query executors
Multiple backend support
Mock data support
Schema management(infer or set manually)
ClickHouse functions support
Connection managers
Image, Video, Audio data support
PyTorch DataLoader integration

Examples

For more comprehensive examples, see:

examples/examples_table_functions.py - Complete examples for all data sources including:
- Local files (CSV, Parquet, JSON, ORC, Avro and 80+ formats)
- Cloud storage (S3, Azure, GCS, HDFS, HTTP and 20+ protocols)
- Databases (MySQL, PostgreSQL, MongoDB, Redis, SQLite, ClickHouse)
- Data lakes (Iceberg, Delta Lake, Hudi)
- Data generation (numbers, random data)
- Multi-source joins
- Format-specific optimization settings

License

Apache License 2.0

Credits

Built with and inspired by:

chDB - Embedded ClickHouse engine for Python
ClickHouse - Fast open-source OLAP database
Pandas - DataFrame API design
PyPika - Query builder patterns
SQLAlchemy - ORM and query builder concepts

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

Auxten.wpc

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.0.8

Jan 16, 2026

0.0.7

Dec 19, 2025

0.0.6

Dec 19, 2025

0.0.5

Dec 15, 2025

0.0.3

Oct 12, 2025

0.0.2

Oct 8, 2025

This version

0.0.0

Oct 6, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chdb_ds-0.0.0.tar.gz (88.6 kB view details)

Uploaded Oct 6, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

chdb_ds-0.0.0-py3-none-any.whl (41.8 kB view details)

Uploaded Oct 6, 2025 Python 3

File details

Details for the file chdb_ds-0.0.0.tar.gz.

File metadata

Download URL: chdb_ds-0.0.0.tar.gz
Upload date: Oct 6, 2025
Size: 88.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chdb_ds-0.0.0.tar.gz
Algorithm	Hash digest
SHA256	`02b1946d3814a4539a3de1c666315fa896eda8dad4c7fca007933e82fa9611ea`
MD5	`21fe45d80d53e42d1c7f4d7176925fb4`
BLAKE2b-256	`a4b3ad0d4d34fc77d65d7d72ccab571cfcf0c4a1cc33e6dc480d77c1d689be00`

See more details on using hashes here.

Provenance

The following attestation bundles were made for chdb_ds-0.0.0.tar.gz:

Publisher: publish.yml on auxten/chdb-ds

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: chdb_ds-0.0.0.tar.gz
- Subject digest: 02b1946d3814a4539a3de1c666315fa896eda8dad4c7fca007933e82fa9611ea
- Sigstore transparency entry: 584799194
- Sigstore integration time: Oct 6, 2025
Source repository:
- Permalink: auxten/chdb-ds@4b23688b2e6d8003d45444c3db9d8d054087cfc7
- Branch / Tag: refs/tags/v0.0.0
- Owner: https://github.com/auxten
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4b23688b2e6d8003d45444c3db9d8d054087cfc7
- Trigger Event: push

File details

Details for the file chdb_ds-0.0.0-py3-none-any.whl.

File metadata

Download URL: chdb_ds-0.0.0-py3-none-any.whl
Upload date: Oct 6, 2025
Size: 41.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chdb_ds-0.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b71aff9f41542e56797b4446f2183e475e2f30e7a3e5ca1eaf181656dec70d38`
MD5	`04e01188bf7ce1bd1bc9e4b12d9c9784`
BLAKE2b-256	`63201ecc7ec1a97ddd844297582f0d7001feca0c97daaf552bb686f53ae65584`

See more details on using hashes here.

Provenance

The following attestation bundles were made for chdb_ds-0.0.0-py3-none-any.whl:

Publisher: publish.yml on auxten/chdb-ds

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: chdb_ds-0.0.0-py3-none-any.whl
- Subject digest: b71aff9f41542e56797b4446f2183e475e2f30e7a3e5ca1eaf181656dec70d38
- Sigstore transparency entry: 584799199
- Sigstore integration time: Oct 6, 2025
Source repository:
- Permalink: auxten/chdb-ds@4b23688b2e6d8003d45444c3db9d8d054087cfc7
- Branch / Tag: refs/tags/v0.0.0
- Owner: https://github.com/auxten
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4b23688b2e6d8003d45444c3db9d8d054087cfc7
- Trigger Event: push

chdb-ds 0.0.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Meta

Unverified details

Meta

Classifiers

Project description

DataStore

Features

Quick Start

Installation

Basic Usage

Working with Expressions

Conditions

Supported Data Sources

Local Files

Cloud Storage

Databases

Data Lakes

Data Generation

URL/HTTP

Multi-Source Queries

Format Settings

Design Philosophy

Key Design Patterns

1. Immutability via @immutable Decorator

2. Operator Overloading

3. Smart Value Wrapping

Development

Running Tests

Running Individual Test Modules

Roadmap

Examples

License

Credits

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Meta

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance