A Pandas-like data manipulation framework with automatic SQL generation
Project description
DataStore
⚠️ EXPERIMENTAL: This project is currently in experimental stage. APIs may change without notice. Not recommended for production use yet.
A Pandas-like data manipulation framework powered by chDB (ClickHouse) with automatic SQL generation and execution capabilities. Query files, databases, and cloud storage with a unified interface.
Features
- Fluent API: Pandas-like interface for data manipulation
- Immutable Operations: Thread-safe method chaining
- Unified Interface: Query files, databases, and cloud storage with the same API
- 20+ Data Sources: Local files, S3, Azure, GCS, HDFS, MySQL, PostgreSQL, MongoDB, Redis, SQLite, ClickHouse, and more
- Data Lake Support: Iceberg, Delta Lake, Hudi table formats
- Format Auto-Detection: Automatically detect file formats from extensions
- SQL Generation: Automatic conversion to optimized SQL queries
- Type-Safe: Comprehensive type hints and validation
- Extensible: Easy to add custom functions and data sources
Quick Start
Installation
pip install chdb-ds
Basic Usage
from datastore import DataStore
# Query local files
ds = DataStore.from_file("data.parquet")
result = ds.select("*").filter(ds.age > 18).execute()
# Query S3
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
result = ds.select("name", "age").limit(10).execute()
# Query MySQL
ds = DataStore.from_mysql(
host="localhost:3306",
database="mydb",
table="users",
user="root",
password="pass"
)
result = ds.select("*").filter(ds.active == True).execute()
# Build complex queries with method chaining
query = (ds
.select("name", "age", "city")
.filter(ds.age > 18)
.filter(ds.city == "NYC")
.sort("name")
.limit(10))
# Generate SQL
print(query.to_sql())
# Output: SELECT "name", "age", "city" FROM mysql(...)
# WHERE ("age" > 18 AND "city" = 'NYC')
# ORDER BY "name" ASC LIMIT 10
# Execute query
result = query.execute()
Working with Expressions
from datastore import Field, Sum, Count
# Arithmetic operations
ds.select(
ds.price * 1.1, # 10% price increase
(ds.revenue - ds.cost).as_("profit")
)
# Aggregate functions
ds.groupby("category").select(
Field("category"),
Sum(Field("amount"), alias="total"),
Count("*", alias="count")
)
Conditions
# Simple conditions
ds.filter(ds.age > 18)
ds.filter(ds.status == "active")
# Complex conditions
ds.filter(
((ds.age > 18) & (ds.age < 65)) |
(ds.status == "premium")
)
# Negation
ds.filter(~(ds.deleted == True))
Supported Data Sources
DataStore provides factory methods for easy data source creation:
Local Files
# Automatically detect format from extension
ds = DataStore.from_file("data.parquet")
ds = DataStore.from_file("data.csv", format="CSV")
ds = DataStore.from_file("data.json", format="JSONEachRow")
Cloud Storage
# Amazon S3
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
ds = DataStore.from_s3("s3://bucket/*.csv",
access_key_id="KEY",
secret_access_key="SECRET")
# Azure Blob Storage
ds = DataStore.from_azure(
connection_string="DefaultEndpointsProtocol=https;...",
container="mycontainer",
path="data/*.parquet"
)
# Google Cloud Storage
ds = DataStore.from_gcs("gs://bucket/data.parquet",
hmac_key="KEY",
hmac_secret="SECRET")
# HDFS
ds = DataStore.from_hdfs("hdfs://namenode:9000/data/*.parquet")
Databases
# MySQL
ds = DataStore.from_mysql("localhost:3306", "mydb", "users",
user="root", password="pass")
# PostgreSQL
ds = DataStore.from_postgresql("localhost:5432", "mydb", "users",
user="postgres", password="pass")
# ClickHouse (remote)
ds = DataStore.from_clickhouse("localhost:9000", "default", "events")
# MongoDB (read-only)
ds = DataStore.from_mongodb("localhost:27017", "mydb", "users",
user="admin", password="pass")
# SQLite (read-only)
ds = DataStore.from_sqlite("/path/to/database.db", "users")
# Redis
ds = DataStore.from_redis("localhost:6379",
key="key",
structure="key String, value String")
Data Lakes
# Apache Iceberg (read-only)
ds = DataStore.from_iceberg("s3://warehouse/my_table",
access_key_id="KEY",
secret_access_key="SECRET")
# Delta Lake (read-only)
ds = DataStore.from_delta("s3://bucket/delta_table",
access_key_id="KEY",
secret_access_key="SECRET")
# Apache Hudi (read-only)
ds = DataStore.from_hudi("s3://bucket/hudi_table",
access_key_id="KEY",
secret_access_key="SECRET")
Data Generation
# Generate number sequences
ds = DataStore.from_numbers(100) # 0 to 99
ds = DataStore.from_numbers(10, start=10) # 10 to 19
ds = DataStore.from_numbers(10, start=0, step=2) # Even numbers
# Generate random data for testing
ds = DataStore.from_random(
structure="id UInt32, name String, value Float64",
random_seed=42,
max_string_length=20
)
URL/HTTP
ds = DataStore.from_url("https://example.com/data.json",
format="JSONEachRow")
Multi-Source Queries
# Join data from different sources
csv_data = DataStore.from_file("sales.csv", format="CSV")
mysql_data = DataStore.from_mysql("localhost:3306", "mydb", "customers",
user="root", password="pass")
result = (mysql_data
.join(csv_data, left_on="id", right_on="customer_id")
.select("name", "product", "revenue")
.filter(csv_data.date >= '2024-01-01')
.execute())
Format Settings
Optimize performance with format-specific settings:
# CSV settings
ds = DataStore.from_file("data.csv", format="CSV")
ds = ds.with_format_settings(
format_csv_delimiter=',',
input_format_csv_skip_first_lines=1,
input_format_csv_trim_whitespaces=1
)
# Parquet optimization
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
ds = ds.with_format_settings(
input_format_parquet_filter_push_down=1,
input_format_parquet_bloom_filter_push_down=1
)
# JSON settings
ds = DataStore.from_file("data.json", format="JSONEachRow")
ds = ds.with_format_settings(
input_format_json_validate_types_from_metadata=1,
input_format_json_ignore_unnecessary_fields=1
)
Design Philosophy
DataStore is inspired by pypika's excellent query builder design but focuses on:
- High-level API: Pandas-like interface for data scientists
- Query Execution: Built-in execution capabilities (not just SQL generation)
- Data Source Abstraction: Unified interface across different backends
- Modern Python: Type hints, dataclasses, and Python 3.7+ features
Key Design Patterns
1. Immutability via @immutable Decorator
from datastore.utils import immutable
class DataStore:
@immutable
def select(self, *fields):
self._select_fields.extend(fields)
# Decorator handles copying and returning new instance
2. Operator Overloading
# Natural Python syntax
ds.age > 18 # BinaryCondition('>', Field('age'), Literal(18))
ds.price * 1.1 # ArithmeticExpression('*', Field('price'), Literal(1.1))
(cond1) & (cond2) # CompoundCondition('AND', cond1, cond2)
3. Smart Value Wrapping
Expression.wrap(42) # Literal(42)
Expression.wrap("hello") # Literal("hello")
Expression.wrap(None) # Literal(None)
Expression.wrap(Field('x'))# Field('x') (unchanged)
Development
Running Tests
# Run all tests
python -m pytest datastore/tests/
# Run specific test file
python -m pytest datastore/tests/test_expressions.py
# Run with coverage
python -m pytest --cov=datastore datastore/tests/
# Generate HTML coverage report
python -m pytest --cov=datastore --cov-report=html datastore/tests/
# Open htmlcov/index.html in browser to view detailed coverage
Running Individual Test Modules
# Test expressions
python -m unittest datastore.tests.test_expressions
# Test conditions
python -m unittest datastore.tests.test_conditions
# Test functions
python -m unittest datastore.tests.test_functions
# Test core DataStore
python -m unittest datastore.tests.test_datastore_core
Roadmap
- Core expression system
- Condition system
- Function system
- Basic DataStore operations
- Immutability support
- ClickHouse table functions and formats support
- DataFrame operations (drop, assign, fillna, etc.)
- Query executors
- Multiple backend support
- Mock data support
- Schema management(infer or set manually)
- ClickHouse functions support
- Connection managers
- Image, Video, Audio data support
- PyTorch DataLoader integration
Examples
For more comprehensive examples, see:
- examples/examples_table_functions.py - Complete examples for all data sources including:
- Local files (CSV, Parquet, JSON, ORC, Avro and 80+ formats)
- Cloud storage (S3, Azure, GCS, HDFS, HTTP and 20+ protocols)
- Databases (MySQL, PostgreSQL, MongoDB, Redis, SQLite, ClickHouse)
- Data lakes (Iceberg, Delta Lake, Hudi)
- Data generation (numbers, random data)
- Multi-source joins
- Format-specific optimization settings
License
Apache License 2.0
Credits
Built with and inspired by:
- chDB - Embedded ClickHouse engine for Python
- ClickHouse - Fast open-source OLAP database
- Pandas - DataFrame API design
- PyPika - Query builder patterns
- SQLAlchemy - ORM and query builder concepts
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file chdb_ds-0.0.2.tar.gz.
File metadata
- Download URL: chdb_ds-0.0.2.tar.gz
- Upload date:
- Size: 88.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a7856f8e4814ff101682e5c8a5d96e5486a938dfbe7923f11301071c42df4a29
|
|
| MD5 |
e9f467b6704cd02cdf3f606c8141da2e
|
|
| BLAKE2b-256 |
2d3969e0e1f45aa8480d882b62a0104fea0f0a05890900879e93c3def6ef7dbf
|
Provenance
The following attestation bundles were made for chdb_ds-0.0.2.tar.gz:
Publisher:
publish.yml on auxten/chdb-ds
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
chdb_ds-0.0.2.tar.gz -
Subject digest:
a7856f8e4814ff101682e5c8a5d96e5486a938dfbe7923f11301071c42df4a29 - Sigstore transparency entry: 591369655
- Sigstore integration time:
-
Permalink:
auxten/chdb-ds@4f2b0f9fc9fd6264a6ede65c8e4c829f20b21bf0 -
Branch / Tag:
refs/tags/v0.0.2 - Owner: https://github.com/auxten
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@4f2b0f9fc9fd6264a6ede65c8e4c829f20b21bf0 -
Trigger Event:
push
-
Statement type:
File details
Details for the file chdb_ds-0.0.2-py3-none-any.whl.
File metadata
- Download URL: chdb_ds-0.0.2-py3-none-any.whl
- Upload date:
- Size: 42.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
202582a9372b390b09327364eeef6d86aaeb966ca805573ad55445adbddea926
|
|
| MD5 |
f9935bdf460a572e89d08dda440c841a
|
|
| BLAKE2b-256 |
db15a3090596133bc3a70918534298bdd2e3b821f29d2b14bdbb540c471a76c3
|
Provenance
The following attestation bundles were made for chdb_ds-0.0.2-py3-none-any.whl:
Publisher:
publish.yml on auxten/chdb-ds
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
chdb_ds-0.0.2-py3-none-any.whl -
Subject digest:
202582a9372b390b09327364eeef6d86aaeb966ca805573ad55445adbddea926 - Sigstore transparency entry: 591369665
- Sigstore integration time:
-
Permalink:
auxten/chdb-ds@4f2b0f9fc9fd6264a6ede65c8e4c829f20b21bf0 -
Branch / Tag:
refs/tags/v0.0.2 - Owner: https://github.com/auxten
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@4f2b0f9fc9fd6264a6ede65c8e4c829f20b21bf0 -
Trigger Event:
push
-
Statement type: