A Pandas-like data manipulation framework with automatic SQL generation
Project description
DataStore
⚠️ EXPERIMENTAL: This project is currently in experimental stage. APIs may change without notice. Not recommended for production use yet.
A Pandas-like data manipulation framework powered by chDB (ClickHouse) with automatic SQL generation and execution capabilities. Query files, databases, and cloud storage with a unified interface.
Quick Start
Installation
chdb-ds is rapidly evolving with frequent updates. To get the latest features and fixes, install directly from the repository:
pip install -U git+https://github.com/auxten/chdb-ds.git --break-system-packages
Note:
--break-system-packagesis required on some systems (e.g., macOS with Homebrew Python, Debian/Ubuntu with system Python). Alternatively, use a virtual environment to avoid this flag.
Your First Query (30 seconds)
Just change your import - use the pandas API you already know:
import datastore as pd # That's it! Use pandas API as usual
# Create a DataFrame
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['NYC', 'LA', 'NYC', 'LA']
})
# Filter with familiar pandas syntax
result = df[df['age'] > 26]
print(result)
# name age city
# 1 Bob 30 LA
# 2 Charlie 35 NYC
# 3 Diana 28 LA
# GroupBy works too
print(df.groupby('city')['age'].mean())
# city
# LA 29.0
# NYC 30.0
✨ Zero code changes required. All operations are automatically lazy - they're recorded and compiled into optimized SQL, executed only when results are needed (like print()). Your existing pandas code just runs faster.
Working with Real Data (1 minute)
Query local files with automatic format detection:
from datastore import DataStore
# Load a CSV file
ds = DataStore.from_file("sales.csv")
# Explore your data
print(ds.head()) # Preview first 5 rows
print(ds.shape) # (10000, 7) - rows × columns
print(ds.columns) # ['id', 'product', 'revenue', 'date', ...]
# Build and execute queries
result = (ds
.select("product", "revenue", "date")
.filter(ds.revenue > 1000)
.filter(ds.date >= "2024-01-01")
.sort("revenue", ascending=False)
.limit(10)
.to_df())
print(result)
URI-based Creation (For Remote Sources)
For cloud storage and databases, use URI strings with automatic type inference:
from datastore import DataStore
# S3 with anonymous access
ds = DataStore.uri("s3://bucket/data.parquet?nosign=true")
result = ds.select("*").limit(10).to_df()
# MySQL with connection string
ds = DataStore.uri("mysql://root:pass@localhost:3306/mydb/users")
result = ds.select("*").filter(ds.active == True).to_df()
# PostgreSQL
ds = DataStore.uri("postgresql://user:pass@localhost:5432/mydb/products")
result = ds.select("*").to_df()
Supported URI formats:
- Local files:
file:///path/to/data.csvor/path/to/data.csv - S3:
s3://bucket/key - Google Cloud Storage:
gs://bucket/path - Azure Blob Storage:
az://container/blob - HDFS:
hdfs://namenode:port/path - HTTP/HTTPS:
https://example.com/data.json - MySQL:
mysql://user:pass@host:port/database/table - PostgreSQL:
postgresql://user:pass@host:port/database/table - MongoDB:
mongodb://user:pass@host:port/database.collection - SQLite:
sqlite:///path/to/db.db?table=tablename - ClickHouse:
clickhouse://host:port/database/table - Delta Lake:
deltalake:///path/to/table - Apache Iceberg:
iceberg://catalog/namespace/table - Apache Hudi:
hudi:///path/to/table
Traditional Way: Factory Methods
You can also use dedicated factory methods for more control:
from datastore import DataStore
# Query local files
ds = DataStore.from_file("data.parquet")
result = ds.select("*").filter(ds.age > 18).execute()
# Query S3
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
result = ds.select("name", "age").limit(10).execute()
# Query MySQL
ds = DataStore.from_mysql(
host="localhost:3306",
database="mydb",
table="users",
user="root",
password="pass"
)
result = ds.select("*").filter(ds.active == True).execute()
# Build complex queries with method chaining
query = (ds
.select("name", "age", "city")
.filter(ds.age > 18)
.filter(ds.city == "NYC")
.sort("name")
.limit(10))
# Generate SQL
print(query.to_sql())
# Output: SELECT "name", "age", "city" FROM mysql(...)
# WHERE ("age" > 18 AND "city" = 'NYC')
# ORDER BY "name" ASC LIMIT 10
# Execute query
result = query.execute()
# exec() is an alias for execute() - use whichever you prefer
result = query.exec() # Same as execute()
Working with Expressions
from datastore import Field, Sum, Count, col
# Arithmetic operations
ds.select(
ds.price * 1.1, # 10% price increase
(ds.revenue - ds.cost).as_("profit")
)
# Aggregate functions (traditional style)
ds.groupby("category").select(
Field("category"),
Sum(Field("amount"), alias="total"),
Count("*", alias="count")
)
# Aggregate functions (SQL-style with agg())
ds.groupby("region").agg(
total_revenue=col("revenue").sum(),
avg_quantity=col("quantity").mean(),
order_count=col("order_id").count()
)
# Pandas-style aggregation
ds.agg({'amount': 'sum', 'price': ['mean', 'max']})
ClickHouse SQL Functions
DataStore provides 100+ ClickHouse SQL functions through Pandas-like accessors:
# String functions via .str accessor
ds['name'].str.upper() # upper(name)
ds['name'].str.length() # length(name)
ds['text'].str.replace('old', 'new') # replace(text, 'old', 'new')
ds['email'].str.contains('@') # position(email, '@') > 0
# DateTime functions via .dt accessor
ds['date'].dt.year # toYear(date)
ds['date'].dt.month # toMonth(date)
ds['date'].dt.add_days(7) # addDays(date, 7)
ds['start'].dt.days_diff(ds['end']) # dateDiff('day', start, end)
# Math functions as expression methods
ds['value'].abs() # abs(value)
ds['price'].round(2) # round(price, 2)
ds['value'].sqrt() # sqrt(value)
# Type conversion
ds['value'].cast('Float64') # CAST(value AS Float64)
ds['id'].to_string() # toString(id)
# Aggregate functions
ds['amount'].sum() # sum(amount)
ds['price'].avg() # avg(price)
ds['user_id'].count_distinct() # uniq(user_id)
# Column assignment with functions (lazy evaluation)
ds['upper_name'] = ds['name'].str.upper()
ds['age_group'] = ds['age'] // 10 * 10
⚠️ Important: Lazy Column Assignment Unlike pandas, DataStore is a lazy evaluation engine. Column assignments using
ds['col'] = ...are lazy - they are recorded but not executed immediately. The operations are applied when you execute the data withto_df(),execute(), or access properties likeshape, or trigger repr() or str() likeprint(ds)or justdsin IPython or Jupyter Notebook.ds['new_col'] = ds['old_col'] * 2 # Recorded (lazy) print(ds.to_sql()) # Won't show new_col in SQL yet result = ds.to_df() # NOW it executes and applies assignment print(result.columns) # Will include 'new_col'For immutable column creation that returns a new DataStore, use
assign():ds2 = ds.assign(new_col=lambda x: x['old_col'] * 2) # Returns new DataStore
See Function Reference for the complete list of 100+ functions.
Working with Results
DataStore provides convenient methods to get results as pandas DataFrames or dictionaries:
# Get results as DataFrame (simplified)
df = ds.select("*").filter(ds.age > 18).to_df()
# Get results as list of dictionaries (simplified)
records = ds.select("*").filter(ds.age > 18).to_dict()
# Traditional way (also supported)
result = ds.select("*").execute()
df = result.to_df()
records = result.to_dict()
# Access raw result metadata
result = ds.select("*").execute()
print(result.column_names) # ['id', 'name', 'age']
print(result.row_count) # 42
print(result.rows) # List of tuples
Working with Existing DataFrames
Use from_df() or from_dataframe() to wrap an existing pandas DataFrame and leverage DataStore's query building, SQL operations, and lazy execution:
import pandas as pd
from datastore import DataStore
# Create or load a DataFrame
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'department': ['Sales', 'Engineering', 'Sales', 'Marketing']
})
# Wrap with DataStore
ds = DataStore.from_df(df, name='employees')
# Use DataStore features: filtering, SQL, lazy operations
result = ds.filter(ds.age > 26).to_df()
# Execute SQL on DataFrame via chDB
result = ds.sql('age > 28 ORDER BY name').to_df()
# Complex query mixing SQL and pandas
ds['salary_band'] = (ds.age // 10) * 10000 # Lazy column assignment
result = (ds
.filter(ds.department == 'Sales')
.sql('salary_band >= 30000')
.select('name', 'age', 'salary_band')
.to_df())
# Alias: from_dataframe() works the same way
ds = DataStore.from_dataframe(df, name='employees')
Key benefits:
- Apply SQL queries to in-memory DataFrames via chDB's Python() table function
- Mix SQL and pandas operations in any order
- Use explain() to see the execution plan
- Leverage 100+ ClickHouse SQL functions on DataFrame data
Pandas DataFrame Compatibility
DataStore now includes comprehensive pandas DataFrame API compatibility (209 DataFrame methods + 56 str accessor + 42 dt accessor), allowing you to use all pandas methods directly:
# All pandas properties work
print(ds.shape) # (rows, columns)
print(ds.columns) # Column names
print(ds.dtypes) # Data types
print(ds.values) # NumPy array
# All pandas statistical methods
ds.mean() # Mean values
ds.median() # Median values
ds.std() # Standard deviation
ds.corr() # Correlation matrix
ds.describe() # Statistical summary
# All pandas data manipulation methods
ds.drop(columns=['col1'])
ds.rename(columns={'old': 'new'})
ds.sort_values('column', ascending=False)
ds.fillna(0)
ds.dropna()
ds.drop_duplicates()
ds.assign(new_col=lambda x: x['col1'] * 2)
# Advanced operations
ds.pivot_table(values='sales', index='region', columns='product')
ds.melt(id_vars=['id'], value_vars=['col1', 'col2'])
ds.merge(other_ds, on='id', how='left')
ds.groupby('category').agg({'amount': 'sum', 'count': 'count'})
# Column selection (pandas style)
ds['column'] # Single column
ds[['col1', 'col2']] # Multiple columns
# Convenience methods
first_5 = ds.head() # First 5 rows
last_5 = ds.tail() # Last 5 rows
sample = ds.sample(n=100, random_state=42)
# Mix SQL-style and pandas operations - arbitrary order!
result = (ds
.select('*')
.filter(ds.price > 100) # SQL-style filtering
.assign(revenue=lambda x: x['price'] * x['quantity']) # Pandas assign
.filter(ds.revenue > 1000) # SQL on DataFrame!
.add_prefix('sales_') # Pandas transform
.query('sales_revenue > 5000') # Pandas query
.select('sales_id', 'sales_revenue')) # SQL on DataFrame again!
# Export to various formats
ds.to_csv('output.csv')
ds.to_json('output.json')
ds.to_parquet('output.parquet')
ds.to_excel('output.xlsx')
See Pandas Compatibility Guide for the complete feature checklist and examples.
NumPy Compatibility
DataStore is fully compatible with NumPy, allowing direct use with all common NumPy functions:
import numpy as np
from datastore import DataStore
ds = DataStore.from_file("data.csv")
# ✅ All NumPy functions work directly - no conversion needed!
np.mean(ds['column']) # Compute mean
np.std(ds['column']) # Standard deviation
np.sum(ds['column']) # Sum
np.min(ds['column']) # Minimum
np.max(ds['column']) # Maximum
np.median(ds['column']) # Median
np.var(ds['column']) # Variance
np.allclose(ds['a'], ds['b']) # Compare columns
np.corrcoef(ds['a'], ds['b']) # Correlation
np.dot(ds['a'], ds['b']) # Dot product
np.percentile(ds['column'], [25, 50, 75]) # Percentiles
np.histogram(ds['column'], bins=10) # Histogram
# Data normalization
values = ds['price']
normalized = (np.asarray(values) - np.mean(values)) / np.std(values)
# SQL filtering + NumPy computation
filtered = ds.filter(ds['age'] > 25)
mean_salary = np.mean(filtered['salary'])
Key features:
__array__interface implemented for seamless NumPy integration.valuesproperty and.to_numpy()method available (pandas compatible)- All statistical methods (
mean(),sum(),std(), etc.) accept NumPy-style parameters - Same usage experience as Pandas DataFrame/Series
See NUMPY_QUICK_REFERENCE.md for complete compatibility list.
Conditions
# Simple conditions
ds.filter(ds.age > 18)
ds.filter(ds.status == "active")
# where() is an alias for filter() - use whichever you prefer
ds.where(ds.age > 18) # Same as filter()
# Complex conditions
ds.filter(
((ds.age > 18) & (ds.age < 65)) |
(ds.status == "premium")
)
# Negation
ds.filter(~(ds.deleted == True))
Conditional Column Creation (CASE WHEN)
Create columns with conditional logic using when().otherwise(), equivalent to SQL CASE WHEN or np.where()/np.select():
# Simple binary condition (equivalent to np.where)
ds['status'] = ds.when(ds['value'] >= 100, 'high').otherwise('low')
# Multiple conditions (equivalent to np.select)
ds['grade'] = (
ds.when(ds['score'] >= 90, 'A')
.when(ds['score'] >= 80, 'B')
.when(ds['score'] >= 70, 'C')
.otherwise('F')
)
# Using expressions as values
ds['adjusted'] = ds.when(ds['value'] < 0, 0).otherwise(ds['value'] * 2)
# Column as value
ds['max_val'] = ds.when(ds['a'] > ds['b'], ds['a']).otherwise(ds['b'])
This is semantically equivalent to numpy:
# np.where
df['status'] = np.where(df['value'] >= 100, 'high', 'low')
# np.select
conditions = [df['score'] >= 90, df['score'] >= 80, df['score'] >= 70]
df['grade'] = np.select(conditions, ['A', 'B', 'C'], default='F')
Execution Engine: By default, uses chDB SQL engine. Switch to pandas via function_config.use_pandas('when').
Philosophy
Respect pandas expertise. Optimize with modern SQL.
DataStore is built on a simple belief: data scientists shouldn't have to choose between the familiar pandas API and the performance of modern SQL engines. Our approach:
-
Respect Pandas Experience: We deeply respect pandas' API design and user habits. DataStore aims to let you use your existing pandas knowledge with minimal code changes.
-
Lazy Execution for Performance: All operations are lazy by default. Cross-row operations (aggregations, groupby, filters) are compiled into chDB SQL for execution, leveraging ClickHouse's columnar engine optimizations.
-
Cache for Exploration: Exploratory data analysis (EDA) often involves repeated queries on the same data. DataStore caches intermediate results to make your iterative analysis faster.
-
Pragmatic Compatibility: We don't guarantee 100% pandas syntax compatibility—that's not our goal. Instead, we run extensive compatibility tests using
import datastore as pdto ensure you can migrate existing code with minimal changes while gaining chDB's performance benefits.
import datastore as pd # Just change this import!
df = pd.read_csv("employee_data.csv")
# Multi-line operations - all lazy until result is needed
filtered = df[(df['age'] > 25) & (df['salary'] > 50000)]
grouped = filtered.groupby('city')['salary'].agg(['mean', 'sum', 'count'])
sorted_df = grouped.sort_values('mean', ascending=False)
result = sorted_df.head(10)
print(result) # Lazy execution triggered here!
Full SQL compilation - the entire pipeline compiles to a single optimized SQL query:
SELECT city, AVG(salary) AS mean, SUM(salary) AS sum, COUNT(salary) AS count
FROM file('employee_data.csv', 'CSVWithNames')
WHERE age > 25 AND salary > 50000
GROUP BY city ORDER BY mean DESC LIMIT 10
All operations are executed by chDB:
read_csv→file()table function ✅filter→WHEREclause ✅groupby().agg()→GROUP BY+ aggregation functions ✅sort_values→ORDER BY✅head→LIMIT✅
Design principle: API style must not determine execution engine. Both pandas and fluent APIs should compile to the same optimized SQL.
Why faster? With full SQL compilation, DataStore benefits from:
- Columnar storage: Read only needed columns (
city,salary,age) - Predicate pushdown: Filter
age > 25 AND salary > 50000during file scan - Zero-copy data exchange: No redundant copies between pandas and chDB
- Lazy execution: Build entire operation chain, optimize before execution
- Single-pass processing: One SQL query instead of multiple pandas operations
- Vectorized aggregation: C++ based GROUP BY, AVG, SUM in chDB
- Early termination: LIMIT pushdown to avoid processing all rows
Comparison with Similar Libraries
| Feature | DataStore | Polars | DuckDB | Modin |
|---|---|---|---|---|
| API Style | pandas + fluent SQL | New API | SQL-first | pandas drop-in |
| Migration Effort | Low (change import) | High (new API) | High (SQL rewrite) | Low |
| SQL Support | ✅ Full ClickHouse SQL | ⚠️ Limited (SQLContext) | ✅ Full | ❌ |
| File Formats | ✅ 100+ (ClickHouse) | ~10 | ~15 | via pandas |
| Data Sources | ✅ 20+ (S3, DBs, Lakes) | ~5 | ~10 (extensions) | via pandas |
| Zero-Copy pandas | ✅ Native | ❌ (copy required) | ✅ via Arrow | ❌ |
| ClickHouse Functions | ✅ 334 (geo, IP, URL...) | ❌ | ❌ | ❌ |
| Lazy Execution | ✅ Automatic | ⚠️ Manual (LazyFrame) | ✅ Automatic | ❌ Eager |
When to choose DataStore:
- You have existing pandas code and want minimal migration
- You need ClickHouse's 100+ file formats or 20+ data sources
- You want SQL power with pandas comfort
- You need ClickHouse-specific functions (geo, URL, IP, JSON, array, etc.)
When to choose alternatives:
- Polars: Starting fresh, prefer Rust-based DataFrame library, willing to learn new API
- DuckDB: Prefer SQL-first workflow, don't need pandas-style API
- Modin: Need true drop-in replacement with Ray/Dask distributed backend
Prefer explicit fluent API? Same performance, different style:
from datastore import DataStore
ds = DataStore.from_file("employee_data.csv")
result = (ds
.filter((ds.age > 25) & (ds.salary > 50000))
.groupby('city')
.agg({'salary': ['mean', 'sum', 'count']})
.sort_values('salary_mean', ascending=False)
.head(10))
Features
- Fluent API: Pandas-like interface for data manipulation
- Full Pandas Compatibility: 209 DataFrame methods + 56 str accessor + 42 dt accessor (all pandas methods covered)
- ClickHouse Extensions: Additional
.arr,.json,.url,.ip,.geoaccessors with 100+ ClickHouse-specific functions - Full NumPy Compatibility: Direct use with all NumPy functions (mean, std, corrcoef, etc.)
- DataFrame Interchange Protocol: Direct use with seaborn, plotly and other visualization libraries
- Mixed Execution Engine: Arbitrary mixing of SQL(chDB) and pandas operations
- Immutable Operations: Thread-safe method chaining
- Unified Interface: Query files, databases, and cloud storage with the same API
- 20+ Data Sources: Local files, S3, Azure, GCS, HDFS, MySQL, PostgreSQL, MongoDB, Redis, SQLite, ClickHouse, and more
- Data Lake Support: Iceberg, Delta Lake, Hudi table formats
- Format Auto-Detection: Automatically detect file formats from extensions
- SQL Generation: Automatic conversion to optimized SQL queries
- Type-Safe: Comprehensive type hints and validation
- Extensible: Easy to add custom functions and data sources
Supported Data Sources
DataStore supports 20+ data sources through a unified interface:
| Category | Sources | Quick Example |
|---|---|---|
| Local Files | CSV, Parquet, JSON, ORC, Avro + 80 more formats |
DataStore.from_file("data.csv") |
| Cloud Storage | S3, GCS, Azure Blob, HDFS | DataStore.from_s3("s3://bucket/data.parquet") |
| Databases | MySQL, PostgreSQL, ClickHouse, MongoDB, SQLite, Redis |
DataStore.from_mysql(host, db, table) |
| Data Lakes | Apache Iceberg, Delta Lake, Hudi | DataStore.from_delta("s3://bucket/table") |
| Other | HTTP/HTTPS, Number generation, Random data |
DataStore.from_url("https://...") |
Quick Examples
Local Files (auto-detects format):
ds = DataStore.from_file("data.parquet")
ds = DataStore.from_file("data.csv")
ds = DataStore.from_file("data.json")
Cloud Storage:
# S3 with public access
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
# S3 with credentials
ds = DataStore.from_s3("s3://bucket/*.csv",
access_key_id="KEY",
secret_access_key="SECRET")
# Google Cloud Storage
ds = DataStore.from_gcs("gs://bucket/data.parquet")
# Azure Blob Storage
ds = DataStore.from_azure(container="mycontainer",
path="data/*.parquet",
connection_string="...")
Databases:
# MySQL
ds = DataStore.from_mysql("localhost:3306", "mydb", "users",
user="root", password="pass")
# PostgreSQL
ds = DataStore.from_postgresql("localhost:5432", "mydb", "users",
user="postgres", password="pass")
# ClickHouse (remote)
ds = DataStore.from_clickhouse("localhost:9000", "default", "events")
From pandas DataFrame:
import pandas as pd
# Wrap an existing DataFrame
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
ds = DataStore.from_df(df) # or from_dataframe(df)
# Use DataStore features on DataFrame
result = ds.filter(ds.age > 26).to_df()
# Mix SQL and pandas operations
ds['doubled'] = ds.age * 2
result = ds.sql('doubled > 50').to_df()
Data Generation (for testing):
# Number sequence
ds = DataStore.from_numbers(100) # 0-99
# Random data
ds = DataStore.from_random(
structure="id UInt32, name String, value Float64",
random_seed=42
)
📖 For comprehensive examples of all data sources, see examples/examples_table_functions.py
Multi-Source Queries
# Join data from different sources
csv_data = DataStore.from_file("sales.csv", format="CSV")
mysql_data = DataStore.from_mysql("localhost:3306", "mydb", "customers",
user="root", password="pass")
result = (mysql_data
.join(csv_data, left_on="id", right_on="customer_id")
.select("name", "product", "revenue")
.filter(csv_data.date >= '2024-01-01')
.execute())
# Simplified join syntax with USING (when column names match)
users = DataStore.from_file("users.csv")
orders = DataStore.from_file("orders.csv")
products = DataStore.from_file("products.csv")
# Chain multiple joins easily - no table prefix needed!
result = (users
.join(orders, on="user_id") # USING (user_id)
.join(products, on="product_id") # USING (product_id)
.select("name", "amount", "product_name")
.to_df())
# Also supports multiple columns
ds.join(other, on=["user_id", "country"]) # USING (user_id, country)
Format Settings
Optimize performance with format-specific settings:
# CSV settings
ds = DataStore.from_file("data.csv", format="CSV")
ds = ds.with_format_settings(
format_csv_delimiter=',',
input_format_csv_skip_first_lines=1,
input_format_csv_trim_whitespaces=1
)
# Parquet optimization
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
ds = ds.with_format_settings(
input_format_parquet_filter_push_down=1,
input_format_parquet_bloom_filter_push_down=1
)
# JSON settings
ds = DataStore.from_file("data.json", format="JSONEachRow")
ds = ds.with_format_settings(
input_format_json_validate_types_from_metadata=1,
input_format_json_ignore_unnecessary_fields=1
)
Execution Model
Understanding when and how DataStore executes operations is key to using it effectively.
1. Query Building (Lazy)
These operations build the SQL query but don't execute it immediately:
ds = DataStore.from_file("data.csv")
ds = ds.select("name", "age") # Lazy - builds query
ds = ds.filter(ds.age > 18) # Lazy - adds WHERE clause
ds = ds.sort("name") # Lazy - adds ORDER BY
ds = ds.limit(10) # Lazy - adds LIMIT
# Nothing executed yet! Just building the query.
print(ds.to_sql()) # Shows the SQL that will be executed
# Output: SELECT "name", "age" FROM file('data.csv') WHERE "age" > 18 ORDER BY "name" ASC LIMIT 10
All these methods return a new DataStore instance (immutable) without executing any query.
2. Lazy Operations (Recorded)
Column assignments are recorded and applied during execution:
ds['new_col'] = ds['old_col'] * 2 # Recorded, not executed
ds['category'] = ds['value'] // 100 # Recorded, not executed
# Still not executed - new columns won't appear in SQL yet
print(ds.to_sql()) # Won't include new_col or category
See the warning box in Column Assignment for details.
3. Execution (Eager)
These operations trigger immediate query execution:
# Execute and get different result formats
result = ds.execute() # Returns QueryResult object
df = ds.to_df() # Returns pandas DataFrame
records = ds.to_dict() # Returns list of dictionaries
# These also trigger execution
shape = ds.shape # Executes to count rows/columns
cols = ds.columns # Executes to get column names
stats = ds.describe() # Executes and computes statistics
first_5 = ds.head() # Executes and returns first 5 rows
Best Practice: Push Operations to SQL
For optimal performance, keep operations in the SQL layer (lazy) as long as possible:
# ✅ Good: Everything pushed to SQL (fast)
result = (ds
.select('name', 'age', 'city')
.filter(ds.age > 18)
.filter(ds.city == 'NYC')
.sort('name')
.limit(100)
.to_df()) # Single query execution
# ❌ Bad: Executes early, filters in pandas (slow)
df = ds.to_df() # Loads ALL data into memory
df = df[df['age'] > 18]
df = df[df['city'] == 'NYC']
df = df.sort_values('name')
df = df.head(100)
Query Reuse
DataStore is immutable (except column assignment), so you can reuse query objects:
# Build base query once
base_query = ds.select("*").filter(ds.status == "active")
# Create different queries from the same base
recent = base_query.filter(ds.date >= '2024-01-01').to_df()
high_value = base_query.filter(ds.value > 1000).to_df()
summary = base_query.groupby('category').agg({'value': 'sum'}).to_df()
# Each executes independently without affecting others
Mixed Execution
DataStore supports mixing SQL and pandas operations:
result = (ds
.select('*')
.filter(ds.price > 100) # SQL filter
.assign(revenue=lambda x: x['price'] * x['quantity']) # Pandas operation
.sql("revenue > 1000") # SQL filter on new column with chDB (after pandas)
.to_df())
# Execution flow:
# 1. Execute SQL: SELECT * FROM ... WHERE price > 100
# 2. Apply pandas: add revenue column
# 3. Apply SQL filter: SELECT * FROM ... WHERE revenue > 1000
# 4. Return result, triggered by `to_df()`
Profiling Performance
DataStore includes built-in profiling capabilities to analyze execution performance:
from datastore import DataStore, enable_profiling, disable_profiling, get_profiler
# Enable profiling
enable_profiling()
ds = DataStore.from_file("data.csv")
result = (ds
.filter(ds.age > 25)
.groupby("department")
.agg({"salary": "mean"})
.to_df())
# Get profiling report
profiler = get_profiler()
profiler.report() # Print detailed timing breakdown
# Disable when done
disable_profiling()
See Profiling Guide for detailed usage.
DataFrame Interchange Protocol
DataStore implements the DataFrame Interchange Protocol (__dataframe__), enabling direct use with visualization libraries:
import seaborn as sns
from datastore import DataStore
ds = DataStore.from_file("data.csv")
# Use DataStore directly with seaborn - no conversion needed!
sns.scatterplot(data=ds, x="age", y="salary", hue="department")
sns.barplot(data=ds, x="category", y="value")
# Also works with plotly and other libraries supporting the protocol
import plotly.express as px
px.scatter(ds, x="age", y="salary", color="department")
Design Philosophy
DataStore is inspired by pypika's excellent query builder design but focuses on:
- High-level API: Pandas-like interface for data scientists
- Query Execution: Built-in execution capabilities (not just SQL generation)
- Data Source Abstraction: Unified interface across different backends
- Modern Python: Type hints, dataclasses, and Python 3.8+ features
Key Design Patterns
1. Immutability via @immutable Decorator
from datastore.utils import immutable
class DataStore:
@immutable
def select(self, *fields):
self._select_fields.extend(fields)
# Decorator handles copying and returning new instance
2. Operator Overloading
# Natural Python syntax
ds.age > 18 # BinaryCondition('>', Field('age'), Literal(18))
ds.price * 1.1 # ArithmeticExpression('*', Field('price'), Literal(1.1))
(cond1) & (cond2) # CompoundCondition('AND', cond1, cond2)
3. Smart Value Wrapping
Expression.wrap(42) # Literal(42)
Expression.wrap("hello") # Literal("hello")
Expression.wrap(None) # Literal(None)
Expression.wrap(Field('x'))# Field('x') (unchanged)
Development
Setup
# Install dev dependencies and pre-commit hooks
make install-dev
# Or manually install pre-commit hooks
make pre-commit-install
Pre-commit hooks will automatically run check-charset, black, and flake8 before each commit.
Running Tests
# Run all tests
python -m pytest datastore/tests/
# Run specific test file
python -m pytest datastore/tests/test_expressions.py
# Run with coverage
python -m pytest --cov=datastore datastore/tests/
# Generate HTML coverage report
python -m pytest --cov=datastore --cov-report=html datastore/tests/
# Open htmlcov/index.html in browser to view detailed coverage
Running Individual Test Modules
# Test expressions
python -m unittest datastore.tests.test_expressions
# Test conditions
python -m unittest datastore.tests.test_conditions
# Test functions
python -m unittest datastore.tests.test_functions
# Test core DataStore
python -m unittest datastore.tests.test_datastore_core
Roadmap
Alpha release
- Core expression system
- Condition system
- Function system
- Basic DataStore operations
- Immutability support
- ClickHouse table functions and formats support
- DataFrame operations (drop, assign, fillna, etc.) see Pandas Compatibility Guide
- Query executors
- ClickHouse SQL functions support (100+ functions via
.str,.dtaccessors) see Function Reference - Hybrid execution engine (configurable chDB/Pandas execution)
- Update and Save back data
- Chart support
- More data exploration functions, faster describe()
- Multiple backend support
- Mock data support
- Schema management(infer or set manually)
- Connection managers
- MCP for data science functions
- Support 'datastore.core.DataStore' in VSCode Data Wrangler
Beta release
- Unstructured data support(Images, Audios as a column)
- Arrow Table support (read/write directly)
- Embedding Generation support
- PyTorch DataLoader integration
- Python native UDFs support
- Hybrid Execution (Local and Remote)
Documentation
User Guides
- 🚀 Pandas Migration Guide - Step-by-step guide for pandas users to get started
- Function Reference - Complete list of 334 ClickHouse SQL functions with examples
- Pandas Compatibility Guide - 209 pandas DataFrame methods + accessors
- NumPy Compatibility - Full NumPy function compatibility guide
- Profiling Guide - Performance analysis and profiling
- Explain Method - Understanding execution plans
- Factory Methods - Creating DataStore from various sources
Developer Guides
- Architecture & Design - Core design principles and development philosophy
Examples
For more comprehensive examples, see:
- examples/examples_table_functions.py - Complete examples for all data sources including:
- Local files (CSV, Parquet, JSON, ORC, Avro and 80+ formats)
- Cloud storage (S3, Azure, GCS, HDFS, HTTP and 20+ protocols)
- Databases (MySQL, PostgreSQL, MongoDB, Redis, SQLite, ClickHouse)
- Data lakes (Iceberg, Delta Lake, Hudi)
- Data generation (numbers, random data)
- Multi-source joins
- Format-specific optimization settings
License
Apache License 2.0
Credits
Built with and inspired by:
- chDB - Embedded ClickHouse engine for Python
- ClickHouse - Fast open-source OLAP database
- Pandas - DataFrame API design
- PyPika - Query builder patterns
- SQLAlchemy - ORM and query builder concepts
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file chdb_ds-0.0.8.tar.gz.
File metadata
- Download URL: chdb_ds-0.0.8.tar.gz
- Upload date:
- Size: 1.2 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
73083ce045e5ad64608586254c046c307a6d0fd5faebd4442c325424e69985da
|
|
| MD5 |
5ba37d6d2b9a2e909ff5b9e3f8e489e2
|
|
| BLAKE2b-256 |
194c7723450e5095639b592d65552a29e95409bbb8d04a05a80dea26358c6b09
|
Provenance
The following attestation bundles were made for chdb_ds-0.0.8.tar.gz:
Publisher:
publish.yml on auxten/chdb-ds
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
chdb_ds-0.0.8.tar.gz -
Subject digest:
73083ce045e5ad64608586254c046c307a6d0fd5faebd4442c325424e69985da - Sigstore transparency entry: 830505274
- Sigstore integration time:
-
Permalink:
auxten/chdb-ds@4346ff0098432247370efc91efb629c3768e8e25 -
Branch / Tag:
refs/tags/v0.0.8 - Owner: https://github.com/auxten
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@4346ff0098432247370efc91efb629c3768e8e25 -
Trigger Event:
push
-
Statement type:
File details
Details for the file chdb_ds-0.0.8-py3-none-any.whl.
File metadata
- Download URL: chdb_ds-0.0.8-py3-none-any.whl
- Upload date:
- Size: 446.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5e5472206a3e7b031bfc5ed5a590fb71a15c1aa80dab5456864a609ab371659c
|
|
| MD5 |
d401380b7af28fee06f8a8ea8065c5b1
|
|
| BLAKE2b-256 |
586097634e84ae7c38f3deacb0cae68520808ba63b1ce7fbe9c73634d3defe35
|
Provenance
The following attestation bundles were made for chdb_ds-0.0.8-py3-none-any.whl:
Publisher:
publish.yml on auxten/chdb-ds
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
chdb_ds-0.0.8-py3-none-any.whl -
Subject digest:
5e5472206a3e7b031bfc5ed5a590fb71a15c1aa80dab5456864a609ab371659c - Sigstore transparency entry: 830505276
- Sigstore integration time:
-
Permalink:
auxten/chdb-ds@4346ff0098432247370efc91efb629c3768e8e25 -
Branch / Tag:
refs/tags/v0.0.8 - Owner: https://github.com/auxten
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@4346ff0098432247370efc91efb629c3768e8e25 -
Trigger Event:
push
-
Statement type: