AI-powered data science co-pilot using Claude — explore data, design projects, generate code, and brainstorm from anywhere Python runs.
Project description
DataSpark — AI-Powered Data Science Co-Pilot
A Python library that brings Claude's data science expertise into your local workflow — Jupyter notebooks, scripts, terminal, anywhere Python runs.
No browser needed. No logging in. Just import and go.
Quick Start
1. Install
pip install dataspark-ai # from PyPI
# or
pip install dataspark-ai[full] # includes sklearn, matplotlib, seaborn, plotly, scipy
# or from source
git clone https://github.com/KTG0409/dataspark.git
cd dataspark && pip install -e ".[dev]"
2. Set your API key
export ANTHROPIC_API_KEY="sk-ant-api03-..."
Get a key at console.anthropic.com/settings/keys
3. Use It
from dataspark import Spark
spark = Spark()
# Explore a dataset — get instant analysis, quality checks, and recommendations
spark.explore("sales_data.csv")
# Ask any data science question
spark.ask("Should I use one-hot encoding or target encoding for high-cardinality categoricals?")
# Design a complete project
spark.project("Build a customer churn prediction model for our SaaS platform")
# Brainstorm creative analysis ideas
spark.brainstorm("I have 3 years of e-commerce transaction data with 2M rows")
# Generate production code
spark.code("Build a feature engineering pipeline for time series with lag features and rolling stats")
# Get best practices guidance
spark.best_practices("Cross-validation strategies for time series data")
# Interactive session (like chatting with Claude)
spark.chat()
Core Features
spark.explore(source) — Dataset Analysis
Pass a CSV, Excel file, DataFrame, or URL. DataSpark will:
- Profile every column (types, distributions, outliers, correlations)
- Flag data quality issues
- Recommend specific analyses based on what it sees
- Ask you clarifying questions about your goals
- Provide ready-to-run code snippets
spark.explore("customers.csv")
spark.explore("https://data.example.com/dataset.csv")
spark.explore(my_dataframe, name="revenue")
# Focus on something specific
spark.explore("data.csv", focus="I need to predict the 'churned' column")
spark.project(description) — Project Design
Describe what you want to build. DataSpark designs the full pipeline:
spark.project("Forecast demand for 500 SKUs across 12 warehouses, daily granularity")
spark.project("Build a recommendation engine for our content platform")
spark.project("Anomaly detection for network traffic logs, ~10M events/day")
spark.brainstorm(context) — Idea Generation
Get creative, ranked ideas from quick wins to big bets:
spark.brainstorm("We have clickstream data, purchase history, and customer support tickets")
spark.brainstorm("Our marketing team wants to understand campaign attribution")
spark.code(request) — Code Generation
Get complete, production-quality Python code:
spark.code("XGBoost pipeline with Optuna hyperparameter tuning")
spark.code("Automated EDA function that generates a PDF report")
spark.code("FastAPI endpoint that serves predictions from a pickled model")
spark.ask(question) — Ask Anything
Maintains conversation history so you can have a back-and-forth:
spark.ask("What's the best way to handle class imbalance?")
spark.ask("Show me how to implement SMOTE with that approach")
spark.ask("Now how do I evaluate it properly?")
spark.chat() — Interactive Terminal Session
Full interactive mode with slash commands:
/explore data.csv — Load and analyze a dataset
/project <desc> — Design a project
/brainstorm <ctx> — Generate ideas
/code <request> — Generate code
/model sonnet — Switch models
/save conversation.md — Save chat history
/clear — Reset context
/help — Show commands
/quit — Exit
Configuration
# Model selection (default: Claude Sonnet)
spark = Spark(model="opus") # Most capable
spark = Spark(model="sonnet") # Balanced (default)
spark = Spark(model="haiku") # Fastest / cheapest
# Longer responses
spark = Spark(max_tokens=8192)
# Debug mode
spark = Spark(verbose=True)
Command-Line Usage
# Interactive chat
dataspark
# Explore a dataset
dataspark explore data.csv
dataspark explore data.csv -f "focus on the target variable"
# Quick question
dataspark ask "When should I use Ridge vs Lasso?"
# Project design
dataspark project "Build a fraud detection system"
# Use a specific model
dataspark -m opus explore big_dataset.parquet
Jupyter Notebook Tips
from dataspark import Spark
spark = Spark()
# Load data through spark — it profiles automatically
df = spark.load("data.csv")
# Now all questions are context-aware
spark.ask("What feature engineering should I do?")
spark.ask("Write the code for that")
# You can also explore at any point
spark.explore(focus="relationships between price and demand")
Architecture
dataspark/
├── __init__.py # Clean exports
├── core.py # Spark class — main interface & API calls
├── explorer.py # DataExplorer — load & profile datasets
├── profiles.py # DataProfile — statistical profiling
├── prompts.py # System prompts for each mode
└── cli.py # Command-line interface
The library works by:
- Profiling your data locally (pandas — nothing leaves your machine except the summary)
- Building rich context from the profile (statistics, distributions, quality issues)
- Sending that context + your question to Claude via the API
- Maintaining conversation history so follow-ups are contextual
Your raw data never leaves your machine. Only statistical summaries and column metadata are sent to the API.
Extending DataSpark
Custom System Prompts
from dataspark import Spark
spark = Spark()
spark._current_system = """You are a financial data science expert.
Focus on: regulatory compliance, risk modeling, backtesting.
Always consider: data leakage, survivorship bias, look-ahead bias."""
spark.ask("How should I backtest this trading strategy?")
Adding Data Context Manually
spark._data_context = """
We have a PostgreSQL database with:
- transactions (50M rows, 3 years)
- customers (2M rows)
- products (10K SKUs)
Business: B2B SaaS, $50M ARR, 15% annual churn
"""
spark.ask("What analyses would have the most business impact?")
Cost Awareness
API costs per ~1000 tokens (approximate):
| Model | Input | Output |
|---|---|---|
| Haiku | $0.001 | $0.005 |
| Sonnet | $0.003 | $0.015 |
| Opus | $0.015 | $0.075 |
A typical explore() call uses ~2-4K tokens. An interactive session might use 10-50K tokens total.
Use spark = Spark(model="haiku") for cost-sensitive workloads.
Privacy & Security
- Your raw data stays local. Only statistical summaries (means, distributions, column names, 3 sample rows) are sent to the API.
- API key is yours. Use a personal key — it's billed to your Anthropic account, not tied to any employer.
- No logging by default. Conversations are in-memory only unless you
/savethem. - Review what's sent. Call
spark.explorer.context_for_llm()to see exactly what goes to the API.
License
MIT — use it however you want.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file dataspark_ai-0.1.1.tar.gz.
File metadata
- Download URL: dataspark_ai-0.1.1.tar.gz
- Upload date:
- Size: 21.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c6d99e5e1324332c6c949a11f7f18f23c7be0d2fbfb4e6291a2a8cd64b63c66e
|
|
| MD5 |
13548fd3a2879f354511aaa3b1db9f26
|
|
| BLAKE2b-256 |
42a0e4283449798febd4fbd3b02298c798b35e569c8ac218ebf228bf48c55bd3
|
File details
Details for the file dataspark_ai-0.1.1-py3-none-any.whl.
File metadata
- Download URL: dataspark_ai-0.1.1-py3-none-any.whl
- Upload date:
- Size: 18.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0fbf72dc5aa4af36376e5da9db000727620cbe6d14d580a4f0a4045e84d82d14
|
|
| MD5 |
b9bd60e114ce9e3d44eafb800236b1f7
|
|
| BLAKE2b-256 |
17027995c06c8c4a6b09adb3b9223154d7186d23d5711e5162d1755761468ecd
|