Command-line tool for querying CSV and Parquet data with natural language
Project description
DataTalk CLI
Ask questions to your CSV & Parquet files in plain English, right from your terminal.
DataTalk CLI eliminates the need to remember complex command-line flags or write SQL for quick data checks. Just ask your question in natural language, and get instant answers—all while keeping your data 100% local and private.
Why DataTalk?
Working with data files on the command line has always involved a trade-off:
- GUI tools (Excel, Tableau) are slow for large files and force you to leave your terminal workflow
- Powerful CLI tools (qsv, xsv) require memorizing dozens of commands and flags
- SQL-based tools (csvcli, DuckDB CLI) demand writing formal queries for simple questions
DataTalk eliminates this friction. It's a conversational data terminal that lets you:
- Stay in your terminal workflow without context switching
- Express your intent directly in natural language
- Get accurate, deterministic results computed locally
- Query both CSV and Parquet files with the same interface
Perfect for developers, data engineers, and analysts who value speed and terminal efficiency.
How It Works & Data Privacy
DataTalk is designed with a privacy-first, local-first architecture:
- Your data stays on your machine - Files are loaded into a local DuckDB instance
- Only metadata is sent to the LLM - The AI receives your question and the file schema (column names and types)
- AI generates SQL - The LLM creates a SQL query based on your question
- Query runs locally - DuckDB executes the query on your local data
- You see the results - Clean, formatted output in your terminal
What is sent to the LLM:
- Your natural language question
- Column names and data types
- A few sample values for context (optional)
What is NOT sent:
- Your actual data rows
- Sensitive information
- Full file contents
This approach gives you the convenience of natural language while maintaining full control over your data privacy.
Key Features
- 100% Local Processing - Data never leaves your machine
- Accurate Calculations - No AI hallucinations; real SQL on real data
- Token Efficient - Only schema metadata sent to LLM, saving costs
- Multiple Formats - Seamless support for CSV and Parquet files
- No Coding Required - Query data without pandas or SQL knowledge
- Transparent - Optional
--show-sqlflag to see generated queries - Smart Suggestions - Context-aware follow-up questions
Installation
pip install datatalk-cli
That's it! The command-line tool dtalk will be available immediately.
Requirements:
- Python 3.9 or higher
- An OpenAI or Azure OpenAI API key
Quick Start
Get up and running in 60 seconds:
-
Install DataTalk:
pip install datatalk-cli
-
Set your API key:
export OPENAI_API_KEY="your-key-here"
-
Download sample data and try it:
curl -O https://raw.githubusercontent.com/vtsaplin/datatalk/main/sample_data/sales_data.csv dtalk sales_data.csv "What are the top 5 products by revenue?"
That's it! You just queried data using plain English.
Configuration
DataTalk supports both OpenAI and Azure OpenAI. Choose your preferred method:
Option 1: Environment Variables (Recommended)
For OpenAI:
export OPENAI_API_KEY="your-api-key-here"
export OPENAI_MODEL="gpt-4o" # Optional, defaults to gpt-4o
For Azure OpenAI:
export AZURE_DEPLOYMENT_TARGET_URL="https://your-resource.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-12-01-preview"
export AZURE_OPENAI_API_KEY="your-api-key-here"
Option 2: Configuration File
Create a .env file in your project directory with the same variables as above.
Option 3: Interactive Setup
Simply run dtalk with a data file, and the tool will prompt you to enter your credentials. They'll be saved to ~/.config/datatalk/config.json for future use.
Configuration Management
dtalk --config-info # Show current configuration
dtalk --reset-config # Clear saved configuration
Usage
DataTalk offers two modes: interactive for exploration and direct query for scripting.
Interactive Mode
Launch an interactive session to ask multiple questions:
dtalk sales_data.csv
You'll see:
- Basic dataset statistics (row count, columns)
- AI-generated suggested questions
- An interactive prompt where you can ask anything
Type quit, exit, or press Ctrl+D to exit.
Direct Query Mode
Run a single query and exit—perfect for scripts and automation:
dtalk sales_data.csv "What were total sales in Q4?"
Examples: Basic Queries
Get started with these common questions:
File Structure & Overview:
dtalk employees.csv "How many rows are in this file?"
dtalk inventory.csv "Show me the column names and their data types"
dtalk customers.csv "Display the first 10 rows"
Simple Statistics:
dtalk sales_data.csv "What is the average order value?"
dtalk employees.csv "How many employees were hired each year?"
dtalk inventory.csv "Count the number of unique product categories"
Examples: Filtering & Sorting
Filtering Data:
dtalk customers.csv "Show all customers from Canada"
dtalk sales_data.csv "List orders over $1000 from the last month"
dtalk employees.csv "Find all engineers in the San Francisco office"
Sorting & Top N:
dtalk sales_data.csv "Top 10 customers by total purchase amount"
dtalk inventory.csv "Show the 5 most expensive products, descending"
dtalk employees.csv "List employees by hire date, newest first"
Examples: Aggregations & Analysis
Grouping & Aggregation:
dtalk sales_data.csv "Total revenue by product category"
dtalk customers.csv "Count customers by country and city"
dtalk employees.csv "Average salary by department and job title"
Time-Series Analysis:
dtalk sales_data.csv "Monthly revenue trend for 2024"
dtalk orders.csv "Compare sales between Q3 and Q4"
dtalk events.csv "Daily active users for the past week"
Multi-Column Analysis:
dtalk sales_data.csv "Which region has the highest average order value?"
dtalk employees.csv "Show department with the most growth in the last year"
dtalk inventory.csv "What's the profit margin by category and supplier?"
Examples: Working with Different Formats
DataTalk seamlessly handles both CSV and Parquet files:
# CSV files
dtalk data.csv "your question"
# Parquet files (same syntax!)
dtalk data.parquet "your question"
# Large Parquet files are fast
dtalk large_dataset.parquet "Count distinct users"
Advanced Features
See the Generated SQL
Want to learn SQL or verify the query? Use the --show-sql flag:
dtalk sales_data.csv "Top 5 products by revenue" --show-sql
Output will include:
SQL: SELECT product_name, SUM(revenue) as total_revenue
FROM events GROUP BY product_name
ORDER BY total_revenue DESC LIMIT 5
This is perfect for:
- Learning SQL patterns
- Debugging unexpected results
- Auditing what queries are generated
- Reusing queries in other tools
Track Token Usage
Monitor your API costs with --show-tokens:
dtalk data.csv "your question" --show-tokens
Displays:
- Number of API requests made
- Input tokens used
- Output tokens used
- Total tokens consumed
Control Output Visibility
Fine-tune what information is displayed:
# Hide detailed dataset schema
dtalk data.csv --hide-schema
# Hide raw query results (useful when scripting)
dtalk data.csv "query" --hide-data
# Hide AI-generated question suggestions
dtalk data.csv --hide-suggestions
Scripting & Automation
DataTalk works great in shell scripts and pipelines:
#!/bin/bash
# Daily sales report script
REPORT=$(dtalk sales.csv "What was yesterday's total revenue?" --hide-data -q)
echo "Daily Sales Report: $REPORT"
# Email or post to Slack
echo "$REPORT" | mail -s "Daily Sales" team@company.com
Combine with other Unix tools:
# Run queries on multiple files
for file in data_*.csv; do
echo "Analyzing $file..."
dtalk "$file" "row count"
done
# Chain with grep to filter results
dtalk sales.csv "show all orders" | grep "Canada"
Use Cases
DataTalk shines in these scenarios:
Quick Data Checks
- "How many rows are in this file?" before loading it into a database
- "What's the date range of this dataset?" for validation
- "Show me sample rows" to verify data format
Ad-Hoc Analysis
- Exploring unfamiliar datasets without writing code
- Quick statistics during incident response
- Spot-checking data exports from other systems
Data Engineering Workflows
- Validating ETL outputs
- Checking data quality after transformations
- Quick profiling of incoming data files
Learning & Documentation
- Using
--show-sqlto learn SQL patterns - Documenting common queries in natural language
- Creating repeatable analysis commands for team members
Comparison with Other Tools
| Tool | Interface | Query Method | Best For | DataTalk Advantage |
|---|---|---|---|---|
| DataTalk | CLI | Natural Language | Terminal users who value speed | Zero cognitive overhead; just ask |
| qsv | CLI | Commands/Flags | High-performance data wrangling | Easier for ad-hoc questions; no flags to remember |
| csvcli | CLI | SQL | Analysts who know SQL | Natural language for those who don't know SQL |
| DuckDB CLI | CLI | SQL | Complex analytical queries | Simpler for quick questions |
| pandas | Python | Code | Programmers, complex analysis | No coding required; instant results |
| Excel | GUI | Point & click | Non-technical users | Works with huge files; stays in terminal |
When to use DataTalk vs. alternatives:
- Use DataTalk for quick, explorative questions in your terminal workflow
- Use qsv/DuckDB for complex transformations or when you need maximum performance
- Use pandas when you need programmatic control or complex custom logic
- Use Excel/Tableau for visualization and interactive dashboards
Roadmap
Planned features and improvements:
Local LLM Support (High Priority)
- Integration with Ollama for 100% offline, private querying
- Zero API costs for sensitive data analysis
Query Transparency
--dry-runmode to see generated query without execution- Query history and replay functionality
Enhanced Output
--output json|csv|markdownfor better scriptability- Basic visualization generation ("plot sales by month")
Advanced Capabilities
- Multi-file queries ("compare revenue between sales_2023.csv and sales_2024.csv")
- Saved query templates for common patterns
- Integration with other CLI tools via pipes
Development
Clone and run locally:
git clone https://github.com/vtsaplin/datatalk.git
cd datatalk
uv run dtalk --help
Testing with Sample Data
Test with included sample files in interactive mode:
uv run dtalk sample_data/sales_data.csv
uv run dtalk sample_data/employees.csv
uv run dtalk sample_data/inventory.csv
Run direct queries:
uv run dtalk sample_data/employees.csv -p "Show hiring patterns by year"
uv run dtalk sample_data/sales_data.csv -p "What are the top categories?" --show-sql
Running Tests
uv sync --extra test
uv run pytest
Building & Releasing
Generate demo recording:
vhs demo.tape
Build package:
python -m build
Release (maintainers only):
./release.sh
FAQ
Q: Is my data sent to OpenAI/Azure?
A: No. Only the schema (column names and types) and your question are sent. Your actual data rows stay completely local.
Q: How much does it cost to use?
A: You need an OpenAI or Azure OpenAI API key. Each query typically uses 200-500 tokens (~$0.001-0.005 per query with GPT-4). Use --show-tokens to monitor usage.
Q: Can I use this offline or with local LLMs?
A: Local LLM support (via Ollama) is on our roadmap and will be released soon. This will enable 100% offline, private querying.
Q: What's the difference between DataTalk and pandas-ai?
A: pandas-ai is a Python library for use in code/notebooks. DataTalk is a standalone CLI tool for terminal users who want instant results without writing code.
Q: Does this work with databases like PostgreSQL or MySQL?
A: Currently, DataTalk only supports CSV and Parquet files. Database support may be added in future versions.
Q: Can I see what SQL queries are being generated?
A: Yes! Use the --show-sql flag to see the exact SQL query generated by the LLM before it executes.
Q: How large of files can I query?
A: DataTalk uses DuckDB under the hood, which efficiently handles multi-gigabyte files. Parquet files are particularly fast. The limiting factor is your local disk space and RAM.
Q: Can I use this in production scripts?
A: Yes! Use direct query mode (dtalk file.csv "query" --hide-data) in scripts. However, be aware of API rate limits and costs for high-frequency usage.
Troubleshooting
Problem: "Command not found: dtalk"
Solution: Make sure you've installed the package (pip install datatalk-cli) and that your Python scripts directory is in your PATH.
Problem: "No API key found"
Solution: Set your API key with export OPENAI_API_KEY="your-key" or run dtalk --config-info to check your configuration.
Problem: "Query returns unexpected results"
Solution: Use --show-sql to see the generated query. The LLM might be interpreting your question differently than expected. Try rephrasing your question more explicitly.
Problem: "File format not supported"
Solution: DataTalk currently supports only .csv and .parquet files. For Excel files, export to CSV first, or try csvcli which supports Excel natively.
Problem: "Queries are slow"
Solution:
- Large files: Parquet format is faster than CSV for big datasets
- API latency: Most time is spent waiting for the LLM response (1-3 seconds)
- Complex queries: Consider breaking them into smaller questions
Problem: "Rate limit exceeded"
Solution: You've hit your OpenAI API rate limit. Wait a few moments or upgrade your API plan.
Contributing
Contributions are welcome! Here's how you can help:
Report Bugs
Open an issue on GitHub with:
- The command you ran
- Expected vs. actual behavior
- Your Python version and OS
Suggest Features
Check the Roadmap section first, then open an issue describing:
- The use case
- How it would work
- Why it's valuable
Submit Pull Requests
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Make your changes with tests
- Run the test suite (
uv run pytest) - Submit a PR with a clear description
Areas Where We'd Love Help:
- Local LLM integration (Ollama, llama.cpp)
- Additional output formats (JSON, CSV, Markdown)
- Performance optimizations
- Documentation improvements
- Additional example queries
License
MIT License - see the LICENSE file for details.
Acknowledgments
Built with:
- DuckDB - Fast in-process analytical database
- OpenAI API - Natural language understanding
- Rich - Beautiful terminal formatting
- Pandas - Data manipulation
Inspired by the growing ecosystem of AI-powered developer tools and the enduring power of the command line.
Support
- Documentation
- Issue Tracker
- Discussions
- Star this repo if you find it useful!
Made for terminal users who love data
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file datatalk_cli-0.1.20.tar.gz.
File metadata
- Download URL: datatalk_cli-0.1.20.tar.gz
- Upload date:
- Size: 26.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
eb72e66b67dd0ab14b928d4fd6905648769f4104500aca2675ae04d9700e2d80
|
|
| MD5 |
6a44fafef176701a52c8a4d63ec5cd9f
|
|
| BLAKE2b-256 |
f57f5c7db49b46ce36fbc2e97f102ed708b47fb6041bfac36c17258509d1e16c
|
Provenance
The following attestation bundles were made for datatalk_cli-0.1.20.tar.gz:
Publisher:
publish.yml on vtsaplin/datatalk
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
datatalk_cli-0.1.20.tar.gz -
Subject digest:
eb72e66b67dd0ab14b928d4fd6905648769f4104500aca2675ae04d9700e2d80 - Sigstore transparency entry: 714383822
- Sigstore integration time:
-
Permalink:
vtsaplin/datatalk@9d88c01c2392f5f51045a0d527974064866fb4cd -
Branch / Tag:
refs/tags/v0.1.20 - Owner: https://github.com/vtsaplin
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@9d88c01c2392f5f51045a0d527974064866fb4cd -
Trigger Event:
push
-
Statement type:
File details
Details for the file datatalk_cli-0.1.20-py3-none-any.whl.
File metadata
- Download URL: datatalk_cli-0.1.20-py3-none-any.whl
- Upload date:
- Size: 18.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4d5bf443c7f6b75cc73588e935f03accd11bbd34d84f524445f6621f2338a29b
|
|
| MD5 |
ca4b8c7e6ab3500c9b166cd939f3ef87
|
|
| BLAKE2b-256 |
228fd4e42dd1c57385b4f044192dc0979aac098a4cf12e5c43ff81ea3c3ebebc
|
Provenance
The following attestation bundles were made for datatalk_cli-0.1.20-py3-none-any.whl:
Publisher:
publish.yml on vtsaplin/datatalk
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
datatalk_cli-0.1.20-py3-none-any.whl -
Subject digest:
4d5bf443c7f6b75cc73588e935f03accd11bbd34d84f524445f6621f2338a29b - Sigstore transparency entry: 714383829
- Sigstore integration time:
-
Permalink:
vtsaplin/datatalk@9d88c01c2392f5f51045a0d527974064866fb4cd -
Branch / Tag:
refs/tags/v0.1.20 - Owner: https://github.com/vtsaplin
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@9d88c01c2392f5f51045a0d527974064866fb4cd -
Trigger Event:
push
-
Statement type: