Command-line tool for querying CSV and Parquet data with natural language

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

vtsaplin

These details have not been verified by PyPI

Project description

DataTalk CLI

Ask questions to your CSV & Parquet files in plain English, right from your terminal.

DataTalk CLI eliminates the need to remember complex command-line flags or write SQL for quick data checks. Just ask your question in natural language, and get instant answers—all while keeping your data 100% local and private.

Demo

Why DataTalk?

Working with data files on the command line has always involved a trade-off:

GUI tools (Excel, Tableau) are slow for large files and force you to leave your terminal workflow
Powerful CLI tools (qsv, xsv) require memorizing dozens of commands and flags
SQL-based tools (csvcli, DuckDB CLI) demand writing formal queries for simple questions

DataTalk eliminates this friction. It's a conversational data terminal that lets you:

Stay in your terminal workflow without context switching
Express your intent directly in natural language
Get accurate, deterministic results computed locally
Query both CSV and Parquet files with the same interface

Perfect for developers, data engineers, and analysts who value speed and terminal efficiency.

How It Works & Data Privacy

DataTalk is designed with a privacy-first, local-first architecture:

Your data stays on your machine - Files are loaded into a local DuckDB instance
Only metadata is sent to the LLM - The AI receives your question and the file schema (column names and types)
AI generates SQL - The LLM creates a SQL query based on your question
Query runs locally - DuckDB executes the query on your local data
You see the results - Clean, formatted output in your terminal

What is sent to the LLM:

Your natural language question
Column names and data types
A few sample values for context (optional)

What is NOT sent:

Your actual data rows
Sensitive information
Full file contents

This approach gives you the convenience of natural language while maintaining full control over your data privacy.

Key Features

100% Local Processing - Data never leaves your machine
Accurate Calculations - No AI hallucinations; real SQL on real data
Token Efficient - Only schema metadata sent to LLM, saving costs
Multiple Formats - Seamless support for CSV and Parquet files
No Coding Required - Query data without pandas or SQL knowledge
Transparent - Optional --show-sql flag to see generated queries
Smart Suggestions - Context-aware follow-up questions

Installation

pip install datatalk-cli

That's it! The command-line tool dtalk will be available immediately.

Requirements:

Python 3.9 or higher
An OpenAI or Azure OpenAI API key

Quick Start

Get up and running in 60 seconds:

Install DataTalk:
```
pip install datatalk-cli
```
Set your API key:
```
export OPENAI_API_KEY="your-key-here"
```

Download sample data and try it:

curl -O https://raw.githubusercontent.com/vtsaplin/datatalk/main/sample_data/sales_data.csv
dtalk sales_data.csv "What are the top 5 products by revenue?"

That's it! You just queried data using plain English.

Configuration

DataTalk supports both OpenAI and Azure OpenAI. Choose your preferred method:

Option 1: Environment Variables (Recommended)

For OpenAI:

export OPENAI_API_KEY="your-api-key-here"
export OPENAI_MODEL="gpt-4o"  # Optional, defaults to gpt-4o

For Azure OpenAI:

export AZURE_DEPLOYMENT_TARGET_URL="https://your-resource.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-12-01-preview"
export AZURE_OPENAI_API_KEY="your-api-key-here"

Option 2: Configuration File

Create a .env file in your project directory with the same variables as above.

Option 3: Interactive Setup

Simply run dtalk with a data file, and the tool will prompt you to enter your credentials. They'll be saved to ~/.config/datatalk/config.json for future use.

Configuration Management

dtalk --config-info      # Show current configuration
dtalk --reset-config     # Clear saved configuration

Usage

DataTalk offers two modes: interactive for exploration and direct query for scripting.

Interactive Mode

Launch an interactive session to ask multiple questions:

dtalk sales_data.csv

You'll see:

Basic dataset statistics (row count, columns)
AI-generated suggested questions
An interactive prompt where you can ask anything

Type quit, exit, or press Ctrl+D to exit.

Direct Query Mode

Run a single query and exit—perfect for scripts and automation:

dtalk sales_data.csv "What were total sales in Q4?"

Examples: Basic Queries

Get started with these common questions:

File Structure & Overview:

dtalk employees.csv "How many rows are in this file?"
dtalk inventory.csv "Show me the column names and their data types"
dtalk customers.csv "Display the first 10 rows"

Simple Statistics:

dtalk sales_data.csv "What is the average order value?"
dtalk employees.csv "How many employees were hired each year?"
dtalk inventory.csv "Count the number of unique product categories"

Examples: Filtering & Sorting

Filtering Data:

dtalk customers.csv "Show all customers from Canada"
dtalk sales_data.csv "List orders over $1000 from the last month"
dtalk employees.csv "Find all engineers in the San Francisco office"

Sorting & Top N:

dtalk sales_data.csv "Top 10 customers by total purchase amount"
dtalk inventory.csv "Show the 5 most expensive products, descending"
dtalk employees.csv "List employees by hire date, newest first"

Examples: Aggregations & Analysis

Grouping & Aggregation:

dtalk sales_data.csv "Total revenue by product category"
dtalk customers.csv "Count customers by country and city"
dtalk employees.csv "Average salary by department and job title"

Time-Series Analysis:

dtalk sales_data.csv "Monthly revenue trend for 2024"
dtalk orders.csv "Compare sales between Q3 and Q4"
dtalk events.csv "Daily active users for the past week"

Multi-Column Analysis:

dtalk sales_data.csv "Which region has the highest average order value?"
dtalk employees.csv "Show department with the most growth in the last year"
dtalk inventory.csv "What's the profit margin by category and supplier?"

Examples: Working with Different Formats

DataTalk seamlessly handles both CSV and Parquet files:

# CSV files
dtalk data.csv "your question"

# Parquet files (same syntax!)
dtalk data.parquet "your question"

# Large Parquet files are fast
dtalk large_dataset.parquet "Count distinct users"

Advanced Features

See the Generated SQL

Want to learn SQL or verify the query? Use the --show-sql flag:

dtalk sales_data.csv "Top 5 products by revenue" --show-sql

Output will include:

SQL: SELECT product_name, SUM(revenue) as total_revenue 
     FROM events GROUP BY product_name 
     ORDER BY total_revenue DESC LIMIT 5

This is perfect for:

Learning SQL patterns
Debugging unexpected results
Auditing what queries are generated
Reusing queries in other tools

Track Token Usage

Monitor your API costs with --show-tokens:

dtalk data.csv "your question" --show-tokens

Displays:

Number of API requests made
Input tokens used
Output tokens used
Total tokens consumed

Control Output Visibility

Fine-tune what information is displayed:

# Hide detailed dataset schema
dtalk data.csv --hide-schema

# Hide raw query results (useful when scripting)
dtalk data.csv "query" --hide-data

# Hide AI-generated question suggestions
dtalk data.csv --hide-suggestions

Scripting & Automation

DataTalk works great in shell scripts and pipelines:

#!/bin/bash
# Daily sales report script

REPORT=$(dtalk sales.csv "What was yesterday's total revenue?" --hide-data -q)
echo "Daily Sales Report: $REPORT"

# Email or post to Slack
echo "$REPORT" | mail -s "Daily Sales" team@company.com

Combine with other Unix tools:

# Run queries on multiple files
for file in data_*.csv; do
  echo "Analyzing $file..."
  dtalk "$file" "row count"
done

# Chain with grep to filter results
dtalk sales.csv "show all orders" | grep "Canada"

Use Cases

DataTalk shines in these scenarios:

Quick Data Checks

"How many rows are in this file?" before loading it into a database
"What's the date range of this dataset?" for validation
"Show me sample rows" to verify data format

Ad-Hoc Analysis

Exploring unfamiliar datasets without writing code
Quick statistics during incident response
Spot-checking data exports from other systems

Data Engineering Workflows

Validating ETL outputs
Checking data quality after transformations
Quick profiling of incoming data files

Learning & Documentation

Using --show-sql to learn SQL patterns
Documenting common queries in natural language
Creating repeatable analysis commands for team members

Comparison with Other Tools

Tool	Interface	Query Method	Best For	DataTalk Advantage
DataTalk	CLI	Natural Language	Terminal users who value speed	Zero cognitive overhead; just ask
qsv	CLI	Commands/Flags	High-performance data wrangling	Easier for ad-hoc questions; no flags to remember
csvcli	CLI	SQL	Analysts who know SQL	Natural language for those who don't know SQL
DuckDB CLI	CLI	SQL	Complex analytical queries	Simpler for quick questions
pandas	Python	Code	Programmers, complex analysis	No coding required; instant results
Excel	GUI	Point & click	Non-technical users	Works with huge files; stays in terminal

When to use DataTalk vs. alternatives:

Use DataTalk for quick, explorative questions in your terminal workflow
Use qsv/DuckDB for complex transformations or when you need maximum performance
Use pandas when you need programmatic control or complex custom logic
Use Excel/Tableau for visualization and interactive dashboards

Roadmap

Planned features and improvements:

Local LLM Support (High Priority)

Integration with Ollama for 100% offline, private querying
Zero API costs for sensitive data analysis

Query Transparency

--dry-run mode to see generated query without execution
Query history and replay functionality

Enhanced Output

--output json|csv|markdown for better scriptability
Basic visualization generation ("plot sales by month")

Advanced Capabilities

Multi-file queries ("compare revenue between sales_2023.csv and sales_2024.csv")
Saved query templates for common patterns
Integration with other CLI tools via pipes

Development

Clone and run locally:

git clone https://github.com/vtsaplin/datatalk.git
cd datatalk
uv run dtalk --help

Testing with Sample Data

Test with included sample files in interactive mode:

uv run dtalk sample_data/sales_data.csv
uv run dtalk sample_data/employees.csv
uv run dtalk sample_data/inventory.csv

Run direct queries:

uv run dtalk sample_data/employees.csv -p "Show hiring patterns by year"
uv run dtalk sample_data/sales_data.csv -p "What are the top categories?" --show-sql

Running Tests

uv sync --extra test
uv run pytest

Building & Releasing

Generate demo recording:

vhs demo.tape

Build package:

python -m build

Release (maintainers only):

./release.sh

FAQ

Q: Is my data sent to OpenAI/Azure?
A: No. Only the schema (column names and types) and your question are sent. Your actual data rows stay completely local.

Q: How much does it cost to use?
A: You need an OpenAI or Azure OpenAI API key. Each query typically uses 200-500 tokens (~$0.001-0.005 per query with GPT-4). Use --show-tokens to monitor usage.

Q: Can I use this offline or with local LLMs?
A: Local LLM support (via Ollama) is on our roadmap and will be released soon. This will enable 100% offline, private querying.

Q: What's the difference between DataTalk and pandas-ai?
A: pandas-ai is a Python library for use in code/notebooks. DataTalk is a standalone CLI tool for terminal users who want instant results without writing code.

Q: Does this work with databases like PostgreSQL or MySQL?
A: Currently, DataTalk only supports CSV and Parquet files. Database support may be added in future versions.

Q: Can I see what SQL queries are being generated?
A: Yes! Use the --show-sql flag to see the exact SQL query generated by the LLM before it executes.

Q: How large of files can I query?
A: DataTalk uses DuckDB under the hood, which efficiently handles multi-gigabyte files. Parquet files are particularly fast. The limiting factor is your local disk space and RAM.

Q: Can I use this in production scripts?
A: Yes! Use direct query mode (dtalk file.csv "query" --hide-data) in scripts. However, be aware of API rate limits and costs for high-frequency usage.

Troubleshooting

Problem: "Command not found: dtalk"
Solution: Make sure you've installed the package (pip install datatalk-cli) and that your Python scripts directory is in your PATH.

Problem: "No API key found"
Solution: Set your API key with export OPENAI_API_KEY="your-key" or run dtalk --config-info to check your configuration.

Problem: "Query returns unexpected results"
Solution: Use --show-sql to see the generated query. The LLM might be interpreting your question differently than expected. Try rephrasing your question more explicitly.

Problem: "File format not supported"
Solution: DataTalk currently supports only .csv and .parquet files. For Excel files, export to CSV first, or try csvcli which supports Excel natively.

Problem: "Queries are slow"
Solution:

Large files: Parquet format is faster than CSV for big datasets
API latency: Most time is spent waiting for the LLM response (1-3 seconds)
Complex queries: Consider breaking them into smaller questions

Problem: "Rate limit exceeded"
Solution: You've hit your OpenAI API rate limit. Wait a few moments or upgrade your API plan.

Contributing

Contributions are welcome! Here's how you can help:

Report Bugs
Open an issue on GitHub with:

The command you ran
Expected vs. actual behavior
Your Python version and OS

Suggest Features
Check the Roadmap section first, then open an issue describing:

The use case
How it would work
Why it's valuable

Submit Pull Requests

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes with tests
Run the test suite (uv run pytest)
Submit a PR with a clear description

Areas Where We'd Love Help:

Local LLM integration (Ollama, llama.cpp)
Additional output formats (JSON, CSV, Markdown)
Performance optimizations
Documentation improvements
Additional example queries

License

MIT License - see the LICENSE file for details.

Acknowledgments

Built with:

DuckDB - Fast in-process analytical database
OpenAI API - Natural language understanding
Rich - Beautiful terminal formatting
Pandas - Data manipulation

Inspired by the growing ecosystem of AI-powered developer tools and the enduring power of the command line.

Support

Documentation
Issue Tracker
Discussions
Star this repo if you find it useful!

Made for terminal users who love data

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

vtsaplin

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.1.27

Nov 30, 2025

0.1.26

Nov 26, 2025

0.1.25

Nov 25, 2025

0.1.24

Nov 24, 2025

0.1.23

Nov 22, 2025

0.1.22

Nov 22, 2025

0.1.21

Nov 22, 2025

This version

0.1.20

Nov 21, 2025

0.1.19

Nov 21, 2025

0.1.18

Nov 21, 2025

0.1.17

Nov 21, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datatalk_cli-0.1.20.tar.gz (26.0 kB view details)

Uploaded Nov 21, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

datatalk_cli-0.1.20-py3-none-any.whl (18.7 kB view details)

Uploaded Nov 21, 2025 Python 3

File details

Details for the file datatalk_cli-0.1.20.tar.gz.

File metadata

Download URL: datatalk_cli-0.1.20.tar.gz
Upload date: Nov 21, 2025
Size: 26.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for datatalk_cli-0.1.20.tar.gz
Algorithm	Hash digest
SHA256	`eb72e66b67dd0ab14b928d4fd6905648769f4104500aca2675ae04d9700e2d80`
MD5	`6a44fafef176701a52c8a4d63ec5cd9f`
BLAKE2b-256	`f57f5c7db49b46ce36fbc2e97f102ed708b47fb6041bfac36c17258509d1e16c`

See more details on using hashes here.

Provenance

The following attestation bundles were made for datatalk_cli-0.1.20.tar.gz:

Publisher: publish.yml on vtsaplin/datatalk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datatalk_cli-0.1.20.tar.gz
- Subject digest: eb72e66b67dd0ab14b928d4fd6905648769f4104500aca2675ae04d9700e2d80
- Sigstore transparency entry: 714383822
- Sigstore integration time: Nov 21, 2025
Source repository:
- Permalink: vtsaplin/datatalk@9d88c01c2392f5f51045a0d527974064866fb4cd
- Branch / Tag: refs/tags/v0.1.20
- Owner: https://github.com/vtsaplin
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@9d88c01c2392f5f51045a0d527974064866fb4cd
- Trigger Event: push

File details

Details for the file datatalk_cli-0.1.20-py3-none-any.whl.

File metadata

Download URL: datatalk_cli-0.1.20-py3-none-any.whl
Upload date: Nov 21, 2025
Size: 18.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for datatalk_cli-0.1.20-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4d5bf443c7f6b75cc73588e935f03accd11bbd34d84f524445f6621f2338a29b`
MD5	`ca4b8c7e6ab3500c9b166cd939f3ef87`
BLAKE2b-256	`228fd4e42dd1c57385b4f044192dc0979aac098a4cf12e5c43ff81ea3c3ebebc`

See more details on using hashes here.

Provenance

The following attestation bundles were made for datatalk_cli-0.1.20-py3-none-any.whl:

Publisher: publish.yml on vtsaplin/datatalk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datatalk_cli-0.1.20-py3-none-any.whl
- Subject digest: 4d5bf443c7f6b75cc73588e935f03accd11bbd34d84f524445f6621f2338a29b
- Sigstore transparency entry: 714383829
- Sigstore integration time: Nov 21, 2025
Source repository:
- Permalink: vtsaplin/datatalk@9d88c01c2392f5f51045a0d527974064866fb4cd
- Branch / Tag: refs/tags/v0.1.20
- Owner: https://github.com/vtsaplin
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@9d88c01c2392f5f51045a0d527974064866fb4cd
- Trigger Event: push

datatalk-cli 0.1.20

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Project description

DataTalk CLI

Why DataTalk?

How It Works & Data Privacy

Key Features

Installation

Quick Start

Configuration

Option 1: Environment Variables (Recommended)

Option 2: Configuration File

Option 3: Interactive Setup

Configuration Management

Usage

Interactive Mode

Direct Query Mode

Examples: Basic Queries

Examples: Filtering & Sorting

Examples: Aggregations & Analysis

Examples: Working with Different Formats

Advanced Features

See the Generated SQL

Track Token Usage

Control Output Visibility

Scripting & Automation

Use Cases

Comparison with Other Tools

Roadmap

Development

Testing with Sample Data

Running Tests

Building & Releasing

FAQ

Troubleshooting

Contributing

License

Acknowledgments

Support

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance