
qabot

Query local or remote data files with natural language, powered by OpenAI's GPT models and DuckDB 🦆.

Works with local and remote files (CSV, Parquet).

Installation

Install with pipx, pip, etc.:

pipx install qabot

Security Risks

This program gives an LLM access to your local and network-accessible files and allows it to execute arbitrary SQL queries. See Security.md for more information.

Command Line Usage

$ export OPENAI_API_KEY=sk-...
$ export QABOT_MODEL_NAME=gpt-4o
$ qabot -w -q "How many Hospitals are there located in Beijing"
Query: How many Hospitals are there located in Beijing
There are 39 hospitals located in Beijing.
Total tokens 1749 approximate cost in USD: 0.05562

Python Usage

from qabot import ask_wikidata, ask_file, ask_database

print(ask_wikidata("How many hospitals are there in New Zealand?"))
print(ask_file("How many men were aboard the titanic?", 'data/titanic.csv'))
print(ask_database("How many product images are there?", 'postgresql://user:password@localhost:5432/dbname'))

Output:

There are 54 hospitals in New Zealand.
There were 577 male passengers on the Titanic.
There are 6,225 product images.
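
If you prefer to configure everything in Python, here is a minimal sketch that sets the API key in-process instead of exporting it in a shell, assuming qabot picks up OPENAI_API_KEY from the environment as the command line examples suggest:

import os

os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # replace with your real key

from qabot import ask_file

print(ask_file("How many men were aboard the titanic?", "data/titanic.csv"))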

Features

Works on local CSV and Excel files, and on remote CSV files:

$ qabot -f https://duckdb.org/data/holdings.csv -q "Tell me how many Apple holdings I currently have"
 🦆 Creating local DuckDB database...
 🦆 Loading data...
create view 'holdings' as select * from 'https://duckdb.org/data/holdings.csv';
 🚀 Sending query to LLM
 🧑 Tell me how many Apple holdings I currently have


 🤖 You currently have 32.23 shares of Apple.


This information was obtained by summing up all the Apple ('APPL') shares in the holdings table.

SELECT SUM(shares) as total_shares FROM holdings WHERE ticker = 'APPL'
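
Under the hood the answer comes from plain DuckDB SQL. For comparison, a minimal sketch that replays the generated statements directly with the duckdb Python package, no LLM involved (recent DuckDB builds autoload the httpfs extension needed for https URLs):

import duckdb

con = duckdb.connect()  # in-memory database, like qabot's local DuckDB
# The same statements qabot generated above:
con.sql("create view holdings as select * from 'https://duckdb.org/data/holdings.csv'")
print(con.sql("SELECT SUM(shares) AS total_shares FROM holdings WHERE ticker = 'APPL'"))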

Even on (public) data stored in S3.

You can also load data from disk or a URL via the natural language query itself:

Load the file 'data/titanic.csv' into a table called 'raw_passengers'. Create a view of the raw passengers table for just the male passengers. What was the average fare for surviving male passengers?

~/Dev/qabot> qabot -q "Load the file 'data/titanic.csv' into a table called 'raw_passengers'. Create a view of the raw passengers table for just the male passengers. What was the average fare for surviving male passengers?" -v
 🦆 Creating local DuckDB database...
 🤖 Using model: gpt-4-1106-preview. Max LLM/function iterations before answer 20
 🚀 Sending query to LLM
 🧑 Load the file 'data/titanic.csv' into a table called 'raw_passengers'. Create a view of the raw passengers table for just the male passengers. What was the average fare for surviving male passengers?
 🤖 load_data
{'files': ['data/titanic.csv']}
 🦆 Imported with SQL:
["create table 'titanic' as select * from 'data/titanic.csv';"]
 🤖 execute_sql
{'query': "CREATE VIEW male_passengers AS SELECT * FROM titanic WHERE Sex = 'male';"}
 🦆 No output
 🤖 execute_sql
{'query': 'SELECT AVG(Fare) as average_fare FROM male_passengers WHERE Survived = 1;'}
 🦆 average_fare
40.82148440366974
 🦆 {"summary": "The average fare for surviving male passengers was approximately $40.82.", "detail": "The average fare for surviving male passengers was calculated by creating a view called `male_passengers` to filter only the male passengers from the `titanic` table, and then running a query to calculate the average fare for male passengers who survived. The calculated average fare is approximately $40.82.", "query": "CREATE VIEW male_passengers AS SELECT * FROM titanic WHERE Sex = 'male';\nSELECT AVG(Fare) as average_fare FROM male_passengers WHERE Survived = 1;"}


 🚀 Question:
 🧑 Load the file 'data/titanic.csv' into a table called 'raw_passengers'. Create a view of the raw passengers table for just the male passengers. What was the average fare for surviving male passengers?
 🤖 The average fare for surviving male passengers was approximately $40.82.


The average fare for surviving male passengers was calculated by creating a view called `male_passengers` to filter only the male passengers from the `titanic` table, and then running a query to calculate the average fare for male passengers who survived. The calculated average fare is approximately $40.82.

CREATE VIEW male_passengers AS SELECT * FROM titanic WHERE Sex = 'male';
SELECT AVG(Fare) as average_fare FROM male_passengers WHERE Survived = 1;
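
Since the generated SQL is shown, it can be replayed directly against DuckDB to check the answer. A minimal sketch, assuming data/titanic.csv has the Sex, Survived and Fare columns used above:

import duckdb

con = duckdb.connect()
con.sql("create table titanic as select * from 'data/titanic.csv'")
con.sql("CREATE VIEW male_passengers AS SELECT * FROM titanic WHERE Sex = 'male'")
average_fare = con.sql(
    "SELECT AVG(Fare) AS average_fare FROM male_passengers WHERE Survived = 1"
).fetchone()[0]
print(average_fare)  # approximately 40.82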

Quickstart

You need to set the OPENAI_API_KEY environment variable to your OpenAI API key, which you can get from the OpenAI platform (https://platform.openai.com).

Install the qabot command line tool using pip/pipx:

$ pip install -U qabot

Then run the qabot command with either local files (-f my-file.csv) or -w to query Wikidata.

See all options with qabot --help

Examples

Local CSV file/s

$ qabot -q "how many passengers survived by gender?" -f data/titanic.csv
🦆 Loading data from files...
Loading data/titanic.csv into table titanic...

Query: how many passengers survived by gender?
Result:
There were 233 female passengers and 109 male passengers who survived.


 🚀 any further questions? [y/n] (y): y

 🚀 Query: what was the largest family who did not survive? 
Query: what was the largest family who did not survive?
Result:
The largest family who did not survive was the Sage family, with 8 members.

 🚀 any further questions? [y/n] (y): n
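
The same questions can be scripted with the Python API shown earlier. A sketch, with the caveat (an assumption) that each ask_file call is independent, unlike the interactive prompt, which keeps the session open:

from qabot import ask_file

# Each call loads the file and answers one question.
print(ask_file("how many passengers survived by gender?", "data/titanic.csv"))
print(ask_file("what was the largest family who did not survive?", "data/titanic.csv"))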

Query Wikidata

Use the -w flag to query Wikidata. For best results use GPT-4 or a similar model.

$ export QABOT_MODEL_NAME=gpt-4
$ qabot -w -q "How many Hospitals are there located in Beijing"

Intermediate steps and database queries

Use the -v flag to see the intermediate steps and database queries. Sometimes it takes a long route to get to the answer, but it's interesting to see how it gets there.

qabot -f data/titanic.csv -q "how many passengers survived by gender?" -v

Data accessed via http/s3

Use the -f <url> flag to load data from a URL, e.g. a CSV file on S3:

$ qabot -f s3://covid19-lake/enigma-jhu-timeseries/csv/jhu_csse_covid_19_timeseries_merged.csv -q "how many confirmed cases of covid are there?" -v
🦆 Loading data from files...
create table jhu_csse_covid_19_timeseries_merged as select * from 's3://covid19-lake/enigma-jhu-timeseries/csv/jhu_csse_covid_19_timeseries_merged.csv';

Result:
264308334 confirmed cases
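
The same load can be reproduced directly in DuckDB. A minimal sketch, assuming anonymous access to this public bucket works in your DuckDB build (the httpfs extension and the bucket's region are the moving parts, and the region below is an assumption):

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")  # provides s3:// support
con.sql("LOAD httpfs")
con.sql("SET s3_region = 'us-east-2'")  # assumption: region of the public bucket
con.sql(
    "create table jhu as select * from "
    "'s3://covid19-lake/enigma-jhu-timeseries/csv/jhu_csse_covid_19_timeseries_merged.csv'"
)
print(con.sql("describe jhu"))  # inspect the columns before aggregating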

Docker Usage

You can build and run the Docker image for qabot using the following instructions:

Building the Docker Image

To build the Docker image, run the following command in the root directory of the repository:

docker build -t qabot .

Running the Docker Image

To run the image (either your locally built qabot image or the published ghcr.io/hardbyte/qabot image), use the following command:

docker run --rm -e OPENAI_API_KEY=your_openai_api_key ghcr.io/hardbyte/qabot -w -q "How many Hospitals are there located in Beijing"

Replace your_openai_api_key with your actual OpenAI API key.

Ideas

  • Streaming mode to output results as they come in
  • Token limits and better reporting of costs
  • Supervisor agent - assess whether a query is "safe" to run; it could ask for user confirmation before running anything that gets flagged.
  • Often we can zero-shot the question and get a single query out - perhaps we try this before the MRKL chain
  • Test each zero-shot agent individually
  • Generate and pass back assumptions made to the user
  • Add an optional "clarify" tool to the chain that asks the user to clarify the question
  • Create a query checker tool that checks if the query looks valid and/or safe (see the sketch after this list)
  • Inject AWS credentials into duckdb for access to private resources in S3
  • Automatic publishing to PyPI, e.g. using trusted publishers
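
As a starting point for the query-checker idea, a purely illustrative sketch (not part of qabot) that flags obviously destructive statements with a crude deny-list; real validation would need a proper SQL parser:

import re

# Hypothetical helper, not part of qabot: reject statements containing
# obviously destructive keywords before they reach DuckDB.
DENYLIST = re.compile(
    r"\b(drop|delete|update|insert|alter|attach|copy|pragma)\b", re.IGNORECASE
)

def looks_safe(query: str) -> bool:
    return DENYLIST.search(query) is None

assert looks_safe("SELECT AVG(Fare) FROM male_passengers WHERE Survived = 1")
assert not looks_safe("DROP TABLE titanic")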
