qabot
Query local or remote files with natural language questions, powered by OpenAI's GPT models and DuckDB 🦆. Works on local data files and can also query Wikidata.
Command Line Usage
$ export OPENAI_API_KEY=sk-...
$ export QABOT_MODEL_NAME=gpt-4
$ qabot -w -q "How many Hospitals are there located in Beijing"
Query: How many Hospitals are there located in Beijing
There are 39 hospitals located in Beijing.
Total tokens 1749 approximate cost in USD: 0.05562
Python Usage
from qabot import ask_wikidata, ask_file
print(ask_wikidata("How many hospitals are there in New Zealand?"))
print(ask_file("How many men were aboard the titanic?", 'data/titanic.csv'))
Output:
There are 54 hospitals in New Zealand.
There were 577 male passengers on the Titanic.
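If you have several questions about the same file, you can simply call ask_file repeatedly. A minimal sketch, assuming data/titanic.csv exists locally (each call presumably starts a fresh agent session, so no state is shared between questions):

from qabot import ask_file

questions = [
    "How many passengers survived?",
    "What was the average fare?",
]

for question in questions:
    # Each call runs an independent query session against the same CSV file.
    print(ask_file(question, 'data/titanic.csv'))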
Features
Works on local CSV files as well as remote ones:
$ qabot -f https://duckdb.org/data/holdings.csv -q "Tell me how many Apple holdings I currently have"
🦆 Creating local DuckDB database...
🦆 Loading data...
create view 'holdings' as select * from 'https://duckdb.org/data/holdings.csv';
🚀 Sending query to LLM
🧑 Tell me how many Apple holdings I currently have
🤖 You currently have 32.23 shares of Apple.
This information was obtained by summing up all the Apple ('APPL') shares in the holdings table.
SELECT SUM(shares) as total_shares FROM holdings WHERE ticker = 'APPL'
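If you want to double-check an answer like this without the LLM in the loop, the agent's final SQL can be re-run directly with the duckdb Python package (a sketch; recent DuckDB versions autoload the httpfs extension needed to read the remote CSV):

import duckdb

# Re-run the agent's final query against the same remote CSV.
result = duckdb.sql(
    "SELECT SUM(shares) AS total_shares "
    "FROM 'https://duckdb.org/data/holdings.csv' "
    "WHERE ticker = 'APPL'"
)
print(result)  # expected to match the agent's answer of 32.23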
It even works on (public) data stored in S3. You can also load data from disk or a URL via the natural language query itself:
qabot -q "Load the file 'data/titanic.csv' into a table called 'raw_passengers'. Create a view of the raw passengers table for just the male passengers. What was the average fare for surviving male passengers?" -v
🦆 Creating local DuckDB database...
🤖 Using model: gpt-3.5-turbo. Max LLM/function iterations before answer 20
🚀 Sending query to LLM
🧑 Load the file 'data/titanic.csv' into a table called 'raw_passengers'. Create a view of the raw passengers table for just the male passengers. What was the average fare for surviving male passengers?
🤖 load_data
{'files': ['data/titanic.csv']}
🦆 Imported with SQL:
["create table 'titanic' as select * from 'data/titanic.csv';"]
🤖 show_tables
🦆 name
qabot_queries
titanic
🤖 describe_table
{'table': 'titanic'}
🦆 titanic
column_name,data_type
PassengerId,BIGINT
Survived,BIGINT
Pclass,BIGINT
Name,VARCHAR
Sex,VARCHAR
Age,DOUBLE
SibSp,BIGINT
Parch,BIGINT
Ticket,VARCHAR
Fare,DOUBLE
Cabin,VARCHAR
Embarked,VARCHAR
select count(*) from 'titanic';
count_star()
891
select * from 'titanic' limit 3;
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Braund, Mr. Owen Harris,male,22.0,1,0,A/5 21171,7.25,None,S
2,1,1,Cumings, Mrs. John Bradley (Florence Briggs Thayer),female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,Heikkinen, Miss. Laina,female,26.0,0,0,STON/O2. 3101282,7.925,None,S
🤖 execute_sql
{'query': "CREATE VIEW male_passengers AS SELECT * FROM titanic WHERE Sex = 'male'"}
🦆 No output
🤖 describe_table
{'table': 'male_passengers'}
🦆 male_passengers
column_name,data_type
PassengerId,BIGINT
Survived,BIGINT
Pclass,BIGINT
Name,VARCHAR
Sex,VARCHAR
Age,DOUBLE
SibSp,BIGINT
Parch,BIGINT
Ticket,VARCHAR
Fare,DOUBLE
Cabin,VARCHAR
Embarked,VARCHAR
select count(*) from 'male_passengers';
count_star()
577
select * from 'male_passengers' limit 3;
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Braund, Mr. Owen Harris,male,22.0,1,0,A/5 21171,7.25,None,S
5,0,3,Allen, Mr. William Henry,male,35.0,0,0,373450,8.05,None,S
6,0,3,Moran, Mr. James,male,None,0,0,330877,8.4583,None,Q
🤖 execute_sql
{'query': 'SELECT AVG(Fare) AS average_fare FROM male_passengers WHERE Survived = 1'}
🦆 average_fare
40.82148440366974
🦆 {'summary': 'The average fare for surviving male passengers was $40.82.', 'detail': "To calculate the average fare for surviving male passengers, I created a view called 'male_passengers' that contains only the male passengers from the 'titanic' table. Then, I executed the SQL query `SELECT AVG(Fare) AS average_fare FROM male_passengers WHERE Survived = 1` to calculate the average fare for the surviving male passengers."}
🚀 Question:
🧑 Load the file 'data/titanic.csv' into a table called 'raw_passengers'. Create a view of the raw passengers table for just the male passengers. What was the average fare for surviving male passengers?
🤖 The average fare for surviving male passengers was $40.82.
To calculate the average fare for surviving male passengers, I created a view called 'male_passengers' that contains only the male passengers from the 'titanic' table. Then, I executed the SQL query `SELECT AVG(Fare) AS average_fare FROM male_passengers WHERE Survived = 1` to calculate the average fare for the surviving male passengers.
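The whole transcript above boils down to a table load, a view, and one aggregate query. For reference, the same steps can be reproduced directly with the duckdb package, with no LLM involved (a sketch, assuming data/titanic.csv is present):

import duckdb

con = duckdb.connect()

# Load the CSV into a table, as the agent's load_data call did.
con.sql("CREATE TABLE titanic AS SELECT * FROM 'data/titanic.csv'")

# The view the agent created for male passengers only.
con.sql("CREATE VIEW male_passengers AS SELECT * FROM titanic WHERE Sex = 'male'")

# The final aggregate query; should print roughly 40.82.
print(con.sql("SELECT AVG(Fare) AS average_fare FROM male_passengers WHERE Survived = 1"))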
Quickstart
You need to set the OPENAI_API_KEY environment variable to your OpenAI API key, which you can create at https://platform.openai.com/api-keys.
Install the qabot command line tool using pip/poetry:
$ pip install -U qabot
Then run the qabot command with either local files (-f my-file.csv) or -w to query Wikidata.
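If you prefer to configure everything from Python rather than the shell, setting the environment variables before calling qabot works too (a sketch; QABOT_MODEL_NAME is the same optional override used in the CLI examples above):

import os

# Set credentials before calling qabot; never hard-code a real key in source.
os.environ["OPENAI_API_KEY"] = "sk-..."   # placeholder, use your own key
os.environ["QABOT_MODEL_NAME"] = "gpt-4"  # optional model override

from qabot import ask_wikidata

print(ask_wikidata("How many hospitals are there in New Zealand?"))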
Examples
Local CSV file(s)
$ qabot -q "how many passengers survived by gender?" -f data/titanic.csv
🦆 Loading data from files...
Loading data/titanic.csv into table titanic...
Query: how many passengers survived by gender?
Result:
There were 233 female passengers and 109 male passengers who survived.
🚀 any further questions? [y/n] (y): y
🚀 Query: what was the largest family who did not survive?
Query: what was the largest family who did not survive?
Result:
The largest family who did not survive was the Sage family, with 8 members.
🚀 any further questions? [y/n] (y): n
Query Wikidata
Use the -w flag to query Wikidata. For best results, use the gpt-4 model.
$ export QABOT_MODEL_NAME=gpt-4
$ qabot -w -q "How many Hospitals are there located in Beijing"
Intermediate steps and database queries
Use the -v flag to see the intermediate steps and database queries. Sometimes it takes a long route to get to the answer, but it's interesting to see how it gets there.
qabot -f data/titanic.csv -q "how many passengers survived by gender?" -v
Data accessed via HTTP/S3
Use the -f <url> flag to load data from a URL, e.g. a CSV file on S3:
$ qabot -f s3://covid19-lake/enigma-jhu-timeseries/csv/jhu_csse_covid_19_timeseries_merged.csv -q "how many confirmed cases of covid are there?" -v
🦆 Loading data from files...
create table jhu_csse_covid_19_timeseries_merged as select * from 's3://covid19-lake/enigma-jhu-timeseries/csv/jhu_csse_covid_19_timeseries_merged.csv';
Result:
264308334 confirmed cases
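Under the hood, reading from s3:// URLs relies on DuckDB's httpfs extension. A sketch of the equivalent raw DuckDB call for the same public bucket (the region setting is an assumption; private buckets would additionally need credentials):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region='us-east-2'")  # assumed region of the public covid19-lake bucket

con.sql(
    "CREATE TABLE covid AS SELECT * FROM "
    "'s3://covid19-lake/enigma-jhu-timeseries/csv/jhu_csse_covid_19_timeseries_merged.csv'"
)
print(con.sql("SELECT COUNT(*) FROM covid"))  # row count; the agent went on to aggregate the confirmed-cases column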
Ideas
- Streaming mode to output results as they come in
- Token limits
- Supervisor agent: assess whether a query is "safe" to run; could ask for user confirmation before running anything that gets flagged
- Often we can zero-shot the question and get a single query out; perhaps try this before the MRKL chain
- Test each zero-shot agent individually
- Generate and pass back to the user any assumptions made
- Add an optional "clarify" tool to the chain that asks the user to clarify the question
- Create a query checker tool that checks whether a query looks valid and/or safe (a rough sketch follows this list)
- Inject AWS credentials into DuckDB so we can access private resources in S3
- Automatic publishing to PyPI; see https://blog.pypi.org/posts/2023-04-20-introducing-trusted-publishers/
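As a starting point for the query-checker idea above, here is a minimal sketch of a conservative check that only lets single, read-only statements through (the names are hypothetical, not part of qabot):

import re

# Keywords that modify data or schema; any match gets the query flagged.
UNSAFE_KEYWORDS = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|ATTACH|COPY|PRAGMA)\b",
    re.IGNORECASE,
)

def looks_safe(query: str) -> bool:
    """Rough read-only check: reject multi-statement strings and anything
    containing a data- or schema-modifying keyword."""
    statements = [s for s in query.split(";") if s.strip()]
    if len(statements) != 1:
        return False
    return UNSAFE_KEYWORDS.search(query) is None

assert looks_safe("SELECT AVG(Fare) FROM male_passengers WHERE Survived = 1")
assert not looks_safe("DROP TABLE titanic")

A supervisor agent could run such a check first and only prompt the user for confirmation when a query is flagged.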