Query language for blending SQL logic and LLM reasoning across multi-modal data.
Project description
SQL ๐ค LLMs
Check out our online documentation for a more comprehensive overview.
Results from the paper are available here
pip install blendsql
BlendSQL is a superset of SQLite for problem decomposition and hybrid question-answering with LLMs.
As a result, we can Blend together...
- ๐ฅค ...operations over heterogeneous data sources (e.g. tables, text, images)
- ๐ฅค ...the structured & interpretable reasoning of SQL with the generalizable reasoning of LLMs
It can be viewed as an inversion of the typical text-to-SQL paradigm, where a user calls a LLM, and the LLM calls a SQL program.
Now, the user is given the control to oversee all calls (LLM + SQL) within a unified query language.
For example, imagine we have the following table titled parks
, containing info on national parks in the United States.
We can use BlendSQL to build a travel planning LLM chatbot to help us navigate the options below.
Name | Image | Location | Area | Recreation Visitors (2022) | Description |
---|---|---|---|---|---|
Death Valley | California, Nevada | 3,408,395.63 acres (13,793.3 km2) | 1,128,862 | Death Valley is the hottest, lowest, and driest place in the United States, with daytime temperatures that have exceeded 130 ยฐF (54 ยฐC). | |
Everglades | Alaska | 7,523,897.45 acres (30,448.1 km2) | 9,457 | The country's northernmost park protects an expanse of pure wilderness in Alaska's Brooks Range and has no park facilities. | |
New River Gorge | West Virgina | 7,021 acres (28.4 km2) | 1,593,523 | The New River Gorge is the deepest river gorge east of the Mississippi River. | |
Katmai | Alaska | 3,674,529.33 acres (14,870.3 km2) | 33,908 | This park on the Alaska Peninsula protects the Valley of Ten Thousand Smokes, an ash flow formed by the 1912 eruption of Novarupta. |
BlendSQL allows us to ask the following questions by injecting "ingredients", which are callable functions denoted by double curly brackets ({{
, }}
).
Which parks don't have park facilities?
SELECT "Name", "Description" FROM parks
WHERE {{
LLMMap(
'Does this location have park facilities?',
context='parks::Description'
)
}} = FALSE
Name | Description |
---|---|
Everglades | The country's northernmost park protects an expanse of pure wilderness in Alaska's Brooks Range and has no park facilities. |
What does the largest park in Alaska look like?
SELECT "Name",
{{ImageCaption('parks::Image')}} as "Image Description",
{{
LLMMap(
question='Size in km2?',
context='parks::Area'
)
}} as "Size in km" FROM parks
WHERE "Location" = 'Alaska'
ORDER BY "Size in km" DESC LIMIT 1
Name | Image Description | Size in km |
---|---|---|
Everglades | A forest of tall trees with a sunset in the background. | 30448.1 |
Which state is the park in that protects an ash flow?
SELECT "Location", "Name" AS "Park Protecting Ash Flow" FROM parks
WHERE "Name" = {{
LLMQA(
'Which park protects an ash flow?',
context=(SELECT "Name", "Description" FROM parks),
options="parks::Name"
)
}}
Location | Park Protecting Ash Flow |
---|---|
Alaska | Katmai |
How many parks are located in more than 1 state?
SELECT COUNT(*) FROM parks
WHERE {{LLMMap('How many states?', 'parks::Location')}} > 1
Count |
---|
1 |
What's the difference in visitors for those parks with a superlative in their description vs. those without?
SELECT SUM(CAST(REPLACE("Recreation Visitors (2022)", ',', '') AS integer)) AS "Total Visitors",
{{LLMMap('Contains a superlative?', 'parks::Description', options='t;f')}} AS "Description Contains Superlative",
GROUP_CONCAT(Name, ', ') AS "Park Names"
FROM parks
GROUP BY "Description Contains Superlative"
Total Visitors | Description Contains Superlative | Park Names |
---|---|---|
43365 | 0 | Everglades, Katmai |
2722385 | 1 | Death Valley, New River Gorge |
Now, we have an intermediate representation for our LLM to use that is explainable, debuggable, and very effective at hybrid question-answering tasks.
For in-depth descriptions of the above queries, check out our documentation.
Features
- Supports many DBMS ๐พ
- SQLite, PostgreSQL, DuckDB, Pandas (aka duckdb in a trenchcoat)
- Supports many models โจ
- Transformers, Llama.cpp, OpenAI, Ollama
- Easily extendable to multi-modal usecases ๐ผ๏ธ
- Smart parsing optimizes what is passed to external functions ๐ง
- Traverses abstract syntax tree with sqlglot to minimize LLM function calls ๐ณ
- Constrained decoding with outlines ๐
- LLM function caching, built on diskcache ๐
Quickstart
import pandas as pd
from blendsql import blend, LLMMap, LLMQA, LLMJoin
from blendsql.db import Pandas
from blendsql.models import TransformersLLM
# Load model
model = TransformersLLM('Qwen/Qwen1.5-0.5B')
# Prepare our local database
db = Pandas(
{
"w": pd.DataFrame(
(
['11 jun', 'western districts', 'bathurst', 'bathurst ground', '11-0'],
['12 jun', 'wallaroo & university nsq', 'sydney', 'cricket ground',
'23-10'],
['5 jun', 'northern districts', 'newcastle', 'sports ground', '29-0']
),
columns=['date', 'rival', 'city', 'venue', 'score']
),
"documents": pd.DataFrame(
(
['bathurst, new south wales',
'bathurst /หbรฆฮธษrst/ is a city in the central tablelands of new south wales , australia . it is about 200 kilometres ( 120 mi ) west-northwest of sydney and is the seat of the bathurst regional council .'],
['sydney',
'sydney ( /หsษชdni/ ( listen ) sid-nee ) is the state capital of new south wales and the most populous city in australia and oceania . located on australia s east coast , the metropolis surrounds port jackson.'],
['newcastle, new south wales',
'the newcastle ( /หnuหkษหsษl/ new-kah-sษl ) metropolitan area is the second most populated area in the australian state of new south wales and includes the newcastle and lake macquarie local government areas .']
),
columns=['title', 'content']
)
}
)
# Write BlendSQL query
blendsql = """
SELECT * FROM w
WHERE city = {{
LLMQA(
'Which city is located 120 miles west of Sydney?',
(SELECT * FROM documents WHERE content LIKE '%sydney%'),
options='w::city'
)
}}
"""
smoothie = blend(
query=blendsql,
db=db,
ingredients={LLMMap, LLMQA, LLMJoin},
default_model=model,
# Optional args below
infer_gen_constraints=True,
verbose=True
)
print(smoothie.df)
# โโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโ
# โ date โ rival โ city โ venue โ score โ
# โโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโผโโโโโโโโโโค
# โ 11 jun โ western districts โ bathurst โ bathurst ground โ 11-0 โ
# โโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโดโโโโโโโโโโ
print(smoothie.meta.prompts)
# [
# {
# 'answer': 'bathurst',
# 'question': 'Which city is located 120 miles west of Sydney?',
# 'context': [
# {'title': 'bathurst, new south wales', 'content': 'bathurst /หbรฆฮธษrst/ is a city in the central tablelands of new south wales , australia . it is about...'},
# {'title': 'sydney', 'content': 'sydney ( /หsษชdni/ ( listen ) sid-nee ) is the state capital of new south wales and the most populous city in...'}
# ]
# }
# ]
Citation
@article{glenn2024blendsql,
title={BlendSQL: A Scalable Dialect for Unifying Hybrid Question Answering in Relational Algebra},
author={Parker Glenn and Parag Pravin Dakle and Liang Wang and Preethi Raghavan},
year={2024},
eprint={2402.17882},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Acknowledgements
Special thanks to those below for inspiring this project. Definitely recommend checking out the linked work below, and citing when applicable!
- The authors of Binding Language Models in Symbolic Languages
- This paper was the primary inspiration for BlendSQL.
- The authors of EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images
- As far as I can tell, the first publication to propose unifying model calls within SQL
- Served as the inspiration for the vqa-ingredient.ipynb example
- The authors of Grammar Prompting for Domain-Specific Language Generation with Large Language Models
- The maintainers of the Outlines library for powering the constrained decoding capabilities of BlendSQL
- Paper at https://arxiv.org/abs/2307.09702
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file blendsql-0.0.21.tar.gz
.
File metadata
- Download URL: blendsql-0.0.21.tar.gz
- Upload date:
- Size: 142.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.19
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 46ff512e2d068df5765a394b5ddea877b11a814da3242f837759d16242edb9e6 |
|
MD5 | 084dce0844c2a06585f1751a9a0e8665 |
|
BLAKE2b-256 | 0696796650c7c15b123b0cd9e3ddd67be4df13b3878ba485ae39bc033746d284 |
File details
Details for the file blendsql-0.0.21-py3-none-any.whl
.
File metadata
- Download URL: blendsql-0.0.21-py3-none-any.whl
- Upload date:
- Size: 141.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.19
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6c941f36b349d9e836fdacdfa51a43d196937a986e1532dcba31f949d841eb11 |
|
MD5 | de48b175f0738f2edcc48afd216500c6 |
|
BLAKE2b-256 | a7c6ebe79ad73df83abc6732da063442cffc616fcb785654cd6d88ead85720c0 |