Skip to main content

Query language for blending SQL logic and LLM reasoning across multi-modal data. [Findings of ACL 2024]

Project description


blendsql

SQL ๐Ÿค LLMs

Check out our online documentation for a more comprehensive overview.

Results from the paper are available here


pip install blendsql

BlendSQL is a superset of SQLite for problem decomposition and hybrid question-answering with LLMs.

As a result, we can Blend together...

  • ๐Ÿฅค ...operations over heterogeneous data sources (e.g. tables, text, images)
  • ๐Ÿฅค ...the structured & interpretable reasoning of SQL with the generalizable reasoning of LLMs

It can be viewed as an inversion of the typical text-to-SQL paradigm, where a user calls a LLM, and the LLM calls a SQL program.

Now, the user is given the control to oversee all calls (LLM + SQL) within a unified query language.

comparison

For example, imagine we have the following table titled parks, containing info on national parks in the United States.

We can use BlendSQL to build a travel planning LLM chatbot to help us navigate the options below.

Name Image Location Area Recreation Visitors (2022) Description
Death Valley death_valley.jpeg California, Nevada 3,408,395.63 acres (13,793.3 km2) 1,128,862 Death Valley is the hottest, lowest, and driest place in the United States, with daytime temperatures that have exceeded 130 ยฐF (54 ยฐC).
Everglades everglades.jpeg Alaska 7,523,897.45 acres (30,448.1 km2) 9,457 The country's northernmost park protects an expanse of pure wilderness in Alaska's Brooks Range and has no park facilities.
New River Gorge new_river_gorge.jpeg West Virgina 7,021 acres (28.4 km2) 1,593,523 The New River Gorge is the deepest river gorge east of the Mississippi River.
Katmai katmai.jpg Alaska 3,674,529.33 acres (14,870.3 km2) 33,908 This park on the Alaska Peninsula protects the Valley of Ten Thousand Smokes, an ash flow formed by the 1912 eruption of Novarupta.

BlendSQL allows us to ask the following questions by injecting "ingredients", which are callable functions denoted by double curly brackets ({{, }}).

Which parks don't have park facilities?

SELECT "Name", "Description" FROM parks
  WHERE {{
      LLMMap(
          'Does this location have park facilities?',
          context='parks::Description'
      )
  }} = FALSE
Name Description
Everglades The country's northernmost park protects an expanse of pure wilderness in Alaska's Brooks Range and has no park facilities.

What does the largest park in Alaska look like?

SELECT "Name",
{{ImageCaption('parks::Image')}} as "Image Description", 
{{
    LLMMap(
        question='Size in km2?',
        context='parks::Area'
    )
}} as "Size in km" FROM parks
WHERE "Location" = 'Alaska'
ORDER BY "Size in km" DESC LIMIT 1
Name Image Description Size in km
Everglades A forest of tall trees with a sunset in the background. 30448.1

Which state is the park in that protects an ash flow?

SELECT "Location", "Name" AS "Park Protecting Ash Flow" FROM parks 
    WHERE "Name" = {{
      LLMQA(
        'Which park protects an ash flow?',
        context=(SELECT "Name", "Description" FROM parks),
        options="parks::Name"
      ) 
  }}
Location Park Protecting Ash Flow
Alaska Katmai

How many parks are located in more than 1 state?

SELECT COUNT(*) FROM parks
    WHERE {{LLMMap('How many states?', 'parks::Location')}} > 1
Count
1

What's the difference in visitors for those parks with a superlative in their description vs. those without?

SELECT SUM(CAST(REPLACE("Recreation Visitors (2022)", ',', '') AS integer)) AS "Total Visitors", 
{{LLMMap('Contains a superlative?', 'parks::Description', options='t;f')}} AS "Description Contains Superlative",
GROUP_CONCAT(Name, ', ') AS "Park Names"
FROM parks
GROUP BY "Description Contains Superlative"
Total Visitors Description Contains Superlative Park Names
43365 0 Everglades, Katmai
2722385 1 Death Valley, New River Gorge

Now, we have an intermediate representation for our LLM to use that is explainable, debuggable, and very effective at hybrid question-answering tasks.

For in-depth descriptions of the above queries, check out our documentation.

Features

  • Supports many DBMS ๐Ÿ’พ
    • SQLite, PostgreSQL, DuckDB, Pandas (aka duckdb in a trenchcoat)
  • Supports many models โœจ
    • Transformers, OpenAI, Anthropic, Ollama
  • Easily extendable to multi-modal usecases ๐Ÿ–ผ๏ธ
  • Smart parsing optimizes what is passed to external functions ๐Ÿง 
    • Traverses abstract syntax tree with sqlglot to minimize LLM function calls ๐ŸŒณ
  • Constrained decoding with guidance ๐Ÿš€
  • LLM function caching, built on diskcache ๐Ÿ”‘

Quickstart

import pandas as pd

from blendsql import blend, LLMMap, LLMQA, LLMJoin
from blendsql.db import Pandas
from blendsql.models import TransformersLLM

# Load model
model = TransformersLLM('Qwen/Qwen1.5-0.5B')

# Prepare our local database
db = Pandas(
  {
    "w": pd.DataFrame(
      (
        ['11 jun', 'western districts', 'bathurst', 'bathurst ground', '11-0'],
        ['12 jun', 'wallaroo & university nsq', 'sydney', 'cricket ground',
         '23-10'],
        ['5 jun', 'northern districts', 'newcastle', 'sports ground', '29-0']
      ),
      columns=['date', 'rival', 'city', 'venue', 'score']
    ),
    "documents": pd.DataFrame(
      (
        ['bathurst, new south wales',
         'bathurst /หˆbรฆฮธษ™rst/ is a city in the central tablelands of new south wales , australia . it is about 200 kilometres ( 120 mi ) west-northwest of sydney and is the seat of the bathurst regional council .'],
        ['sydney',
         'sydney ( /หˆsษชdni/ ( listen ) sid-nee ) is the state capital of new south wales and the most populous city in australia and oceania . located on australia s east coast , the metropolis surrounds port jackson.'],
        ['newcastle, new south wales',
         'the newcastle ( /หˆnuหkษ‘หsษ™l/ new-kah-sษ™l ) metropolitan area is the second most populated area in the australian state of new south wales and includes the newcastle and lake macquarie local government areas .']
      ),
      columns=['title', 'content']
    )
  }
)

# Write BlendSQL query
blendsql = """
SELECT * FROM w
WHERE city = {{
    LLMQA(
        'Which city is located 120 miles west of Sydney?',
        (SELECT * FROM documents WHERE content LIKE '%sydney%'),
        options='w::city'
    )
}}
"""
smoothie = blend(
  query=blendsql,
  db=db,
  ingredients={LLMMap, LLMQA, LLMJoin},
  default_model=model,
  # Optional args below
  infer_gen_constraints=True,
  verbose=True
)
print(smoothie.df)
# โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
# โ”‚ date   โ”‚ rival             โ”‚ city     โ”‚ venue           โ”‚ score   โ”‚
# โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
# โ”‚ 11 jun โ”‚ western districts โ”‚ bathurst โ”‚ bathurst ground โ”‚ 11-0    โ”‚
# โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
print(smoothie.meta.prompts)
# [
#   {
#       'answer': 'bathurst',
#       'question': 'Which city is located 120 miles west of Sydney?',
#       'context': [
#           {'title': 'bathurst, new south wales', 'content': 'bathurst /หˆbรฆฮธษ™rst/ is a city in the central tablelands of new south wales , australia . it is about...'},
#           {'title': 'sydney', 'content': 'sydney ( /หˆsษชdni/ ( listen ) sid-nee ) is the state capital of new south wales and the most populous city in...'}
#       ]
#    }
# ]

Citation

@article{glenn2024blendsql,
      title={BlendSQL: A Scalable Dialect for Unifying Hybrid Question Answering in Relational Algebra},
      author={Parker Glenn and Parag Pravin Dakle and Liang Wang and Preethi Raghavan},
      year={2024},
      eprint={2402.17882},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Acknowledgements

Special thanks to those below for inspiring this project. Definitely recommend checking out the linked work below, and citing when applicable!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

blendsql-0.0.26.tar.gz (108.7 kB view details)

Uploaded Source

Built Distribution

blendsql-0.0.26-py3-none-any.whl (130.1 kB view details)

Uploaded Python 3

File details

Details for the file blendsql-0.0.26.tar.gz.

File metadata

  • Download URL: blendsql-0.0.26.tar.gz
  • Upload date:
  • Size: 108.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for blendsql-0.0.26.tar.gz
Algorithm Hash digest
SHA256 dae4f91afa03cf1cc237ebf0ca2e11e2a3c983f2a3a25da4ed06544ba3cb8fc2
MD5 3060325f80e9fc443eac6e1a9d9cdccb
BLAKE2b-256 dc877fcb3cf4fe36fed7b2efdc39c4f72778c2fa70bff08656db8b730bff52bd

See more details on using hashes here.

File details

Details for the file blendsql-0.0.26-py3-none-any.whl.

File metadata

  • Download URL: blendsql-0.0.26-py3-none-any.whl
  • Upload date:
  • Size: 130.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for blendsql-0.0.26-py3-none-any.whl
Algorithm Hash digest
SHA256 123a426f91f223e440220ac1a93cbc4001369ecff26060610ea7d66d01d30f4c
MD5 f3abda90b1b5f0b8b2f7e10254c8ef0e
BLAKE2b-256 118453fb2d0537981f2438159c1ffd1fb53fc6261d38bd946ffb7ea14f3bc861

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page