End-to-End Local-First on premise Text-to-SQL Pipelines

These details have not been verified by PyPI

Project links

Homepage

Project description

PremSQL

End-to-End Local-First Text-to-SQL Pipelines

PremSQL is an open-source library designed to help developers create secure, fully local Text-to-SQL solutions using small language models. It provides all the essential tools to build and deploy end-to-end Text-to-SQL pipelines with customizable components, making it ideal for secure, autonomous AI-powered data analysis.

alt architecture

🚀 Features

Local-First: Avoid third-party closed-source providers and keep your data secure.
Customizable Datasets: Create, fine-tune, and evaluate models with built-in or custom datasets.
Robust Executors and Evaluators: Easily connect to databases and assess model performance.
Advanced Generators: Convert natural language prompts into executable SQL queries.
Error Handling and Self-Correction: Automatically correct SQL queries during inference.
Fine-Tuning Support: Fine-tune models with LoRA, QLoRA, or full fine-tuning strategies.
End-to-End Pipelines: Seamlessly integrate all components for autonomous data analysis.

Last but not the least, all the features are extendible for your very own customization and private data.

📚 Table of Contents

PremSQL

🛠️ Installation

PremSQL requires Python 3.8 or higher. Install the library via pip:

pip install premsql

🚀 Quickstart

Here’s a quick example of how to use PremSQL to generate SQL queries from natural language inputs:

from premsql.pipelines import SimpleText2SQLAgent
from premsql.generators import Text2SQLGeneratorHF
from premsql.executors import SQLiteExecutor

# Provide a SQLite file here or see documentation for more customization
dsn_or_db_path = "./data/db/california_schools.sqlite"

agent = SimpleText2SQLAgent(
    dsn_or_db_path=dsn_or_db_path,
    generator=Text2SQLGeneratorHF(
        model_or_name_or_path="premai-io/prem-1B-SQL",
        experiment_name="simple_pipeline",
        device="cuda:0",
        type="test"
    ),
)

question = "please list the phone numbers of the direct charter-funded schools that are opened after 2000/1/1"

response = agent.query(question)
response["table"]

📦 Components Overview

Datasets

PremSQL provides a simple API to use various pre-processed datasets for Text-to-SQL tasks. Text-to-SQL is complex as it requires data dependencies on databases and tables. The premsql datasets help streamline this by providing easy access to datasets and enabling you to create your own datasets with private databases.

Currently, the following datasets are readily available:

Example usage:

from premsql.datasets import Text2SQLDataset

bird_dataset = Text2SQLDataset(
    dataset_name='bird', split="train", force_download=False,
    dataset_folder="/path/to/your/data" # change this to the path where you want to store the dataset
)

Generators

PremSQL generators are responsible for converting natural language questions into SQL queries. Think of these as modular inference APIs specific to text-to-SQL. You can integrate various third-party APIs, models, or custom pipelines.

Example:

from premsql.generators import Text2SQLGeneratorHF
from premsql.datasets import Text2SQLDataset

# Define a dataset
dataset = bird_dataset = Text2SQLDataset(
    dataset_name='bird', split="train", force_download=False,
    dataset_folder="/path/to/dataset"
).setup_dataset(num_rows=10, num_fewshot=3)

# Define a generator 
generator = Text2SQLGeneratorHF(
    model_or_name_or_path="premai-io/prem-1B-SQL",
    experiment_name="test_generators",
    device="cuda:0",
    type="test"
)

# Generate on the full dataset
responses = generator.generate_and_save_results(
    dataset=bird_dataset,
    temperature=0.1,
    max_new_tokens=256
)

print(responses)

Results are saved in the experiment_path as predict.json.

We also support execution guided decoding. This strategy executes the generated SQL against the DB and, if it fails, uses the error message for correction, repeating until it gets a valid result or the retries run out.

alt text

A quick glance on execution guided decoding:

from premsql.executors import SQLiteExecutor

executor = SQLiteExecutor()
response = generator.generate_and_save_results(
    dataset=bird_dataset,
    temperature=0.1,
    max_new_tokens=256,
    force=True,
    executor=executor,
    max_retries=5 # this is optional (default is already set to 5)
)

Executors

An executor executes the generated SQL queries against the database and fetches the results. It is a crucial component in the Text-to-SQL pipeline, as it ensures that the generated SQL queries are valid and return the expected results. PremSQL supports a native executor for SQLite databases and also supports LangChain's SQLDatabase as an executor.

Example usage

from premsql.executors import SQLiteExecutor

# Instantiate the executor
executor = SQLiteExecutor()

# Set a sample dataset path 
db_path = "./data/db/california_schools.sqlite"
sql = 'SELECT movie_title FROM movies WHERE movie_release_year = 1945 ORDER BY movie_popularity DESC LIMIT 1'

# execute the SQL
result = executor.execute_sql(
    sql=sql,
    dsn_or_db_path=db_path
)

print(result)

This will show:

{'result': [('Brief Encounter',)], 'error': None, 'execution_time': 0.03717160224914551}

Evaluators

Executors connect to databases and execute SQL, while evaluators assess the performance of your models against predefined metrics like Execution Accuracy (EX) and Valid Efficiency Score (VES).

Example Usage:

from premsql.executors import SQLiteExecutor
from premsql.evaluator import Text2SQLEvaluator

# Define the executor 
executor = SQLiteExecutor()

# Define the evaluator 
evaluator = Text2SQLEvaluator(
    executor=executor,
    experiment_path=generator.experiment_path
)

# Now evaluate the models 
results = evaluator.execute(
    metric_name="accuracy",
    model_responses=response,
    filter_by="db_id",
    meta_time_out=10
)

print(results)

Using the filter_by option to filter results by db_id allows you to see overall accuracy and its distribution across different databases. If a key like difficulty is available, it will show performance distribution over various difficulty levels. Filtering evaluations by available keys helps in analyzing and understanding model performance empirically. Below is a visualization of model performance across different databases based on the applied filters.

alt text

Error Handling

Error-handling prompts are crucial for refining model performance, especially in complex tasks like Text-to-SQL generation. The prompts help the model learn how to handle errors by providing additional context and guidance based on past mistakes. By training on these prompts, the model can self-correct during inference, improving the quality of its output.

Example Error Correction Prompt:

{existing_prompt}

# Generated SQL: {sql}

## Error Message

{error_msg}

Carefully review the original question and error message, then rewrite the SQL query to address the identified issues.

Tuner

premsql tuner is a module designed to fine-tune models specifically for text-to-SQL tasks. The module offers multiple ways of fine-tuning, providing flexibility based on the needs of your project.

Supported Fine-Tuning Methods

Full Fine-Tuning: Standard fine-tuning of the model with all its parameters.
PEFT using LoRA: Parameter-Efficient Fine-Tuning with LoRA (Low-Rank Adaptation) for faster and efficient training.
PEFT using QLoRA: Another PEFT approach using Quantized LoRA, optimizing resource use during training.

In addition to these methods, you can create custom fine-tuning pipelines using the components and tools provided by premsql.

Pipelines

PremSQL pipelines are end-to-end solutions that connect to your database and generate SQL queries from natural language questions, providing complete control over your data analysis workflows.

Example Simple Pipeline:

from premsql.pipelines.simple import SimpleText2SQLAgent
from premsql.generators.huggingface import Text2SQLGeneratorHF
from langchain_community.utilities.sql_database import SQLDatabase
from premsql.utils import convert_sqlite_path_to_dsn

# Change it some SQLite database path or any other DB URI connection.
dsn_or_db_path = convert_sqlite_path_to_dsn(
  "../data/bird/test/test_databases/california_schools/california_schools.sqlite"   
)
db = SQLDatabase.from_uri(dsn_or_db_path)

agent = SimpleText2SQLAgent(
    dsn_or_db_path=db,
    generator=Text2SQLGeneratorHF(
        model_or_name_or_path="premai-io/prem-1B-SQL",
        experiment_name="test_nli",
        device="cuda:0",
        type="test"
    ),
)

response = agent.query(
    question="please list the phone numbers of the direct charter-funded schools that are opened after 2000/1/1",
)

response["table"]

🤝 Contributing

We welcome contributions from the community! If you’d like to contribute to PremSQL, please follow these guidelines:

Fork the repository and clone your fork.
Create a new branch for your feature or bug fix.
Make your changes and ensure the code passes all tests.
Submit a pull request with a clear description of your changes.

For detailed guidelines, please check the CONTRIBUTING.md.

🛣️ Roadmap

PremSQL is continuously evolving, with exciting features planned for future releases:

Synthesizer Component: A tool to generate synthetic datasets from private data, enabling fully private text-to-SQL workflows and enhancing model fine-tuning capabilities.
Agentic Pipelines with Function-Calling Features: Advanced pipelines with graph plotting, natural language analysis, and other enhancements to provide a more versatile and powerful system.
Training Better Small Language Models: Ongoing training and optimization of small language models specifically tailored to PremSQL’s unique requirements, ensuring efficient and effective performance in text-to-SQL tasks.
Optimization of Generators and Executors: Improvements to enhance the robustness of existing components, including parallel processing to speed up generation and execution times.
Standard Tests and Stability Improvements: Introduction of comprehensive tests for greater stability of the library and the planned rollout of a simple user interface to improve the overall user experience.

Stay tuned for these exciting updates! We encourage you to contribute and provide feedback to help us shape the future of PremSQL.

📝 License

PremSQL is licensed under the MIT License. See the LICENSE file for more information.

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

0.2.10

Nov 11, 2024

0.2.0b1 pre-release

Nov 11, 2024

0.1.12

Nov 11, 2024

0.1.11

Nov 7, 2024

0.1.10

Nov 7, 2024

0.1.9

Nov 7, 2024

This version

0.1.1

Sep 9, 2024

0.1.0

Sep 9, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

premsql-0.1.1.tar.gz (28.0 kB view details)

Uploaded Sep 9, 2024 Source

Built Distribution

premsql-0.1.1-py3-none-any.whl (36.0 kB view details)

Uploaded Sep 9, 2024 Python 3

File details

Details for the file premsql-0.1.1.tar.gz.

File metadata

Download URL: premsql-0.1.1.tar.gz
Upload date: Sep 9, 2024
Size: 28.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for premsql-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`6051ab05ab4ae5b776c532048d9d9cc60aa472403724469e6a384f726637aedf`
MD5	`8cbeabf5a630ef7c078877bd9ecde346`
BLAKE2b-256	`816c15e4a26a519416797830c84038eb8b80aee218e61ed6f94ababb4e4d6eb8`

See more details on using hashes here.

File details

Details for the file premsql-0.1.1-py3-none-any.whl.

File metadata

Download URL: premsql-0.1.1-py3-none-any.whl
Upload date: Sep 9, 2024
Size: 36.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for premsql-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3d8025d1b1ecb5e27b83fddd2d3815d61f8e9dcf6ed121b74405d2dc1150028e`
MD5	`c47edf1234eaec5449983c4107ddaf71`
BLAKE2b-256	`a76947615fc511529c8bea9b1e31fd341475972d087c3f7199bf8257fde5ee65`