
Open Source Alternative for Building End-to-End Vector Search Applications without OpenAI & Pinecone


Overview

Python SDK is designed to facilitate the development of scalable vector search applications on PostgreSQL databases. With this SDK, you can seamlessly manage the database tables related to documents, text chunks, text splitters, LLMs (Large Language Models), and embeddings. By leveraging the SDK's capabilities, you can efficiently index LLM embeddings using PgVector for fast and accurate queries.

Key Features

  • Automated Database Management: With the SDK, you can easily handle the management of database tables related to documents, text chunks, text splitters, LLMs, and embeddings. This automated management system simplifies the process of setting up and maintaining your vector search application's data structure.

  • Embedding Generation from Open Source Models: The Python SDK provides the ability to generate embeddings using hundreds of open source models. These models, trained on vast amounts of data, capture the semantic meaning of text and enable powerful analysis and search capabilities.

  • Flexible and Scalable Vector Search: The Python SDK empowers you to build flexible and scalable vector search applications. It integrates seamlessly with PgVector, a PostgreSQL extension specifically designed for vector-based indexing and querying. By leveraging these indices, you can perform advanced searches, rank results by relevance, and retrieve accurate and meaningful information from your database.

Use Cases

Embeddings, the core concept of the Python SDK, find applications in various scenarios, including:

  • Search: Embeddings are commonly used for search functionality, where results are ranked by relevance to a query string. By comparing the embeddings of query strings and documents, you can retrieve search results in order of their similarity or relevance (see the sketch after this list).

  • Clustering: With embeddings, you can group text strings by similarity, enabling clustering of related data. By measuring the similarity between embeddings, you can identify clusters or groups of text strings that share common characteristics.

  • Recommendations: Embeddings play a crucial role in recommendation systems. By identifying items with related text strings based on their embeddings, you can provide personalized recommendations to users.

  • Anomaly Detection: Anomaly detection involves identifying outliers or anomalies that have little relatedness to the rest of the data. Embeddings can aid in this process by quantifying the similarity between text strings and flagging outliers.

  • Classification: Embeddings are utilized in classification tasks, where text strings are classified based on their most similar label. By comparing the embeddings of text strings and labels, you can classify new text strings into predefined categories.
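
To make the ranking idea concrete, here is a minimal, self-contained sketch of similarity-based search. It is illustrative only and not part of the SDK; the toy 4-dimensional vectors stand in for real embeddings, which typically have hundreds of dimensions:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the two vectors, normalized by their lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings; a real model such as intfloat/e5-small produces far larger vectors.
query = np.array([0.1, 0.9, 0.2, 0.0])
documents = {
    "doc_a": np.array([0.1, 0.8, 0.3, 0.1]),
    "doc_b": np.array([0.9, 0.1, 0.0, 0.2]),
}

# Rank documents by similarity to the query, most similar first.
ranked = sorted(documents.items(), key=lambda kv: cosine_similarity(query, kv[1]), reverse=True)
print(ranked)  # doc_a ranks above doc_b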

How the Python SDK Works

The Python SDK streamlines the development of vector search applications by abstracting away the complexities of database management and indexing. Here's an overview of how the SDK works:

  • Document and Text Chunk Management: The SDK provides a convenient interface to create, update, and delete documents and their corresponding text chunks. You can easily organize and structure your text data within the PostgreSQL database.

  • Open Source Model Integration: With the SDK, you can seamlessly incorporate a wide range of open source models to generate high-quality embeddings. These models capture the semantic meaning of text and enable powerful analysis and search capabilities.

  • Embedding Indexing: The Python SDK utilizes the PgVector extension to efficiently index the embeddings generated by the open source models. This indexing process optimizes search performance and allows for fast and accurate retrieval of relevant results.

  • Querying and Search: Once the embeddings are indexed, you can perform vector-based searches on the documents and text chunks stored in the PostgreSQL database. The SDK provides intuitive methods for executing queries and retrieving search results.

Quickstart

Follow the steps below to quickly get started with the Python SDK for building scalable vector search applications on PostgresML databases.

Prerequisites

Before you begin, make sure you have the following:

  • PostgresML Database: Ensure you have a PostgresML database version >2.3.1. You can spin up a database using Docker or sign up for a free GPU-powered database. Set the PGML_CONNECTION environment variable to the connection string of your PostgresML database (see the snippet after this list). If it is not set, the SDK will use the default connection string for your local installation: postgres://postgres@127.0.0.1:5433/pgml_development.

  • Python version >=3.8.1

  • PostgreSQL client library (libpq)

    • Ubuntu: sudo apt install libpq-dev
    • Centos/Fedora/Cygwin/Babun: sudo yum install libpq-devel
    • Mac: brew install postgresql
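
If your database is not local, one way to point the SDK at it is to set PGML_CONNECTION before creating the Database instance (a shell export works equally well). The connection string below is a placeholder; substitute your own credentials:

import os

# Placeholder values; replace with your actual PostgresML connection details.
os.environ["PGML_CONNECTION"] = "postgres://user:password@host:port/database"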

Installation

To install the Python SDK, use pip:

pip install pgml

Sample Code

Once you have the Python SDK installed, you can use the following sample code as a starting point for your vector search application:

from pgml import Database
import os
import json
from datasets import load_dataset
from time import time
from rich import print as rprint
import asyncio

async def main():
    # Use PGML_CONNECTION if set, otherwise fall back to the local default.
    local_pgml = "postgres://postgres@127.0.0.1:5433/pgml_development"
    conninfo = os.environ.get("PGML_CONNECTION", local_pgml)

    # Initialize a connection pool to the PostgresML database.
    db = Database(conninfo)

    # Create the collection if it doesn't exist, or retrieve it if it does.
    collection_name = "test_collection"
    collection = await db.create_or_get_collection(collection_name)

Explanation:

  • The code imports the necessary modules and packages, including pgml.Database, os, json, datasets, time, rich.print, and asyncio.
  • It defines the local_pgml variable with the default local connection string, and retrieves the connection information from the PGML_CONNECTION environment variable or uses the default if not set.
  • An instance of the Database class is created by passing the connection information.
  • The create_or_get_collection method retrieves the collection named test_collection if it already exists; otherwise, it creates a new one.

Continuing within async def main():

    data = load_dataset("squad", split="train")
    data = data.to_pandas()
    data = data.drop_duplicates(subset=["context"])

    documents = [
        {'id': r['id'], "text": r["context"], "title": r["title"]}
        for r in data.to_dict(orient="records")
    ]

    await collection.upsert_documents(documents[:200])
    await collection.generate_chunks()
    await collection.generate_embeddings()

Explanation:

  • The code loads the "squad" dataset, converts it to a pandas DataFrame, and drops any duplicate context values.
  • It creates a list of dictionaries representing the documents to be indexed, with each dictionary containing the document's ID, text, and title.
  • The upsert_documents method is called to insert or update the first 200 documents in the collection.
  • The generate_chunks method splits the documents into smaller text chunks for efficient indexing and search.
  • The generate_embeddings method generates embeddings for the documents in the collection.

Continuing within async def main():

    # Time a vector search for the query and print the top 2 results.
    start = time()
    results = await collection.vector_search("Who won 20 grammy awards?", top_k=2)
    rprint(json.dumps(results, indent=2))
    rprint("Query time: %0.3f seconds" % (time() - start))

    # Archive the collection to free up resources.
    await db.archive_collection(collection_name)

Explanation:

  • The code initializes a timer using time() to measure the query time.
  • The vector_search method is called to perform a vector-based search on the collection. The query string is Who won 20 grammy awards?, and the top 2 results are requested.
  • The search results are printed using rprint and formatted as JSON with indentation.
  • The query time is calculated by subtracting the start time from the current time.
  • Finally, the archive_collection method is called to archive the collection and free up resources in the PostgresML database.

Finally, call the main function inside an asyncio event loop:

if __name__ == "__main__":
    asyncio.run(main())    

Running the Code

Save the code above as vector_search.py. Open a terminal or command prompt and navigate to the directory where the file is saved.

Execute the following command:

python vector_search.py

You should see the search results and the query time printed in the terminal. As you can see, our vector search engine found the right text chunks containing the answer we were looking for.

[
  {
    "score": 0.8423336843624225,
    "chunk": "Beyonc\u00e9 has won 20 Grammy Awards, both as a solo artist and member of Destiny's Child, making her the second most honored female artist by the Grammys, behind Alison Krauss and the most nominated woman in Grammy Award history with 52 nominations. \"Single Ladies (Put a Ring on It)\" won Song of the Year in 2010 while \"Say My Name\" and \"Crazy in Love\" had previously won Best R&B Song. Dangerously in Love, B'Day and I Am... Sasha Fierce have all won Best Contemporary R&B Album. Beyonc\u00e9 set the record for the most Grammy awards won by a female artist in one night in 2010 when she won six awards, breaking the tie she previously held with Alicia Keys, Norah Jones, Alison Krauss, and Amy Winehouse, with Adele equaling this in 2012. Following her role in Dreamgirls she was nominated for Best Original Song for \"Listen\" and Best Actress at the Golden Globe Awards, and Outstanding Actress in a Motion Picture at the NAACP Image Awards. Beyonc\u00e9 won two awards at the Broadcast Film Critics Association Awards 2006; Best Song for \"Listen\" and Best Original Soundtrack for Dreamgirls: Music from the Motion Picture.",
    "metadata": {
      "title": "Beyonc\u00e9"
    }
  },
  {
    "score": 0.8210568000806665,
    "chunk": "A self-described \"modern-day feminist\", Beyonc\u00e9 creates songs that are often characterized by themes of love, relationships, and monogamy, as well as female sexuality and empowerment. On stage, her dynamic, highly choreographed performances have led to critics hailing her as one of the best entertainers in contemporary popular music. Throughout a career spanning 19 years, she has sold over 118 million records as a solo artist, and a further 60 million with Destiny's Child, making her one of the best-selling music artists of all time. She has won 20 Grammy Awards and is the most nominated woman in the award's history. The Recording Industry Association of America recognized her as the Top Certified Artist in America during the 2000s decade. In 2009, Billboard named her the Top Radio Songs Artist of the Decade, the Top Female Artist of the 2000s and their Artist of the Millennium in 2011. Time listed her among the 100 most influential people in the world in 2013 and 2014. Forbes magazine also listed her as the most powerful female musician of 2015.",
    "metadata": {
      "title": "Beyonc\u00e9"
    }
  }
]

Usage

High-level Description

The Python SDK provides a set of functionalities to build scalable vector search applications on PostgreSQL databases. It enables users to create a collection, which represents a schema in the database, that stores tables for documents, chunks, models, splitters, and embeddings. The Collection class in the SDK handles all operations related to these tables, allowing users to interact with the collection and perform various tasks.

Connect to Database

local_pgml = "postgres://postgres@127.0.0.1:5433/pgml_development"

conninfo = os.environ.get("PGML_CONNECTION", local_pgml)
db = Database(conninfo)

This initializes a connection pool to the database and creates a table named pgml.collections if it does not already exist. By default, it connects to the local PostgresML database and maintains one connection in the connection pool.

Create or Get a Collection

collection_name = "test_collection"
collection = await db.create_or_get_collection(collection_name)

This creates a new schema in a PostgreSQL database if it does not already exist and creates tables and indices for documents, chunks, models, splitters, and embeddings.

Upsert Documents

await collection.upsert_documents(documents)

This method inserts or updates documents in the collection's documents table based on their ID, text, and metadata.
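
As a sketch, a document is a dictionary with an id and a text field; consistent with the quickstart, where the title field shows up under metadata in the search results, we assume any additional keys are stored as document metadata:

documents = [
    {
        "id": "doc-1",
        "text": "PostgresML brings machine learning to PostgreSQL.",
        # Extra keys such as title are assumed to be stored as metadata,
        # matching the quickstart output where "title" appears under "metadata".
        "title": "PostgresML Overview",
    },
]
await collection.upsert_documents(documents)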

Generate Chunks

await collection.generate_chunks(splitter_id = 1)

This method generates chunks of text from unchunked documents using the specified text splitter. By default, it uses RecursiveCharacterTextSplitter with default parameters. splitter_id is optional; you can pass a splitter_id corresponding to a splitter you have registered (see register_text_splitter below, and the sketch that follows).
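
Putting the two calls together, here is a hypothetical flow that registers a custom splitter and chunks with it. It assumes register_text_splitter returns the id of the newly registered splitter; consult the API reference to confirm the return value:

# Assumption: register_text_splitter returns the new splitter's id.
splitter_id = await collection.register_text_splitter(
    splitter_name="recursive_character",
    splitter_params={"chunk_size": "100", "chunk_overlap": "20"},
)
await collection.generate_chunks(splitter_id=splitter_id)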

Generate Embeddings

await collection.generate_embeddings(model_id = 1, splitter_id = 1)

This method generates embeddings using the chunks of text. By default, it uses the intfloat/e5-small embedding model. model_id is optional; you can pass a model_id corresponding to a model you have registered, along with a splitter_id. See register_model below.

Vector Search

results = await collection.vector_search("Who won 20 grammy awards?", top_k=2, model_id = 1, splitter_id = 1)

This method converts the input query into an embedding and searches the embeddings table for the nearest matches. You can change the number of results using top_k. You can also pass the specific splitter_id and model_id that were used for chunking and generating embeddings.
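
Based on the result format shown in the quickstart output (score, chunk, and metadata), you might post-process results along these lines:

for result in results:
    # Each result carries a relevance score, the matching text chunk,
    # and the metadata of the source document.
    print(round(result["score"], 4), result["metadata"]["title"])
    print(result["chunk"][:100], "...")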

Register Model

await collection.register_model(model_name="hkunlp/instructor-xl", model_params={"instruction": "Represent the Wikipedia document for retrieval: "})

This function registers a model in the database, creating a record if one does not already exist. model_name is the name of the open source HuggingFace model being registered, and model_params is a dictionary of parameters for configuring the model; it can be empty if no parameters are needed.
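
A hypothetical end-to-end flow with a registered model is sketched below. It assumes register_model returns the id of the newly registered model; check the API reference to confirm:

# Assumption: register_model returns the new model's id.
model_id = await collection.register_model(
    model_name="hkunlp/instructor-xl",
    model_params={"instruction": "Represent the Wikipedia document for retrieval: "},
)
# Re-embed the chunks with the registered model, then search with it.
await collection.generate_embeddings(model_id=model_id)
results = await collection.vector_search(
    "Who won 20 grammy awards?", top_k=2, model_id=model_id
)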

Register Text Splitter

await collection.register_text_splitter(splitter_name="recursive_character",splitter_params={"chunk_size": "100","chunk_overlap": "20"})

This function registers a text splitter in the database, creating a record if one doesn't already exist. The following LangChain splitters are supported:

SPLITTERS = {
    "character": CharacterTextSplitter,
    "latex": LatexTextSplitter,
    "markdown": MarkdownTextSplitter,
    "nltk": NLTKTextSplitter,
    "python": PythonCodeTextSplitter,
    "recursive_character": RecursiveCharacterTextSplitter,
    "spacy": SpacyTextSplitter,
}

Developer Setup

This Python library is generated from our core rust-sdk. Please check the rust-sdk documentation for developer setup.

API Reference

Roadmap

  • Enable filters on document metadata in vector_search. Issue
  • text_search functionality on documents using Postgres text search. Issue
  • hybrid_search functionality that does a combination of vector_search and text_search in an order specified by the user. Issue
  • Ability to call and manage OpenAI embeddings for comparison purposes. Issue
  • Save vector_search history for downstream monitoring of model performance. Issue
  • Perform chunking on the DB with multiple langchain splitters. Issue
