
The Python Unibase. Build your vector database, from a local library to a database that scales in the cloud.


the unibasedb vector database component


unibase is a vector database offering a comprehensive suite of CRUD (Create, Read, Update, Delete) operations and scalability options, including sharding and replication. It can be deployed in a range of environments, from local development to on-premise servers and the cloud.

unibasedb builds on DocArray for vector logic and adds a serving layer on top, giving it a lean, Pythonic design tailored for performance without unnecessary complexity.

🚀 Install

pip install unibasedb

🎯 Getting started with unibase locally

This example uses unibasedb to build a simple Book Recommendation Agent that retrieves books similar to a user's query.

Step 1: Define a Document Schema

We begin by defining the schema for our data using DocArray. In this example, our data consists of books with attributes such as title, author, description, and a vector embedding.

from docarray import BaseDoc
from docarray.typing import NdArray

# Define a schema for books
class BookDoc(BaseDoc):
    title: str  # Title of the book
    author: str  # Author of the book
    description: str  # A brief description of the book
    embedding: NdArray[128]  # 128-dimensional embedding for the book

This schema lays the foundation for how data will be stored and queried in unibasedb.


Step 2: Initialize the Database and Index Data

Next, we initialize a database and populate it with some simulated book data.

from docarray import DocList
import numpy as np
from unibasedb import InMemoryExactNNUnibase

# Initialize the database
db = InMemoryExactNNUnibase[BookDoc](workspace='./book_workspace')

# Generate book data and index it
book_list = [
    BookDoc(
        title=f"Book {i}",
        author=f"Author {chr(65 + i % 26)}",  # Rotate through letters A-Z
        description=f"A fascinating story of Book {i}.",
        embedding=np.random.rand(128)  # Simulated embedding
    )
    for i in range(100)  # Create 100 books
]
db.index(inputs=DocList[BookDoc](book_list))

Explanation:

  1. Database Initialization:
    • We use InMemoryExactNNUnibase to create an in-memory database. The workspace parameter specifies where data is stored.
  2. Indexing Data:
    • A list of 100 books with placeholder data (numbered titles, authors cycling through the letters A-Z, and random embeddings) is created and indexed into the database.

Step 3: Simulate a Book Recommendation Agent

We create a simple agent that accepts a user query and retrieves similar books from the database.

# Step 3: Simulate an AI agent
class BookRecommendationAgent:
    def __init__(self, database):
        self.database = database

    def recommend_books(self, query_text: str, query_embedding: np.ndarray, limit=5):
        # Simulate reasoning: Query the database for recommendations
        query_doc = BookDoc(
            title="User Query",
            author="N/A",
            description=query_text,
            embedding=query_embedding
        )
        results = self.database.search(inputs=DocList[BookDoc]([query_doc]), limit=limit)
        
        # Process results
        recommendations = [
            {
                "title": result.title,
                "author": result.author,
                "description": result.description
            }
            for result in results[0].matches
        ]
        return recommendations

Explanation:

  • The BookRecommendationAgent encapsulates logic for querying the database and processing results.
  • It takes a user's query (text and embedding) and searches the database for similar books.

Step 4: Query the Agent with User Input

Finally, we simulate user input and use the agent to retrieve recommendations.

# Step 4: Use the agent
agent = BookRecommendationAgent(db)

# Simulated user input
user_query = "A gripping tale of adventure and discovery."
user_embedding = np.random.rand(128)  # Simulated embedding for the query

recommendations = agent.recommend_books(query_text=user_query, query_embedding=user_embedding, limit=3)

# Display recommendations
print("Recommended books:")
for i, rec in enumerate(recommendations, start=1):
    print(f"{i}. Title: {rec['title']}, Author: {rec['author']}, Description: {rec['description']}")

Explanation:

  1. User Input:
    • The user provides a query (e.g., "A gripping tale of adventure and discovery") and a simulated embedding.
  2. Recommendations:
    • The agent queries the database and retrieves the top 3 similar books based on the embedding.
  3. Display Results:
    • The results are formatted and printed for the user.
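
Because the walkthrough uses random vectors, the returned books are not actually related to the query text. As a minimal sketch of how text could be mapped to a deterministic 128-dimensional vector, the hypothetical toy_embed helper below hashes each word into a bucket and normalizes the counts. It is an assumption for illustration only: a stand-in for a real embedding model, not part of unibasedb.

```python
import hashlib
import math
from typing import List


def toy_embed(text: str, dim: int = 128) -> List[float]:
    """Hypothetical helper: hash each word into one of `dim` buckets.

    This is a feature-hashing toy, not a real embedding model and not
    part of unibasedb. Identical text always yields an identical vector,
    and texts that share words end up with overlapping non-zero buckets.
    """
    vector = [0.0] * dim
    for word in text.lower().split():
        # Use a stable hash; the built-in hash() is salted per process
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vector[bucket] += 1.0
    # Scale to unit length so dot products equal cosine similarities
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


query_vec = toy_embed("A gripping tale of adventure and discovery.")
print(len(query_vec))
```

To use it in the example, the result could be wrapped as np.array(toy_embed(user_query)) and passed as query_embedding, with the book embeddings built the same way from each description.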

Getting started with unibase as a service

unibasedb is designed to be easily served as a service, supporting the gRPC, HTTP, and WebSocket communication protocols.

Server Side

This example serves a Book Recommendation Database. On the server side, start the service as follows.

from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np

from unibasedb import InMemoryExactNNUnibase

# Define a Document schema for book information
class BookDoc(BaseDoc):
    title: str
    author: str
    description: str
    embedding: NdArray[128]  # Example: A 128-dimensional vector representing the book's content

# Initialize the database
db = InMemoryExactNNUnibase[BookDoc](workspace='./books_workspace')

# Generate fake data for books and index it
book_list = [
    BookDoc(
        title=f"Book {i}",
        author=f"Author {chr(65 + i % 26)}",
        description=f"A fascinating description of Book {i}.",
        embedding=np.random.rand(128)  # Random embeddings for demonstration
    )
    for i in range(100)  # Simulate 100 books
]

db.index(inputs=DocList[BookDoc](book_list))

# Serve the database as a gRPC service
if __name__ == '__main__':
    print("Starting the Book Recommendation Database...")
    with db.serve(protocol='grpc', port=12345, replicas=1, shards=1) as service:
        print("Book Recommendation Database is running on gRPC://localhost:12345")
        print("You can now query the database from a client.")
        service.block()

This script starts unibase as a service on port 12345, using the gRPC protocol with 1 replica and 1 shard.

Client Side

Once the Book Recommendation Database is running as a service on the server, you can access it from a client application. Here's how to query the service for recommendations:

from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np
from unibasedb import Client

# Define the same schema used on the server
class BookDoc(BaseDoc):
    title: str
    author: str
    description: str
    embedding: NdArray[128]

# Instantiate a client connected to the server
# Replace '0.0.0.0' with the actual IP address of the server
client = Client[BookDoc](address='grpc://0.0.0.0:12345')

# Create a query book
query_book = BookDoc(
    title="User Query",
    author="N/A",
    description="An epic story of friendship and courage.",
    embedding=np.random.rand(128)  # Simulated query embedding
)

# Perform a search query
results = client.search(inputs=DocList[BookDoc]([query_book]), limit=5)

# Display the search results
print("Top 5 similar books:")
for match in results[0].matches:
    print(f"Title: {match.title}, Author: {match.author}, Description: {match.description}")

Explanation of the Code:

  1. Schema Definition:

    • The client defines the same BookDoc schema used on the server to ensure compatibility.
  2. Connecting to the Server:

    • The Client connects to the grpc://0.0.0.0:12345 address. Replace 0.0.0.0 with the server's actual IP address if running on a remote machine.
  3. Query Creation:

    • The query_book represents a user's input, including a description and an embedding that simulates the query vector.
  4. Performing the Search:

    • The search method sends the query to the server and retrieves the top 5 similar books.
  5. Displaying Results:

    • The retrieved matches are printed, showing the titles, authors, and descriptions of the recommended books.

Advanced Topics

What is a vector database?

A vector database is a specialized type of database designed to store, manage, and retrieve vector embeddings—numerical representations of data such as text, images, audio, or other complex objects. Unlike traditional databases that rely on exact matches or keyword searches, vector databases excel at performing similarity searches. They use advanced algorithms to find data points that are semantically or contextually similar to a given query, even if the exact match doesn't exist.
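
To make the idea of similarity search concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and item names are made up for illustration; real embeddings come from a model and have many more dimensions. The query matches no stored item exactly, yet the ranking still puts the semantically closest items first.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional "embeddings" (illustrative values only)
catalog = {
    "sea voyage novel": [0.9, 0.1, 0.0],
    "mountain expedition memoir": [0.8, 0.3, 0.1],
    "tax law handbook": [0.0, 0.1, 0.9],
}

# Stands for an "adventure story" query; it equals none of the stored vectors
query = [1.0, 0.2, 0.0]

# Rank every item by similarity to the query, most similar first
ranked = sorted(
    catalog,
    key=lambda name: cosine_similarity(query, catalog[name]),
    reverse=True,
)
print(ranked)
```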

CRUD Support

Both local library usage and client-server interactions in unibase share the same API, providing index, search, update, and delete functionalities:

  • Index: Accepts a DocList to index.
  • Search: Takes a DocList of batched queries or a single BaseDoc as a query. Returns results with matches and scores, sorted by relevance.
  • Delete: Accepts a DocList of documents to remove from the index. Only the id attribute is required, so ensure you track indexed IDs for deletion.
  • Update: Replaces existing documents in the index with new attributes and payloads from the input DocList.
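
The semantics of the four operations can be sketched with a small stand-in class. ToyIndex below is hypothetical: it works on plain dicts and is not unibasedb's implementation or its exact method signatures. It only mirrors the behavior described above: search returns matches with scores sorted by relevance, update replaces documents by id, and delete needs nothing but the id.

```python
class ToyIndex:
    """Hypothetical in-memory stand-in that mimics the four operations.

    Documents are plain dicts with an "id" and an "embedding"; this is
    not unibasedb's implementation.
    """

    def __init__(self):
        self._docs = {}

    def index(self, docs):
        for doc in docs:
            self._docs[doc["id"]] = doc

    def search(self, query, limit=10):
        # Score by dot product and return (doc, score) pairs, best first
        scored = [
            (doc, sum(q * d for q, d in zip(query["embedding"], doc["embedding"])))
            for doc in self._docs.values()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:limit]

    def update(self, docs):
        # Replace stored documents that share an id with the input
        for doc in docs:
            if doc["id"] in self._docs:
                self._docs[doc["id"]] = doc

    def delete(self, docs):
        # Only the id is needed to remove a document
        for doc in docs:
            self._docs.pop(doc["id"], None)


index = ToyIndex()
index.index([
    {"id": "a", "title": "Book A", "embedding": [1.0, 0.0]},
    {"id": "b", "title": "Book B", "embedding": [0.0, 1.0]},
])
index.update([{"id": "b", "title": "Book B (2nd ed.)", "embedding": [0.9, 0.1]}])
index.delete([{"id": "a"}])

matches = index.search({"embedding": [1.0, 0.0]}, limit=5)
print([(doc["title"], score) for doc, score in matches])
```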

Service Endpoint Configuration

You can configure and serve unibase with the following parameters:

  • Protocol: The communication protocol, which can be gRPC, HTTP, websocket, or a combination. Default is gRPC.
  • Port: The port(s) for accessing the service. Can be a single port or a list of ports for multiple protocols. Default is 8081.
  • Workspace: The directory where the database persists its data. Default is the current directory (.).

Scaling Your Database

unibase supports two key scaling parameters for deployment:

  • Shards: The number of data shards. Each document is indexed in exactly one shard, which reduces latency. Search queries are sent to all shards, and the results are merged.
  • Replicas: The number of database replicas. Using the Raft consensus algorithm, unibase synchronizes indexes across replicas, improving availability and search throughput.
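
The shard behavior described above (each document lives in one shard, queries fan out to all shards, partial results are merged) can be sketched in plain Python. The routing rule in shard_for is an assumption made for this sketch; how unibase actually assigns documents to shards is not specified here.

```python
import heapq


def shard_for(doc_id: str, num_shards: int) -> int:
    # Hypothetical routing rule: sum of the id's bytes modulo the shard count
    return sum(doc_id.encode("utf-8")) % num_shards


def search_shard(shard, query, limit):
    # Each shard scores only its own documents (dot product as the score)
    scored = [
        (sum(q * d for q, d in zip(query, emb)), doc_id)
        for doc_id, emb in shard.items()
    ]
    return heapq.nlargest(limit, scored)


def sharded_search(shards, query, limit):
    # Fan the query out to every shard, then merge the partial results
    partial = [hit for shard in shards for hit in search_shard(shard, query, limit)]
    return heapq.nlargest(limit, partial)


num_shards = 2
shards = [{} for _ in range(num_shards)]
docs = {"a": [1.0, 0.0], "b": [0.8, 0.2], "c": [0.0, 1.0], "d": [0.5, 0.5]}
for doc_id, emb in docs.items():
    # Each document is stored in exactly one shard
    shards[shard_for(doc_id, num_shards)][doc_id] = emb

top = sharded_search(shards, query=[1.0, 0.0], limit=2)
print(top)
```

Because every shard returns its own top results before the merge, the merged top-k matches what a single unsharded index would return.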

Vector Search Configuration

InMemoryExactNNUnibase

This database performs exact nearest neighbor searches with minimal configuration:

  • Workspace: The directory where data is stored.

# Use as a local library
InMemoryExactNNUnibase[MyDoc](workspace='./unibasedb')
# Serve as a service
InMemoryExactNNUnibase[MyDoc].serve(workspace='./unibasedb')

HNSWUnibase

This database uses the HNSW (Hierarchical Navigable Small World) algorithm from HNSWLib for approximate nearest neighbor searches. It offers several configurable parameters:

  • Workspace: The directory for storing and persisting data.
  • Space: The similarity metric (l2, ip, or cosine). Default is l2.
  • Max Elements: The initial index capacity, which can grow dynamically. Default is 1024.
  • ef_construction: Controls the speed/accuracy trade-off during index construction. Default is 200.
  • ef: Controls the query time/accuracy trade-off. Default is 10.
  • M: The maximum number of outgoing connections in the graph. Default is 16.
  • allow_replace_deleted: Enables replacement of deleted elements. Default is False.
  • num_threads: The number of threads used for index and search operations. Default is 1.
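
The three Space options correspond to different distance functions. The sketch below writes them out in plain Python following the definitions in the HNSWLib documentation (squared L2, one minus inner product, one minus cosine similarity). That unibasedb passes the option through to HNSWLib unchanged is an assumption.

```python
import math


def l2_distance(a, b):
    # 'l2': squared Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip_distance(a, b):
    # 'ip': one minus the inner product, so a larger dot product is "closer"
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    # 'cosine': one minus cosine similarity, which ignores vector length
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return 1.0 - dot / norms


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice the length

print(l2_distance(a, b), ip_distance(a, b), cosine_distance(a, b))
```

The example vectors point in the same direction but differ in length, so cosine treats them as identical while l2 and ip do not. This is why cosine (or ip on normalized vectors) is a common choice for text embeddings.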

Command Line Interface

unibase includes a straightforward CLI for serving and deploying your database:

Description              Command
Serve your DB locally    unibasedb serve --db example:db

Features

  • User-Friendly Interface: Designed for simplicity, unibase caters to users of all skill levels.
  • Minimalistic Design: Focuses on essential features, ensuring smooth transitions between local, server, and cloud environments.
  • Full CRUD Support: Comprehensive support for indexing, searching, updating, and deleting operations.
  • DB as a Service: Supports gRPC, HTTP, and WebSocket protocols for efficient database serving and operations.
  • Scalability: Features like sharding and replication enhance performance, availability, and throughput.
  • Serverless Capability: Supports serverless deployment for optimal resource utilization.
  • Multiple Search Algorithms: Offers both exact and Approximate Nearest Neighbor (ANN) search:
    • InMemoryExactNNUnibase: For exact nearest neighbor searches.
    • HNSWUnibase: Based on the HNSW algorithm for efficient approximate searches.

Roadmap

We have exciting plans for unibase! Here’s what’s in the pipeline:

  • More ANN Algorithms: Expanding support for additional ANN search algorithms.
  • Enhanced Filtering: Improving filtering capabilities for more precise searches.
  • Customizability: Making unibase highly customizable to meet specific user needs.
  • Expanded Serverless Capacity: Enhancing serverless deployment options in the cloud.
