The Python Unibase. Build your vector database, from a local library to a scalable database in the cloud
the unibasedb vector database component
unibase is a vector database offering a comprehensive suite of CRUD (Create, Read, Update, Delete) operations and robust scalability options, including sharding and replication. It is deployable across various environments: from local development to on-premise servers and the cloud.
By leveraging the power of DocArray for vector logic and a powerful serving layer, unibasedb provides a lean, Pythonic design tailored for performance without unnecessary complexity.
🚀 Install
pip install unibasedb
🎯 Getting started with unibase locally
This example demonstrates how to use unibasedb to build a Book Recommendation Agent. The agent retrieves similar books based on a user's query, highlighting the dynamic, reasoning-based applications of unibasedb.
Step 1: Define a Document Schema
We begin by defining the schema for our data using DocArray. In this example, our data consists of books with attributes such as title, author, description, and a vector embedding.
from docarray import BaseDoc
from docarray.typing import NdArray

# Define a schema for books
class BookDoc(BaseDoc):
    title: str  # Title of the book
    author: str  # Author of the book
    description: str  # A brief description of the book
    embedding: NdArray[128]  # 128-dimensional embedding for the book
This schema lays the foundation for how data will be stored and queried in unibasedb.
Step 2: Initialize the Database and Index Data
Next, we initialize a database and populate it with some simulated book data.
from docarray import DocList
import numpy as np
from unibasedb import InMemoryExactNNUnibase

# Initialize the database
db = InMemoryExactNNUnibase[BookDoc](workspace='./book_workspace')

# Generate book data and index it
book_list = [
    BookDoc(
        title=f"Book {i}",
        author=f"Author {chr(65 + i % 26)}",  # Rotate through letters A-Z
        description=f"A fascinating story of Book {i}.",
        embedding=np.random.rand(128)  # Simulated embedding
    )
    for i in range(100)  # Create 100 books
]
db.index(inputs=DocList[BookDoc](book_list))
Explanation:
- Database Initialization: We use InMemoryExactNNUnibase to create an in-memory database. The workspace parameter specifies where data is stored.
- Indexing Data: A list of 100 books with fake data (placeholder titles, authors, and descriptions, plus random embeddings) is created and indexed into the database.
Step 3: Simulate a Book Recommendation Agent
We create a simple agent that accepts a user query and retrieves similar books from the database.
# Step 3: Simulate an AI agent
class BookRecommendationAgent:
    def __init__(self, database):
        self.database = database

    def recommend_books(self, query_text: str, query_embedding: np.ndarray, limit=5):
        # Simulate reasoning: Query the database for recommendations
        query_doc = BookDoc(
            title="User Query",
            author="N/A",
            description=query_text,
            embedding=query_embedding
        )
        results = self.database.search(inputs=DocList[BookDoc]([query_doc]), limit=limit)
        # Process results
        recommendations = [
            {
                "title": result.title,
                "author": result.author,
                "description": result.description
            }
            for result in results[0].matches
        ]
        return recommendations
Explanation:
- The BookRecommendationAgent encapsulates logic for querying the database and processing results.
- It takes a user's query (text and embedding) and searches the database for similar books.
Step 4: Query the Agent with User Input
Finally, we simulate user input and use the agent to retrieve recommendations.
# Step 4: Use the agent
agent = BookRecommendationAgent(db)
# Simulated user input
user_query = "A gripping tale of adventure and discovery."
user_embedding = np.random.rand(128) # Simulated embedding for the query
recommendations = agent.recommend_books(query_text=user_query, query_embedding=user_embedding, limit=3)
# Step 5: Display recommendations
print("Recommended books:")
for i, rec in enumerate(recommendations, start=1):
    print(f"{i}. Title: {rec['title']}, Author: {rec['author']}, Description: {rec['description']}")
Explanation:
- User Input: The user provides a query (e.g., "A gripping tale of adventure and discovery") and a simulated embedding.
- Recommendations: The agent queries the database and retrieves the top 3 similar books based on the embedding.
- Display Results: The results are formatted and printed for the user.
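Because the walkthrough uses random vectors, the returned books are not actually related to the query text. As a rough illustration of where a real embedding would plug in, the sketch below builds a deterministic 128-dimensional vector from text with a hashed bag-of-words. The function name toy_embedding and the hashing scheme are assumptions made for this example; they are not part of unibasedb or DocArray, and a trained embedding model would normally replace them.

```python
import hashlib
import math

def toy_embedding(text, dim=128):
    """Hypothetical stand-in for an embedding model: hashed bag-of-words."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Hash each token to a stable bucket so the same text always
        # produces the same vector.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    # Normalize to unit length; fall back to 1.0 for empty text.
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

query_vector = toy_embedding("A gripping tale of adventure and discovery.")
```

The resulting list could be converted with np.array(query_vector) and passed as query_embedding in place of np.random.rand(128).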
Getting started with unibase as a service
unibasedb is designed to be easily served as a service, supporting gRPC, HTTP, and Websocket communication protocols.
Server Side
Server Side Example: A Book Recommendation Database. On the server side, you would start the service as follows.
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np
from unibasedb import InMemoryExactNNUnibase

# Define a Document schema for book information
class BookDoc(BaseDoc):
    title: str
    author: str
    description: str
    embedding: NdArray[128]  # Example: A 128-dimensional vector representing the book's content

# Initialize the database
db = InMemoryExactNNUnibase[BookDoc](workspace='./books_workspace')

# Generate fake data for books and index it
book_list = [
    BookDoc(
        title=f"Book {i}",
        author=f"Author {chr(65 + i % 26)}",
        description=f"A fascinating description of Book {i}.",
        embedding=np.random.rand(128)  # Random embeddings for demonstration
    )
    for i in range(100)  # Simulate 100 books
]
db.index(inputs=DocList[BookDoc](book_list))

# Serve the database as a gRPC service
if __name__ == '__main__':
    print("Starting the Book Recommendation Database...")
    with db.serve(protocol='grpc', port=12345, replicas=1, shards=1) as service:
        print("Book Recommendation Database is running on gRPC://localhost:12345")
        print("You can now query the database from a client.")
        service.block()
This script starts unibase as a service on port 12345, using the gRPC protocol with 1 replica and 1 shard.
Client Side
Once the Book Recommendation Database is running as a service on the server, you can access it from a client application. Here's how to query the service for recommendations:
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np
from unibasedb import Client

# Define the same schema used on the server
class BookDoc(BaseDoc):
    title: str
    author: str
    description: str
    embedding: NdArray[128]

# Instantiate a client connected to the server
# Replace '0.0.0.0' with the actual IP address of the server
client = Client[BookDoc](address='grpc://0.0.0.0:12345')

# Create a query book
query_book = BookDoc(
    title="User Query",
    author="N/A",
    description="An epic story of friendship and courage.",
    embedding=np.random.rand(128)  # Simulated query embedding
)

# Perform a search query
results = client.search(inputs=DocList[BookDoc]([query_book]), limit=5)

# Display the search results
print("Top 5 similar books:")
for match in results[0].matches:
    print(f"Title: {match.title}, Author: {match.author}, Description: {match.description}")
Explanation of the Code:
- Schema Definition: The client defines the same BookDoc schema used on the server to ensure compatibility.
- Connecting to the Server: The Client connects to the grpc://0.0.0.0:12345 address. Replace 0.0.0.0 with the server's actual IP address if running on a remote machine.
- Query Creation: The query_book represents a user's input, including a description and an embedding that simulates the query vector.
- Performing the Search: The search method sends the query to the server and retrieves the top 5 similar books.
- Displaying Results: The retrieved matches are printed, showing the titles, authors, and descriptions of the recommended books.
Advanced Topics
What is a vector database?
A vector database is a specialized type of database designed to store, manage, and retrieve vector embeddings—numerical representations of data such as text, images, audio, or other complex objects. Unlike traditional databases that rely on exact matches or keyword searches, vector databases excel at performing similarity searches. They use advanced algorithms to find data points that are semantically or contextually similar to a given query, even if the exact match doesn't exist.
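To make "similarity search" concrete, here is a minimal brute-force sketch in plain Python that ranks stored vectors by cosine similarity to a query. The three-dimensional vectors and the helper names (cosine_similarity, nearest) are invented for illustration; this shows the general idea only and is not how unibasedb is implemented internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, items, limit=2):
    """Return the `limit` (name, score) pairs most similar to `query`."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

# Hypothetical 3-dimensional embeddings; real ones have far more dimensions.
books = {
    "sea voyage": [0.9, 0.1, 0.0],
    "mountain trek": [0.8, 0.3, 0.1],
    "tax handbook": [0.0, 0.2, 0.9],
}

results = nearest([1.0, 0.2, 0.0], books)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

No stored vector equals the query, yet the two travel-themed entries rank first because their vectors point in nearly the same direction as the query.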
CRUD Support
Both local library usage and client-server interactions in unibase share the same API, providing index, search, update, and delete functionalities:
- Index: Accepts a DocList to index.
- Search: Takes a DocList of batched queries or a single BaseDoc as a query. Returns results with matches and scores, sorted by relevance.
- Delete: Accepts a DocList of documents to remove from the index. Only the id attribute is required, so ensure you track indexed IDs for deletion.
- Update: Replaces existing documents in the index with new attributes and payloads from the input DocList.
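The semantics of the four operations can be sketched with a toy, id-keyed store in plain Python. The ToyIndex class below is a hypothetical model written for this example, not the unibasedb implementation; in particular, ignoring updates for unknown ids and ranking by squared Euclidean distance are choices of the sketch, not documented library behavior.

```python
class ToyIndex:
    """Toy id-keyed store mirroring index/search/update/delete."""

    def __init__(self):
        self._docs = {}

    def index(self, docs):
        for doc in docs:
            self._docs[doc["id"]] = doc

    def update(self, docs):
        # Replace the stored document that has the same id, if any.
        for doc in docs:
            if doc["id"] in self._docs:
                self._docs[doc["id"]] = doc

    def delete(self, docs):
        # Only the id of each document is consulted.
        for doc in docs:
            self._docs.pop(doc["id"], None)

    def search(self, query, limit=1):
        # Rank by squared Euclidean distance, closest first.
        def distance(doc):
            return sum((q - v) ** 2 for q, v in zip(query, doc["embedding"]))
        return sorted(self._docs.values(), key=distance)[:limit]

store = ToyIndex()
store.index([
    {"id": "a", "title": "Book A", "embedding": [0.0, 0.0]},
    {"id": "b", "title": "Book B", "embedding": [1.0, 1.0]},
])
store.update([{"id": "a", "title": "Book A (2nd ed.)", "embedding": [0.0, 0.1]}])
store.delete([{"id": "b"}])  # the id alone is enough
remaining = store.search([0.0, 0.0], limit=5)
```

After these calls only the updated version of document "a" remains, which is why tracking the ids of indexed documents matters for later deletes and updates.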
Service Endpoint Configuration
You can configure and serve unibase with the following parameters:
- Protocol: The communication protocol, which can be gRPC, HTTP, websocket, or a combination. Default is gRPC.
- Port: The port(s) for accessing the service. Can be a single port or a list of ports for multiple protocols. Default is 8081.
- Workspace: The directory where the database persists its data. Default is the current directory (.).
Scaling Your Database
unibase supports two key scaling parameters for deployment:
- Shards: The number of data shards. Each document is indexed in exactly one shard, so every shard holds a smaller index, which reduces latency. Search queries are distributed across all shards, and results are merged.
- Replicas: The number of database replicas. Using the Raft algorithm, unibase synchronizes indexes across replicas, improving availability and search throughput.
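The shard behavior described above (each document lives in one shard, queries fan out to every shard, results are merged) can be sketched in plain Python. Routing by a CRC32 hash of the document id is an assumption made for this example; the source does not say how unibase assigns documents to shards, and all names below are hypothetical.

```python
import heapq
import zlib

NUM_SHARDS = 3
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(doc_id):
    # Stable hash so the same id always maps to the same shard.
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS

def index(doc_id, vector):
    # Each document is stored in exactly one shard.
    shards[shard_for(doc_id)][doc_id] = vector

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search_shard(shard, query, limit):
    # Local top-`limit` hits of one shard as (distance, doc_id) pairs.
    scored = ((squared_l2(query, vec), doc_id) for doc_id, vec in shard.items())
    return heapq.nsmallest(limit, scored)

def search(query, limit):
    # Fan out to every shard, then merge the partial results.
    partial = [hit for shard in shards for hit in search_shard(shard, query, limit)]
    return heapq.nsmallest(limit, partial)

for i in range(10):
    index(f"book-{i}", [float(i), 0.0])

top = search([2.2, 0.0], limit=3)
```

Each shard only needs to return its own best `limit` hits, because any document in the global top `limit` is necessarily in the top `limit` of the shard that holds it.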
Vector Search Configuration
InMemoryExactNNUnibase
This database performs exact nearest neighbor searches with minimal configuration:
- Workspace: The directory where data is stored.
InMemoryExactNNUnibase[MyDoc](workspace='./unibasedb')
InMemoryExactNNUnibase[MyDoc].serve(workspace='./unibasedb')
HNSWUnibase
This database uses the HNSW (Hierarchical Navigable Small World) algorithm from HNSWLib for approximate nearest neighbor searches. It offers several configurable parameters:
- Workspace: The directory for storing and persisting data.
- Space: The similarity metric (l2, ip, or cosine). Default is l2.
- Max Elements: The initial index capacity, which can grow dynamically. Default is 1024.
- ef_construction: Controls the speed/accuracy trade-off during index construction. Default is 200.
- ef: Controls the query time/accuracy trade-off. Default is 10.
- M: The maximum number of outgoing connections in the graph. Default is 16.
- allow_replace_deleted: Enables replacement of deleted elements. Default is False.
- num_threads: The number of threads used for index and search operations. Default is 1.
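For orientation on the Space parameter, the three metrics can be written out in plain Python following the distance definitions in the HNSWLib documentation: squared L2, one minus the inner product, and one minus cosine similarity. Whether unibase reports exactly these values as scores is not stated in the source, so treat this as a sketch of the metrics themselves rather than of the library's output.

```python
import math

def l2(a, b):
    # Squared Euclidean distance (no square root is taken).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ip(a, b):
    # Inner-product distance: 1 minus the dot product.
    return 1.0 - sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine distance: 1 minus the cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

# Two vectors with the same direction but different lengths.
a, b = [1.0, 0.0], [2.0, 0.0]
distances = {"l2": l2(a, b), "ip": ip(a, b), "cosine": cosine(a, b)}
print(distances)
```

The example vectors point the same way, so cosine treats them as identical while l2 and ip still distinguish them by length. That is the practical consideration when choosing a metric for embeddings that are not normalized.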
Command Line Interface
unibase includes a straightforward CLI for serving and deploying your database:
| Description | Command |
|---|---|
| Serve your DB locally | unibasedb serve --db example:db |
Features
- User-Friendly Interface: Designed for simplicity, unibase caters to users of all skill levels.
- Minimalistic Design: Focuses on essential features, ensuring smooth transitions between local, server, and cloud environments.
- Full CRUD Support: Comprehensive support for indexing, searching, updating, and deleting operations.
- DB as a Service: Supports gRPC, HTTP, and Websocket protocols for efficient database serving and operations.
- Scalability: Features like sharding and replication enhance performance, availability, and throughput.
- Serverless Capability: Supports serverless deployment for optimal resource utilization.
- Multiple ANN Algorithms: Offers a variety of Approximate Nearest Neighbor (ANN) algorithms, including:
  - InMemoryExactNNUnibase: For exact nearest neighbor searches.
  - HNSWUnibase: Based on the HNSW algorithm for efficient approximate searches.
Roadmap
We have exciting plans for unibase! Here’s what’s in the pipeline:
- More ANN Algorithms: Expanding support for additional ANN search algorithms.
- Enhanced Filtering: Improving filtering capabilities for more precise searches.
- Customizability: Making unibase highly customizable to meet specific user needs.
- Expanded Serverless Capacity: Enhancing serverless deployment options in the cloud.