A package for text similarity and embeddings

Vector Nest 🪺

Installation

pip install vector_nest

⚡ Project Details

Vector Nest is a database management system for vector embeddings and their metadata. Its main functions are creating a database, adding data (text, metadata, and embeddings), and querying by cosine similarity. It is intended for AI applications that need to search, filter, and organize large amounts of vector data.

Key Features:

  • Create a database: You can create a new database with either overwrite or append mode.
  • Create a collection: Define a collection to store documents (such as research papers) and their embeddings.
  • Add data: Add synthetic or real data to the collection, including associated metadata and vector embeddings.
  • Search and retrieve: Use cosine similarity to retrieve the documents most relevant to a query. Metadata filters such as author or category can be applied.
  • Advanced queries: Support for setting a similarity threshold to filter out low-relevance results.
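
To make the retrieval features above concrete, here is a minimal sketch of cosine-similarity ranking with metadata filters, a similarity threshold, and a top-n cut-off. It is not Vector Nest's implementation: the `retrieve` function, the in-memory `records` layout, and the two-dimensional toy embeddings are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)

def retrieve(records, query_embedding, filters=None, top_n=5, threshold=None):
    """Rank records by cosine similarity to the query embedding.

    `records` is a list of dicts with 'text', 'metadata', and 'embedding' keys
    (a hypothetical in-memory layout, not the package's storage format).
    """
    filters = filters or {}
    results = []
    for rec in records:
        # Keep only records whose metadata matches every filter exactly.
        if any(rec["metadata"].get(key) != value for key, value in filters.items()):
            continue
        score = cosine_similarity(query_embedding, rec["embedding"])
        # Drop low-relevance results when a threshold is given.
        if threshold is not None and score < threshold:
            continue
        results.append({"text": rec["text"], "metadata": rec["metadata"], "similarity": score})
    results.sort(key=lambda r: r["similarity"], reverse=True)
    return results[:top_n]

# Toy data with hand-written 2-D embeddings (illustrative only).
records = [
    {"text": "paper A", "metadata": {"category": "AI"}, "embedding": [1.0, 0.0]},
    {"text": "paper B", "metadata": {"category": "AI"}, "embedding": [1.0, 1.0]},
    {"text": "paper C", "metadata": {"category": "Blockchain"}, "embedding": [1.0, 0.0]},
    {"text": "paper D", "metadata": {"category": "AI"}, "embedding": [0.0, 1.0]},
]
hits = retrieve(records, [1.0, 0.0], filters={"category": "AI"}, top_n=5, threshold=0.5)
for hit in hits:
    print(hit["text"], round(hit["similarity"], 3))
# paper A 1.0
# paper B 0.707
```

In this sketch, "paper C" is excluded by the category filter and "paper D" by the threshold, so only two results remain even though `top_n` is 5.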

Example Usage

⚡ Creating and adding data to a collection:

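import random

# Assumption: the VectorNest class is importable from the vector_nest package;
# adjust this import if the package exposes it under a different module.
from vector_nest import VectorNest

# Initialize the VectorNest manager
manager = VectorNest()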

# Step 1: Create a database named 'research_database' with mode='overwrite' or 'append'
db_name = 'research_database'
manager.create_database(db_name, mode='overwrite')  # Use 'overwrite' to start fresh or 'append' to keep existing data

# Select the database for all subsequent operations
manager.use_database(db_name)

# Step 2: Create a collection for storing research papers, with mode='overwrite' or 'append'
collection_name = 'research_papers'
manager.create_collection(collection_name, mode='overwrite')  # 'overwrite' replaces existing collection, 'append' keeps it if it exists

# Step 3: Generate synthetic data for research papers
authors = ["Alice Johnson", "Bob Smith", "Carol Lee", "David Wu", "Eve Brown"]
categories = ["AI", "Data Science", "Quantum Computing", "Cybersecurity", "Blockchain"]
publication_years = [2019, 2020, 2021, 2022, 2023]

def generate_fake_abstract(category):
    return f"This paper discusses advancements in {category}. It covers recent trends, methodologies, and potential future applications."

# Step 4: Add synthetic research papers to the collection
for i in range(50):  # Adding 50 synthetic papers
    title = f"Research Paper {i+1}"
    category = random.choice(categories)
    author = random.choice(authors)
    year = random.choice(publication_years)
    abstract = generate_fake_abstract(category)
    
    metadata = {
        "title": title,
        "author": author,
        "year": str(year),
        "category": category
    }
    manager.add_to_collection(collection_name, text=abstract, metadata=metadata)
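
The comments above describe the two modes as "overwrite" (start fresh) and "append" (keep what exists). As a rough illustration of those semantics only, the sketch below applies them to a plain dictionary. The function `create_collection_sketch` and the dict-based `store` are hypothetical and say nothing about how Vector Nest persists data.

```python
def create_collection_sketch(store, name, mode="append"):
    """Create collection `name` in `store`, a plain dict standing in for a database."""
    if mode not in ("overwrite", "append"):
        raise ValueError(f"unknown mode: {mode!r}")
    # 'overwrite' always resets; 'append' creates the collection only if it is missing.
    if mode == "overwrite" or name not in store:
        store[name] = []
    return store[name]

store = {}
create_collection_sketch(store, "research_papers", mode="overwrite").append("paper 1")
create_collection_sketch(store, "research_papers", mode="append").append("paper 2")
print(store["research_papers"])  # ['paper 1', 'paper 2']

create_collection_sketch(store, "research_papers", mode="overwrite")
print(store["research_papers"])  # []
```

The practical consequence is that rerunning a setup script with mode='overwrite' discards earlier documents, while mode='append' accumulates them across runs.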

⚡ Retrieving from collection:

Example 1: Retrieve top 5 research papers similar to a specific topic, filtering by category

query_text = "advancements in AI"
filters = {"category": "AI"}
top_n = 5
retrieved_texts = manager.retrieve_from_collection(collection_name, query_text, filters=filters, top_n=top_n)

print("\nTop 5 research papers similar to the query in the 'AI' category:")
for result in retrieved_texts:
    print(f"Text: {result['text']}\nMetadata: {result['metadata']}\nSimilarity: {result['similarity']}\n")

⚡ Close the database connection:

manager.close_connection()
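
If a query raises an exception, the explicit close_connection() call is never reached. One possible way to guard against that is a small context manager, sketched below. The helper `open_database` is not part of Vector Nest, and `FakeManager` is a hypothetical stand-in so the example runs without the package. The sketch assumes only the `use_database` and `close_connection` methods shown in this README.

```python
from contextlib import contextmanager

@contextmanager
def open_database(manager, db_name):
    """Select `db_name` on `manager` and always close the connection afterwards."""
    manager.use_database(db_name)
    try:
        yield manager
    finally:
        # Runs on normal exit and when the with-block raises.
        manager.close_connection()

class FakeManager:
    """Hypothetical stand-in that records calls instead of touching a database."""
    def __init__(self):
        self.events = []

    def use_database(self, name):
        self.events.append(f"use {name}")

    def close_connection(self):
        self.events.append("close")

fake = FakeManager()
try:
    with open_database(fake, "research_database") as db:
        raise RuntimeError("query failed")  # simulate a failing query
except RuntimeError:
    pass

print(fake.events)  # ['use research_database', 'close']
```

With a real manager, the body of the with-block would hold the retrieve_from_collection calls.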

Example 2: Retrieve up to 5 research papers by a specific author, across all categories, applying a minimum similarity threshold

db_name = 'research_database'
collection_name = 'research_papers'

manager.use_database(db_name)

query_text = "applications of blockchain in security"
filters = {'author': 'Carol Lee'}
top_n = 5
threshold = 0.01  # similarity threshold used to filter out low-relevance results
retrieved_texts = manager.retrieve_from_collection(collection_name, query_text, filters=filters, top_n=top_n, threshold=threshold)

print("\nUp to 5 research papers by 'Carol Lee' similar to the query, after applying the similarity threshold:")
for result in retrieved_texts:
    print(f"Text: {result['text']}\nMetadata: {result['metadata']}\nSimilarity: {result['similarity']}\n")

More Information

For a detailed explanation and walkthrough of this project, check out the blog post on my website:

Link to Blog Post

You can also watch the YouTube video on this project for further understanding:

YouTube Video Link
