A modular Python package for efficient embedding workflows and PostgreSQL-based vector store management with parent-child relationships.

These details have not been verified by PyPI

Project links

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

chATLAS Embed

ATLAS PostgreSQL Python HuggingFace

A modular Python package for efficient embedding workflows and PostgreSQL-based vector store management with parent-child relationships.

📚 Overview

chATLAS_Embed provides a flexible framework for creating and managing vector stores with PostgreSQL backend support. The package excels at document processing tasks through its modular architecture, enabling custom implementations of embedding models, vector stores, and text splitters. Originally designed at the ATLAS experiment at CERN for the chATLAS project.

🌟 Key Features

PostgreSQL vector extension compatibility for robust database operations
Hierarchical document storage with parent-child relationships
Extensible architecture supporting custom components:
- Text splitters
- Embedding models
- Vector stores
- Vector store creators
Comprehensive RAG pipeline examples
Seamless Langchain integration

🚀 Quick Start

📥 Installation

cd chATLAS_Embed
uv sync

💡 Basic Usage - Connecting to an existing vectorstore

from chATLAS_Embed import PostgresParentChildVectorStore
from chATLAS_Embed.EmbeddingModels import SentenceTransformerEmbedding

# Initialize embedding model
embedding_model = SentenceTransformerEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Set up vector store
vector_store = PostgresParentChildVectorStore(
    connection_string="postgresql://{user}:{password}@{PostgresServer}:{port}/{database_name}",
    embedding_model=embedding_model
)

# Search through the vectorstore
vector_store.search("What is the ATLAS Experiment?")

vector_store.close()  # Make sure to close the vectorstore once done with it (automatically closes when process ends)

🚀 Getting Started Locally

For a complete local setup with examples, see the quickstart/ directory:

cd chATLAS_Embed/quickstart
# Follow the README.md for step-by-step instructions

The quickstart includes:

Complete setup guide with Docker
Working example scripts
Database initialization
Troubleshooting tips

Installation
Requirements
Usage
Hybrid Retrieval
Extending chATLAS_Embed
Current Implementations
Project Structure and Imports
Change Log

Before starting, when using local DB instances

Install postgresql and start PostgreSQL service (on Linux)

sudo apt install postgresql postgresql-contrib
sudo systemctl start postgresql

Install pgvector following the instructions provided in the repository:

cd /tmp
git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install

Then connect to the psql shell:

sudo -u postgres psql

psql -U postgres -h localhost

and for each database where you want the extension to be used, run the following command:

CREATE DATABASE my_new_database;
\c my_new_database # change to the name of your DB
CREATE EXTENSION vector;

At this point one can use the python classes provided in this package to manipulate the database. Notice that it is possible to operate on the database with a user different from the default one (postgres). To be able to do so, connect to the psql shell and create a new user and grant privileges on the database of interest:

CREATE USER myuser WITH ENCRYPTED PASSWORD 'mypassword';
GRANT ALL PRIVILEGES ON DATABASE my_vector_db TO myuser;

Usage

A. Creating a Vector Store

1. Initialize Components

Define the key components for creating a vector store, including the embedding model, text splitters, and the vector store itself.

from chATLAS_Embed import RecursiveTextSplitter, PostgresParentChildVectorStore
from chATLAS_Embed.EmbeddingModels import SentenceTransformerEmbedding

# Define the embedding model
embedding_model = SentenceTransformerEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Define text splitters for parent and child documents
parent_splitter = RecursiveTextSplitter(chunk_size=2048, chunk_overlap=24)
child_splitter = RecursiveTextSplitter(chunk_size=256, chunk_overlap=24)

# Initialize the vector store
vector_store = PostgresParentChildVectorStore(
    connection_string="postgresql://{user}:{password}@{PostgresServer}:{port}/{database_name}",
    embedding_model=embedding_model
)

2. Create and Populate the Vector Store

Use the VectorStoreCreator with a document loader to populate the vector store with documents.

from chATLAS_Embed import VectorStoreCreator, DocumentChunker
from chATLAS_Embed.DocumentLoaders import TWikiTextDocumentLoader
from pathlib import Path

# Initialize the document loader and chunker
document_loader = TWikiTextDocumentLoader()
document_chunker = DocumentChunker(
    parent_splitter=parent_splitter,
    child_splitter=child_splitter
)
twiki_creator = VectorStoreCreator(
    vector_store=vector_store,
    document_loader=document_loader,
    document_chunker=document_chunker,
    output_dir=Path("./output")
)

# Process and add documents to the vector store
twiki_creator.create_update_vectorstore(
    input_path=Path("path/to/documents"),
    update=True,  # Update existing documents if newer versions exist
    verbose=True  # Print details of documents added or updated
)

B. Querying an Existing Database

1. Connect to the Database

Establish a connection to the existing vector store.

db = PostgresParentChildVectorStore(
    connection_string="postgresql://{user}:{password}@{PostgresServer}:{port}/{database_name}",
    embedding_model=embedding_model
)

2. Search the Database

Simple Search

Retrieve the top k most relevant documents for a given query and top k_text documents from the hybrid text searching.

results = db.search("What is the ATLAS experiment?", k=4, k_text=2)

Advanced Search

Filter search results by metadata and date.

results = db.search(
    query="What is the ATLAS experiment?",
    k=4,
    date_filter="01-09-2014",  # Retrieve documents modified after this date (dd-mm-YYYY)
    metadata_filters={
        "type": ["twiki", "CDS"],  # Either a twiki doc or a CDS doc
        "name": "Document Name",
        "url": "URL to webpage",
        "last_modification": "dd-mm-YYYY",  # Retrieve documents from this specific month
        "category": ["CERN", "ATLAS"]  # If your documents have a category metadata setup can search for containment.
    }
)

C. Closing the database properly

# Option 1: Explicit cleanup
vector_store = PostgresParentChildVectorStore(connection_string, embedding_model)
# Use vector_store
vector_store.close()  # Important!

# Option 2: Delete the vectorstore object if it won't be used again
vector_store = PostgresParentChildVectorStore(connection_string, embedding_model)
# Use vector_store
del vector_store  # Important!


# Option 2: Context manager
with PostgresParentChildVectorStore(connection_string, embedding_model) as vector_store:
    # Use vector_store
    # Connections automatically closed when exiting the block

D. Integration with LangChain

Easily integrate the vector store into a LangChain pipeline.

from chATLAS_Embed import LangChainVectorStore

# Convert the vector store for use with LangChain
db_langchain = LangChainVectorStore(vector_store=db)

docs = db_langchain.invoke("What is ATLAS?", config={"metadata": {"k": 2, "k_text": 2, "date_filter": "01-10-2020"}})

This then works the 'same' as a regular langchain retriever, though not all implementations have been tested.

Note that documents retrieved from the langchain vector store are only the parent documents. The similarity for each document is stored in its metadata by:

docs[0].metadata["similarity"]

Notes:

Replace placeholders like {user}, {password}, {PostgresServer}, {port}, and {database_name} with actual connection details.
Ensure documents have metadata (e.g., last_modification) defined when using filters.
Refer to details section below for full example integration with langchain.

Full RAG Pipeline w/ Langchain Integration

Full RAG Pipeline example using Langchian wrapper

"""
Example usage of the DB with a full Langchain RAG chain

USES Memory in its responses storing currently just the previous answers given by the LLM for now
"""
# IMPORTS
from pathlib import Path

from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationSummaryBufferMemory
from langchain.prompts import PromptTemplate

from chATLAS_Embed.EmbeddingModels import SentenceTransformerEmbedding
from chATLAS_Embed import LangChainVectorStore, PostgresParentChildVectorStore




def create_rag_chain(vectorstore: LangChainVectorStore) -> ConversationalRetrievalChain:
   """Create a RAG chain using the vector store."""

   # Initialize language model
   llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

   # Create conversation memory
   memory = ConversationSummaryBufferMemory(
      memory_key="chat_history",
      return_messages=True,
      output_key="answer",
      llm=llm
   )

   # Create custom prompts
   qa_template = """You are a helpful particle physics assistant at the ATLAS experiment in the Large Hadron Collider. \
Your colleagues are asking questions about information contained in the documents of ATLAS experiment. \
You will be given selected paragraphs from ATLAS documents. You need to provide a comprehensive summary of the given
 text that can be useful to help answering this question. The summary should cover all the key points and main ideas
 presented in the original text that helps answering the question, while also condensing the information into a concise
and easy-to-understand format. Please ensure that the summary includes all the relevant jargons, values, and as much details as possible, \
while avoiding any unnecessary repetition.\n
    Use the following pieces of context to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    Context:
    {context}

    Chat History: {chat_history}
    Question: {question}
    Helpful Answer:"""

   qa_prompt = PromptTemplate(
      input_variables=["context", "chat_history", "question"],
      template=qa_template
   )

   # Create the chain
   chain = ConversationalRetrievalChain.from_llm(
      llm=llm,
      retriever=vectorstore,
      memory=memory,
      combine_docs_chain_kwargs={"prompt": qa_prompt},
      return_source_documents=True,
   )

   return chain


def main():
   # Configuration
   DB_CONNECTION = "postgresql://{user}:{password}@{PostgresServer}:{port}/{database_name}"
   OUTPUT_DIR = "./output"


   # Load pre-made vectorstore
   db = PostgresParentChildVectorStore(
      connection_string=DB_CONNECTION,
      embedding_model=SentenceTransformerEmbedding(
         model_name="sentence-transformers/all-MiniLM-L6-v2")
   )
   # Convert to langchain compatible
   vectorstore = LangChainVectorStore(vector_store=db)

   # Create RAG chain
   chain = create_rag_chain(vectorstore)

   # Example usage
   while True:
      question = input("\nEnter your question (or 'quit' to exit): ")
      if question.lower() == 'quit':
         break

      result = chain({"question": question})
      print("\nAnswer:", result["answer"])
      print("\nSources:")
      for doc in result["source_documents"]:
         print(f"- {doc.metadata['parent_metadata']['name']} ({doc.metadata['parent_metadata']['url']})")
         print(f"SIMILARITY: {doc.metadata['similarity']}")
         print("-"*20)
         print(doc.page_content)
         print("-"*30)


if __name__ == "__main__":
   main()

C. `Document` Dataclass

Returned by search and used to adding documents to the db is the Document dataclass (v. similar to langchain document):

@dataclass
class Document:
    """Represents a base document with its page_content and metadata."""
    page_content: str
    metadata: Dict[str, Any]
    id: str
    parent_id: Optional[str] = None

🔍 Hybrid Retrieval

The vector store implements a sophisticated hybrid search combining both semantic and lexical search capabilities for optimal retrieval performance.

Search Types

Semantic (Vector) Search
- Uses cosine similarity between embedded query and document vectors
- Finds documents with similar meaning/context regardless of exact wording
- Similarity scores range 0-1 (1 = highest similarity)
Lexical (Text) Search
- Utilizes PostgreSQL's native text search with tsvector/tsquery
- Implements weighted document ranking:
  - Document names (Weight A - highest priority)
  - Document topics and content (Weight B - medium priority)
- Scores normalized to 0-1 range

Usage

results = vector_store.search(
    query="search query",
    k=4,              # Number of vector search results
    k_text=2,         # Number of text search results
    metadata_filters={
        "type": ["paper", "image"],  # Returns docs of type paper OR image
        "experiment": "ATLAS"        # Exact match filtering
    },
    date_filter="01-01-2023"        # Only docs after this date (dd-mm-yyyy)
)

Results

Returns list of tuples: (child_doc, parent_doc, similarity_score)
Each parent_doc.metadata includes "search_type": "vector" or "text"
Results automatically deduplicated across search types
Final results limited to k + k_text total documents
Similarity scores normalized but not directly comparable between search types

Advanced Features

Fetches extra results initially to handle metadata filtering
Supports multi-value metadata filtering (OR conditions)
Date-based filtering using document modification dates
Text search considers:
- Term proximity and density
- Word stemming and stop word removal
- Document structure weights

Current Implementations

Vector Stores

PostgresParentChildVectorStore

This vector store implements a parent-child structure using PostgreSQL with the pgvector extension, designed for efficient document storage and retrieval. It replaces the previously used Chroma-based solution in Chatlas, addressing issues like the inability to list parent stores.

This implementation aligns with CERN IT standards by leveraging PostgreSQL for scalability and maintainability.

Example Performance Metrics

Using current batching methods, performance benchmarks for processing ~ 22k docs are as follows:

Time to Add Documents:
- Total time: 47 minutes
  - Embedding: ~21 minutes
  - Adding embeddings to the database: ~21 minutes
Storage Requirements (disk):
- Before embeddings: 1.5 GB
- After embeddings: 5.2 GB

Key Features

Click for methods and attributes

add_documents: Inserts parent and child documents into the database. Parent chunks (ID, content, metadata, parent ID) and child chunks (ID, content, metadata, parent ID) are stored in the documents table. Corresponding embeddings (ID, document ID, embedding) are added to the embeddings table. The pgvector extension is required for vector operations.
reindex_database: This can be used to update the index for the current search algorithm (IVFFlat, HNSW, or exact) or change to a different search algorithm. The default search algorithm is IVFFlat. For IVFFlat, calling reindex_database dynamically adjusts the number of lists in the IVFFlat index based on the total number of embeddings. The optimal number of lists is calculated as: optimal_lists = int(math.sqrt(count) + 0.05 * math.sqrt(count)) optimal_probes = int(optimal_lists * 0.14) Running this on the full dataset (~1 million embeddings) takes approximately 165 seconds. Regular optimization is recommended after significant data additions. To switch to HNSW, use reindex_database('HNSW'). Optionally, the desired m and ef_construction parameters can be specified. To switch to an exact search, use reindex_database('exact'). To switch back to IVFFlat, use reindex_database('IVFFlat'). WARNING: Switching the search algorithm and recreating/updating the vector store index affects all users of the vector store and may be disruptive to any ongoing queries. PostgresParentChildVectorStore uses the search algorithm that was last set via reindex_database and this algorithm remains active for all users until it is changed by calling reindex_database.
search: Performs a cosine similarity search using IVFFlat indexing to find the top k documents. Results include the child document, parent document, and similarity score. Supports metadata-based and date-based filtering:
- k: Number of results to return (default: 4).
- metadata_filters: Filters results by metadata tags (e.g., {"name": "AtlasModules"}).
- date_filter: Retrieves documents last modified after a specified date (e.g., "01-04-2010") - in format "dd-mm-YYYY".

VectorStoreCreator and DocumentLoaders

The package now uses a modular composition-based architecture with VectorStoreCreator and DocumentLoader classes:

VectorStoreCreator
- Core class that orchestrates the vector store creation process
- Uses dependency injection to work with any DocumentLoader implementation
- create_update_vectorstore method:
  - Adds documents to the database if they are not already added from a directory of whatever file type
  - If the document is already in the database and update=True param set, will check for every document in the input directory that is also in the dataset if the last_updated metadata key for the document stored is older than the new document in the directory and if there is a new version available it drops all chunks associated with the old document and replaces it with the new ones.

DocumentLoaders

TWikiHTMLDocumentLoader
- Processes HTML documents directly from a directory and loads them into the vectorstore.
- Follows the same method previously used to convert the HTML to text.
TWikiTextDocumentLoader
- Processes plain text TWiki documents.
CDSTextDocumentLoader
- Loads scraped CDS plain text documents.
GitlabMarkdownDocumentLoader
- Processes Markdown documents from GitLab repositories.
syncedTwikiDocumentLoader
- Handles synchronized TWiki documents with specific formatting.

Note: Legacy VectorStoreCreator classes (e.g., TWikiHTMLVectorStoreCreator) are still available but deprecated. Use the new modular approach with VectorStoreCreator + DocumentLoader for new projects.

Embedding Models

SentenceTransformerEmbedding: Standard sentence transformer embedding model adapted for this db setup.

Text Splitters

RecursiveTextSplitter: Bog standard langchain recursive text splitter built to follow modularity structure.
ParagraphTextSplitter WIP! - untested: Aims to implement a Text splitter that splits based on paragraphs rather than on token count.

Custom Implementation Example: Scientific Plot Database

Example workflow and implementation - simple version

Setting up new modules:

from chATLAS_Embed.Base import DocumentLoader, Document, TextSplitter
from chATLAS_Embed import VectorStoreCreator
from typing import List
from pathlib import Path


# Create a new document loader to load a new type of document
class PlotDocumentLoader(DocumentLoader):
    def load_documents(self, input_path: Path) -> List[Document]:
        """Load plots from the input path."""

        # Example workflow for creating documents for the plots
        documents = []
        for plot in input_path:
            doc = Document(
                page_content=plot.description,
                # Below metadata is only an example, redefine with whatever metadata you want
                metadata={
                    'type': 'plot',
                    'name': plot.name,
                    'url': plot.url,
                    'last_modification': plot.lastModification,  # Or created date
                    'title': plot.title
                    # plus any others
                },
                id=plot.paperName
            )
            documents.append(doc)

        return documents


# Create a new text splitter for parent doucment to contain the full description

class PlotParentTextSplitter(TextSplitter):

    def split(self, text: str) -> [str]:
        """Want full description in parent doc so can return full description as chunk"""
        return [text]

Actually creating the vectorstore

from chATLAS_Embed import SentenceTransformerEmbedding, RecursiveTextSplitter, PostgresParentChildVectorStore, VectorStoreCreator
from pathlib import Path

# Initialize components
embedding_model = SentenceTransformerEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

parent_splitter = PlotParentTextSplitter()


child_splitter = RecursiveTextSplitter(
    chunk_size=256, chunk_overlap=24  # All mini LM has a context window of 256 so unless you want to use a better model (e.g. multi-qa-mpnet-base-dot-v1) text needs to be split to this size
)

vector_store = PostgresParentChildVectorStore(
    connection_string="postgresql://{user}:{password}@{server}:{port}/plot_db",  # Change this for wherever you want to create the vectorstore
    embedding_model=embedding_model
)

# Create Plot vector store with document loader
document_loader = PlotDocumentLoader()
document_chunker = DocumentChunker(
    parent_splitter=parent_splitter,
    child_splitter=child_splitter
)
plot_creator = VectorStoreCreator(
    vector_store=vector_store,
    document_loader=document_loader,
    document_chunker=document_chunker,
    output_dir=Path("./output")  # Not currently used
)


# Process Plot documents
plot_creator.create_update_vectorstore(
  input_path=Path("path/to/plots",   # However you want to load or store plots
                  update=True)
)

Searching the database once created:

vector_store = PostgresParentChildVectorStore(
    connection_string="postgresql://{user}:{password}@{server}:{port}/plot_db",  # Change this for wherever you want to create the vectorstore
    embedding_model=embedding_model
)

vector_store.search("Find me a plot about track reconstruction at ATLAS")

Example workflow and implementation - more custom implementation

This is much more in depth rewrite of code building custom vector stores and ways of processing documents. It is in no way tested, just shows example workflow on how something like this could work!

NOTE: This is not tested or complete, just a guide of how something like this could work!!

from chATLAS_Embed.Base import VectorStore, Document
from typing import List, Dict, Any, Optional
import psycopg2
import json

class ScientificPlotVectorStore(VectorStore):
    def __init__(self, connection_string: str, embedding_model: EmbeddingModel):
        """
        Initialize the Scientific Plot Vector Store.

        Args:
            connection_string: PostgreSQL connection string
            embedding_model: Model for embedding plot descriptions
        """
        self.connection_string = connection_string
        self.embedding_model = embedding_model
        self._initialize_database()

    def _initialize_database(self):
        """Create necessary tables and indexes."""
        with psycopg2.connect(self.connection_string) as conn:
            with conn.cursor() as cur:
                # Create parent table for plots
                cur.execute("""
                    CREATE TABLE IF NOT EXISTS plot_documents (
                        id TEXT PRIMARY KEY,
                        plot_path TEXT NOT NULL,
                        title TEXT NOT NULL,
                        metadata JSONB,
                        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                    )
                """)

                # Create child table for plot descriptions
                cur.execute("""
                    CREATE TABLE IF NOT EXISTS plot_descriptions (
                        id TEXT PRIMARY KEY,
                        plot_id TEXT REFERENCES plot_documents(id),
                        description TEXT NOT NULL,
                        metadata JSONB,
                        embedding vector(384),
                        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                    )
                """)

                # Create indexes
                cur.execute("""
                    CREATE INDEX IF NOT EXISTS plot_metadata_idx
                    ON plot_documents USING GIN (metadata)
                """)
                cur.execute("""
                    CREATE INDEX IF NOT EXISTS desc_embedding_idx
                    ON plot_descriptions
                    USING ivfflat (embedding vector_cosine_ops)
                    WITH (lists = 100)
                """)

    def add_documents(self, parent_docs: List[Document], child_docs: List[Document]):
        """
        Add plot documents and their descriptions to the database.

        Args:
            parent_docs: List of plot Documents (containing plot paths)
            child_docs: List of description Documents
        """
        with psycopg2.connect(self.connection_string) as conn:
            with conn.cursor() as cur:
                # Add parent plot documents
                for doc in parent_docs:
                    cur.execute("""
                        INSERT INTO plot_documents (id, plot_path, title, metadata)
                        VALUES (%s, %s, %s, %s)
                        ON CONFLICT (id) DO UPDATE
                        SET plot_path = EXCLUDED.plot_path,
                            title = EXCLUDED.title,
                            metadata = EXCLUDED.metadata
                    """, (
                        doc.id,
                        doc.metadata.get('plot_path'),
                        doc.metadata.get('title'),
                        json.dumps(doc.metadata)
                    ))

                # Generate embeddings for descriptions
                descriptions = [doc.page_content for doc in child_docs]
                embeddings = self.embedding_model.embed(descriptions)

                # Add child description documents with embeddings
                for doc, embedding in zip(child_docs, embeddings):
                    cur.execute("""
                        INSERT INTO plot_descriptions
                        (id, plot_id, description, metadata, embedding)
                        VALUES (%s, %s, %s, %s, %s)
                        ON CONFLICT (id) DO UPDATE
                        SET description = EXCLUDED.description,
                            metadata = EXCLUDED.metadata,
                            embedding = EXCLUDED.embedding
                    """, (
                        doc.id,
                        doc.parent_id,
                        doc.page_content,
                        json.dumps(doc.metadata),
                        embedding.tolist()
                    ))

    def search(
        self,
        query: str,
        k: int = 4,
        metadata_filters: Optional[Dict[str, Any]] = None
    ) -> List[Document]:
        """
        Search for plots based on description similarity.

        Args:
            query: Search query
            k: Number of results to return
            metadata_filters: Optional filters for metadata fields

        Returns:
            List of Documents containing plot information and descriptions
        """
        # Generate query embedding
        query_embedding = self.embedding_model.embed([query])[0]

        with psycopg2.connect(self.connection_string) as conn:
            with conn.cursor() as cur:
                # Construct base query
                query_sql = """
                    SELECT
                        pd.id as desc_id,
                        pd.description,
                        pd.metadata as desc_metadata,
                        p.id as plot_id,
                        p.plot_path,
                        p.title,
                        p.metadata as plot_metadata,
                        1 - (pd.embedding <=> %s) as similarity
                    FROM plot_descriptions pd
                    JOIN plot_documents p ON pd.plot_id = p.id
                    WHERE 1=1
                """
                params = [query_embedding.tolist()]

                # Add metadata filters if provided
                if metadata_filters:
                    for key, value in metadata_filters.items():
                        query_sql += f" AND p.metadata->>{key} = %s"
                        params.append(value)

                # Add similarity ordering and limit
                query_sql += """
                    ORDER BY similarity DESC
                    LIMIT %s
                """
                params.append(k)

                # Execute search
                cur.execute(query_sql, params)
                results = cur.fetchall()

                # Convert results to Documents
                documents = []
                for result in results:
                    desc_id, description, desc_metadata, plot_id, plot_path, title, plot_metadata, similarity = result

                    # Create child document (description)
                    child_doc = Document(
                        page_content=description,
                        metadata={
                            **json.loads(desc_metadata),
                            'similarity': similarity
                        },
                        id=desc_id,
                        parent_id=plot_id
                    )

                    # Create parent document (plot)
                    parent_doc = Document(
                        page_content=f"Title: {title}\nPlot Path: {plot_path}",
                        metadata=json.loads(plot_metadata),
                        id=plot_id
                    )

                    documents.append((child_doc, parent_doc))

                return documents

    def delete(self, document_ids: Optional[List[str]] = None):
        """
        Delete documents from the database.

        Args:
            document_ids: Optional list of document IDs to delete
        """
        with psycopg2.connect(self.connection_string) as conn:
            with conn.cursor() as cur:
                if document_ids:
                    # Delete specific documents
                    cur.execute("""
                        DELETE FROM plot_descriptions
                        WHERE id = ANY(%s) OR plot_id = ANY(%s)
                    """, (document_ids, document_ids))
                    cur.execute("""
                        DELETE FROM plot_documents
                        WHERE id = ANY(%s)
                    """, (document_ids,))
                else:
                    # Delete all documents
                    cur.execute("DELETE FROM plot_descriptions")
                    cur.execute("DELETE FROM plot_documents")

Usage example for the Scientific Plot Vector Store:

# Initialize the store
plot_store = ScientificPlotVectorStore(
    connection_string="postgresql://user:pass@localhost:5432/plots_db",
    embedding_model=SentenceTransformerEmbedding()
)

# Create parent document for a plot
plot_doc = Document(
    page_content="",  # Empty as content is in metadata
    metadata={
        'plot_path': '/path/to/plot.png',
        'title': 'Higgs Boson Mass Distribution',
        'experiment': 'ATLAS',
        'year': '2023'
    },
    id='plot_001'
)

# Create child documents for plot descriptions
desc_docs = [
    Document(
        page_content="Distribution of invariant mass showing clear peak at 125 GeV",
        metadata={'section': 'results'},
        id='desc_001',
        parent_id='plot_001'
    ),
    Document(
        page_content="Signal region with background subtraction applied",
        metadata={'section': 'analysis'},
        id='desc_002',
        parent_id='plot_001'
    )
]

# Add documents to store
plot_store.add_documents([plot_doc], desc_docs)

# Search for plots
results = plot_store.search(
    query="mass distribution peak",
    k=5,
    metadata_filters={'experiment': 'ATLAS'}
)

🔧 Development Status

Current development priorities:

HNSW indexing implementation
Bulk deletion optimization
Advanced filtering capabilities
Function to check connection to datastore
Parent-child relationship support
PostgreSQL integration

📄 License

chATLAS_Embed is released under Apache 2.0 license.

Made with ❤️ by the ATLAS Collaboration

For questions and support, please contact

Project details

These details have not been verified by PyPI

Project links

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

0.1.20

Apr 9, 2026

This version

0.1.19

Aug 26, 2025

0.1.18

Jun 6, 2025

0.1.17

May 28, 2025

0.1.16

Apr 17, 2025

0.1.15

Apr 2, 2025

0.1.14

Mar 21, 2025

0.1.13

Feb 28, 2025

0.1.12

Feb 28, 2025

0.1.11

Feb 23, 2025

0.1.10

Feb 23, 2025

0.1.9

Feb 21, 2025

0.1.8

Feb 5, 2025

0.1.7

Feb 1, 2025

0.1.6

Jan 31, 2025

0.1.5

Jan 31, 2025

0.1.4

Jan 30, 2025

0.1.3

Jan 28, 2025

0.1.2

Jan 28, 2025

0.1.1

Jan 24, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chatlas_embed-0.1.19.tar.gz (134.2 kB view details)

Uploaded Aug 26, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

chatlas_embed-0.1.19-py3-none-any.whl (129.9 kB view details)

Uploaded Aug 26, 2025 Python 3

File details

Details for the file chatlas_embed-0.1.19.tar.gz.

File metadata

Download URL: chatlas_embed-0.1.19.tar.gz
Upload date: Aug 26, 2025
Size: 134.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.13

File hashes

Hashes for chatlas_embed-0.1.19.tar.gz
Algorithm	Hash digest
SHA256	`106650e86b6045b5c074d5af6a776cb7b26cfe50a53a42f310f9f48672e96a6a`
MD5	`ad18282fcae91f2efa5220cdda7276a0`
BLAKE2b-256	`cc51a7168537d80063e9f884382daf7109c7653e20d3d760c5cf84608ee8f81a`

See more details on using hashes here.

File details

Details for the file chatlas_embed-0.1.19-py3-none-any.whl.

File metadata

Download URL: chatlas_embed-0.1.19-py3-none-any.whl
Upload date: Aug 26, 2025
Size: 129.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.13

File hashes

Hashes for chatlas_embed-0.1.19-py3-none-any.whl
Algorithm	Hash digest
SHA256	`72ea1bf242452350c01842ee0e500be019879cb4dcc217e3c2e983d505ce6adf`
MD5	`f3906666879b0f9f3bc951c696c2fde1`
BLAKE2b-256	`cf6d0cd831f4545bd27aa0310095390bc0cf4b062d07dacf4ae35d4886260827`

See more details on using hashes here.

chATLAS_Embed 0.1.19

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

chATLAS Embed

📚 Overview

🌟 Key Features

🚀 Quick Start

📥 Installation

💡 Basic Usage - Connecting to an existing vectorstore

🚀 Getting Started Locally

Contents

Before starting, when using local DB instances

Usage

A. Creating a Vector Store

1. Initialize Components

2. Create and Populate the Vector Store

B. Querying an Existing Database

1. Connect to the Database

2. Search the Database

Simple Search

Advanced Search

C. Closing the database properly

D. Integration with LangChain

Notes:

C. Document Dataclass

🔍 Hybrid Retrieval

Search Types

Usage

Results

Advanced Features

Current Implementations

Vector Stores

PostgresParentChildVectorStore

Example Performance Metrics

Key Features

VectorStoreCreator and DocumentLoaders

DocumentLoaders

Embedding Models

Text Splitters

Custom Implementation Example: Scientific Plot Database

🔧 Development Status

📄 License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

C. `Document` Dataclass