Embedd (docs, pdfs, excels, csv etc) -> RAG -> Query with LLMs

Project description

embedd-all

embedd-all is a Python package designed to convert various document formats into a format that can be used to create an embedding vector using embedding models. The package extracts text from PDFs, summarizes data from Excel files, and now includes functionality to create RAG (Retrieval-Augmented Generation) for documents using Voyage AI or OpenAI embedding models and Pinecone vector database. It supports file formats including xlsx, csv, pdf, doc, and docx.

Features

Multi-format Support: Supports PDF, Excel (xlsx, csv), and Word (doc, docx) file processing.
PDF Processing: Extracts text from each page of a PDF and returns it as an array.
Excel Processing: Summarizes the data in each sheet by concatenating column names and their respective values, creating a new column df["summarized"]. If the Excel file contains multiple sheets, it processes each sheet and returns all summaries.
RAG Creation: Creates RAG for documents (all supported formats) using either Voyage AI or OpenAI embedding models and stores them in a Pinecone vector database.
Multiple Embedding Models: Supports both Voyage AI and OpenAI embedding models for flexible integration.

Installation

Install the package via pip:

pip install embedd-all

Usage

Import the package

from embedd_all.index import modify_excel_for_embedding, process_pdf, pinecone_embeddings_with_voyage_ai, rag_query

Example Usage

Processing an Excel File

The modify_excel_for_embedding function processes an Excel file, summarizes each row, and returns the summaries.

import pandas as pd
from embedd_all.embedd.index import modify_excel_for_embedding

if __name__ == '__main__':
    # Path to the Excel file
    file_path = '/path/to/your/data.xlsx'
    context = "data"

    # Process the Excel file
    dfs = modify_excel_for_embedding(file_path=file_path, context=context)

    # Display the summarized data from the second sheet (if exists)
    if len(dfs) > 1:
        logger.info(dfs[1].head(3))

Processing a PDF File

The process_pdf function extracts text from each page of a PDF file and returns it as an array.

from embedd_all.embedd.index import process_pdf

if __name__ == '__main__':
    # Path to the PDF file
    file_path = '/path/to/your/document.pdf'

    # Process the PDF file
    texts = process_pdf(file_path)

    # Display the processed text
    logger.info("Number of pages processed: ", len(texts))
    logger.info("Sample text from the first page: ", texts[0])

Creating RAG for Documents

The pinecone_embeddings_with_voyage_ai or pinecone_embeddings_with_openai function creates RAG for documents using your preferred embedding model and stores them in a Pinecone vector database. This function supports multiple file formats including xlsx, csv, pdf, doc, and docx.

from embedd_all.embedd.index import pinecone_embeddings_with_voyage_ai, pinecone_embeddings_with_openai

def create_rag_for_documents():
    paths = [
        '/Users/arnabbhattachargya/Desktop/flamingo_english_book.pdf',
        '/Users/arnabbhattachargya/Desktop/Data_Train.xlsx'
    ]
    vector_db_name = 'arnab-test'
    
    # Using Voyage AI
    voyage_embed_model = 'voyage-2'
    embed_dimension = 1024
    pinecone_embeddings_with_voyage_ai(paths, PINECONE_KEY, VOYAGE_API_KEY, vector_db_name, voyage_embed_model, embed_dimension)
    
    # Using OpenAI
    openai_embed_model = 'text-embedding-3-small'  # or 'text-embedding-3-large'
    embed_dimension = 1536  # 1536 for small, 3072 for large
    pinecone_embeddings_with_openai(paths, PINECONE_KEY, OPENAI_API_KEY, vector_db_name, openai_embed_model, embed_dimension)

if __name__ == '__main__':
    create_rag_for_documents()

Querying with RAG

The rag_query function performs context-based querying using RAG (Retrieval-Augmented Generation). You can use either Voyage AI or OpenAI embeddings for querying.

from embedd_all.embedd.index import rag_query

def execute_rag_query():
    CLAUDE_MODEL = "claude-3-5-sonnet-20240620"
    INDEX_NAME = 'arnab-test'
    TEMPERATURE = 0
    MAX_TOKENS = 4000
    QUERY = 'what all fuel types are there in cars?'
    SYSTEM_PROMPT = "You are a world-class document writer. Respond only with detailed descriptions and implementations. Use bullet points if necessary."
    
    # Using Voyage AI embeddings
    VOYAGE_EMBED_MODEL = 'voyage-2'
    resp = rag_query(
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS,
        anthropic_api_key=ANTHROPIC_API_KEY,
        claude_model=CLAUDE_MODEL,
        index_name=INDEX_NAME,
        pinecone_key=PINECONE_KEY,
        query=QUERY,
        system_prompt=SYSTEM_PROMPT,
        voyage_api_key=VOYAGE_API_KEY,
        voyage_embed_model=VOYAGE_EMBED_MODEL
    )
    
    # Using OpenAI embeddings
    OPENAI_EMBED_MODEL = 'text-embedding-3-small'
    resp = rag_query(
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS,
        anthropic_api_key=ANTHROPIC_API_KEY,
        claude_model=CLAUDE_MODEL,
        index_name=INDEX_NAME,
        pinecone_key=PINECONE_KEY,
        query=QUERY,
        system_prompt=SYSTEM_PROMPT,
        openai_api_key=OPENAI_API_KEY,
        openai_embed_model=OPENAI_EMBED_MODEL
    )

    for text_block in resp:
        print(text_block.text)

if __name__ == '__main__':
    execute_rag_query()

Functions

`modify_excel_for_embedding(file_path: str, context: str) -> list`

Processes an Excel file and summarizes the data in each sheet.

Parameters:
- file_path (str): Path to the Excel file.
- context (str): Additional context to be added to each summary.
Returns:
- list: A list of DataFrames, each containing the summarized data for each sheet.

`process_pdf(file_path: str) -> list`

Extracts text from each page of a PDF file.

Parameters:
- file_path (str): Path to the PDF file.
Returns:
- list: A list of strings, each representing the text extracted from a page.

`pinecone_embeddings_with_voyage_ai(paths: list, PINECONE_KEY: str, VOYAGE_API_KEY: str, vector_db_name: str, voyage_embed_model: str, embed_dimension: int)`

Creates RAG for documents using Voyage AI embedding models and stores them in a Pinecone vector database. Supports various document formats including xlsx, csv, pdf, doc, and docx.

Parameters:
- paths (list): List of paths to documents.
- PINECONE_KEY (str): Pinecone API key.
- VOYAGE_API_KEY (str): Voyage AI API key.
- vector_db_name (str): Name of the Pinecone vector database.
- voyage_embed_model (str): Name of the Voyage AI embedding model to use.
- embed_dimension (int): Dimension of the embedding vectors.

`pinecone_embeddings_with_openai(paths: list, PINECONE_KEY: str, OPENAI_API_KEY: str, vector_db_name: str, openai_embed_model: str, embed_dimension: int)`

Creates RAG for documents using OpenAI embedding models and stores them in a Pinecone vector database. Supports various document formats including xlsx, csv, pdf, doc, and docx.

Parameters:
- paths (list): List of paths to documents.
- PINECONE_KEY (str): Pinecone API key.
- OPENAI_API_KEY (str): OpenAI API key.
- vector_db_name (str): Name of the Pinecone vector database.
- openai_embed_model (str): Name of the OpenAI embedding model to use (e.g., 'text-embedding-3-small' or 'text-embedding-3-large').
- embed_dimension (int): Dimension of the embedding vectors (1536 for small, 3072 for large model).

`rag_query()`

Performs context-based querying using RAG (Retrieval-Augmented Generation).

Parameters:
- temperature (float): Sampling temperature.
- max_tokens (int): Maximum number of tokens in the response.
- anthropic_api_key (str): Anthropic API key.
- claude_model (str): Name of the Claude model to use.
- index_name (str): Name of the Pinecone index.
- pinecone_key (str): Pinecone API key.
- query (str): The query to perform.
- system_prompt (str): The system prompt for guiding the model's response.
- voyage_api_key (str): Voyage AI API key.
- voyage_embed_model (str): Name of the Voyage AI embedding model to use.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Contact

If you have any questions or suggestions, please open an issue or contact the maintainer.

Happy embedding with embedd-all!

Project details

Release history Release notifications | RSS feed

0.0.942

Apr 4, 2025

This version

0.0.941

Apr 4, 2025

0.0.940

Apr 4, 2025

0.0.939

Jan 13, 2025

0.0.938

Sep 5, 2024

0.0.937

Sep 5, 2024

0.0.936

Sep 5, 2024

0.0.935

Sep 4, 2024

0.0.934

Sep 3, 2024

0.0.933

Sep 3, 2024

0.0.932

Sep 3, 2024

0.0.931

Sep 3, 2024

0.0.929

Sep 3, 2024

0.0.928

Sep 3, 2024

0.0.927

Sep 3, 2024

0.0.926

Sep 3, 2024

0.0.925

Sep 3, 2024

0.0.924

Sep 2, 2024

0.0.923

Sep 1, 2024

0.0.922

Aug 27, 2024

0.0.921

Aug 26, 2024

0.0.93

Sep 3, 2024

0.0.92

Aug 26, 2024

0.0.91

Aug 3, 2024

0.0.9

Aug 1, 2024

0.0.8

Aug 1, 2024

0.0.8a0 pre-release

Aug 1, 2024

0.0.7rc1 pre-release

Aug 1, 2024

0.0.7rc0 pre-release

Aug 1, 2024

0.0.6

Aug 1, 2024

0.0.5

Jul 30, 2024

0.0.4

Jul 30, 2024

0.0.4a1 pre-release

Jul 30, 2024

0.0.3

Jul 29, 2024

0.0.2

Jul 29, 2024

0.0.1

Jul 28, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

embedd_all-0.0.941.tar.gz (15.7 kB view details)

Uploaded Apr 4, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

embedd_all-0.0.941-py3-none-any.whl (12.9 kB view details)

Uploaded Apr 4, 2025 Python 3

File details

Details for the file embedd_all-0.0.941.tar.gz.

File metadata

Download URL: embedd_all-0.0.941.tar.gz
Upload date: Apr 4, 2025
Size: 15.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for embedd_all-0.0.941.tar.gz
Algorithm	Hash digest
SHA256	`26aec0d53d05051c87dad7046582652ac5de0d055244654d185aa1ba16fc2c26`
MD5	`195b68bec32432278403098ddd369afb`
BLAKE2b-256	`e8faa77ccfbedddb12268fee11f638b9e6bd0677e112e62109c0970c9eba3fcc`

See more details on using hashes here.

File details

Details for the file embedd_all-0.0.941-py3-none-any.whl.

File metadata

Download URL: embedd_all-0.0.941-py3-none-any.whl
Upload date: Apr 4, 2025
Size: 12.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for embedd_all-0.0.941-py3-none-any.whl
Algorithm	Hash digest
SHA256	`294bc38330b0b298adede8bcfbef4fe8a61e474c684c2243306b091de85d2c1a`
MD5	`18e0ea12381f88d623e5b6d18e38bbd4`
BLAKE2b-256	`83095597e1289d31e4b77069aaeb4daeadd67852497507afae95c1e2fbb82f08`

See more details on using hashes here.

embedd-all 0.0.941

Navigation

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Project description

embedd-all

Features

Installation

Usage

Import the package

Example Usage

Processing an Excel File

Processing a PDF File

Creating RAG for Documents

Querying with RAG

Functions

modify_excel_for_embedding(file_path: str, context: str) -> list

process_pdf(file_path: str) -> list

pinecone_embeddings_with_voyage_ai(paths: list, PINECONE_KEY: str, VOYAGE_API_KEY: str, vector_db_name: str, voyage_embed_model: str, embed_dimension: int)

pinecone_embeddings_with_openai(paths: list, PINECONE_KEY: str, OPENAI_API_KEY: str, vector_db_name: str, openai_embed_model: str, embed_dimension: int)

rag_query()

License

Contributing

Contact

Project details

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`modify_excel_for_embedding(file_path: str, context: str) -> list`

`process_pdf(file_path: str) -> list`

`pinecone_embeddings_with_voyage_ai(paths: list, PINECONE_KEY: str, VOYAGE_API_KEY: str, vector_db_name: str, voyage_embed_model: str, embed_dimension: int)`

`pinecone_embeddings_with_openai(paths: list, PINECONE_KEY: str, OPENAI_API_KEY: str, vector_db_name: str, openai_embed_model: str, embed_dimension: int)`

`rag_query()`