inDox
Advanced Search and Retrieval-Augmented Generation
Official Website • Documentation • Discord
NEW: Subscribe to our mailing list for updates and news!
Indox Retrieval Augmentation is an innovative application designed to streamline information extraction from a wide range of document types, including text files, PDF, HTML, Markdown, and LaTeX. Whether structured or unstructured, Indox provides users with a powerful toolset to efficiently extract relevant data. One of its key features is the ability to intelligently cluster primary chunks to form more robust groupings, enhancing the quality and relevance of the extracted information. With a focus on adaptability and user-centric design, Indox aims to deliver future-ready functionality with more features planned for upcoming releases. Join us in exploring how Indox can revolutionize your document processing workflow, bringing clarity and organization to your data retrieval needs.
Dependency Requirements
Before running this project, ensure that you have the following installed:
- Python 3.8+: Required for running the Python backend.
- PostgreSQL: Needed if you wish to store your data in a PostgreSQL database.
- OpenAI API Key: Necessary if you are using the OpenAI embedding model.
- Hugging Face API Key: Necessary if you are using Hugging Face LLMs.
Ensure your system also meets these requirements:
- Access to environmental variables for handling sensitive information like API keys.
- Suitable hardware capable of supporting intensive computational tasks.
Installation
Getting Started
The following command installs the latest stable release of inDox:
pip install Indox
To install the latest development version, you may run
pip install git+https://github.com/osllmai/inDox@main
To configure the CLI, run
indox configure
Clone the repository and navigate to the directory:
git clone https://github.com/osllmai/inDox.git
cd inDox
Install the required Python packages:
pip install -r requirements.txt
Configuration
Environment Variables
Set your OPENAI_API_KEY or HF_API_KEY in your environment variables for secure access.
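For example, the keys can be read from the environment in Python. The load_api_key helper below is purely illustrative and not part of the Indox API:

```python
import os

def load_api_key(*names):
    """Return the value of the first environment variable in `names` that is set."""
    for name in names:
        value = os.environ.get(name)
        if value:
            return value
    return None

# Prefer the OpenAI key; fall back to the Hugging Face key.
api_key = load_api_key("OPENAI_API_KEY", "HF_API_KEY")
if api_key is None:
    print("Warning: set OPENAI_API_KEY or HF_API_KEY before running Indox.")
```

Reading keys from the environment keeps them out of source control and lets the same code run unchanged across machines.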
Database Setup
Ensure your PostgreSQL database is up and running, and accessible from your application. This is necessary if you plan to use pgvector as your vector store.
Alternatively, you can use Chroma or Faiss as your vector store. Make sure to specify your choice and the necessary configurations in the config.yaml file.
Usage
Preparing Your Data
- Define the File Path: Specify the path to your text or PDF file.
- Load Embedding Models: Initialize your embedding model from OpenAI's selection of pre-trained models.
Quick Start
Import Indox Package
Import the necessary classes from the Indox package.
from Indox import IndoxRetrievalAugmentation
Initialize Indox
Create an instance of IndoxRetrievalAugmentation.
Indox = IndoxRetrievalAugmentation()
Initial Configuration
- Configuration File: Ensure you locate and modify the Indox.config YAML file according to your needs before starting the application.
Dynamic Configuration Changes
For changes that need to be applied after the initial setup or during runtime:
- Modifying Configurations: Use the following Python snippet to update your settings dynamically:
Indox.config["your_setting_that_need_to_change"] = "new_setting"
Indox.update_config()
Configuration Details
Here's a breakdown of the config dictionary and its properties:
PostgreSQL
- conn_string: Your PostgreSQL database credentials.

Summary Model
- max_tokens: Maximum token count the summary model can generate.
- min_len: Minimum token count the summary model generates.
- model_name: Default is gpt-3.5-turbo-0125, but it can be replaced with any Hugging Face model supporting the summarization pipeline.
PostgreSQL Setup with pgvector
If you want to use PostgreSQL for vector storage, perform the following steps:

1. Install pgvector: To install pgvector on your PostgreSQL server, follow the detailed installation instructions available on the official pgvector GitHub repository: pgvector Installation Instructions
2. Add Vector Extension: Connect to your PostgreSQL database and execute the following SQL command to create the pgvector extension:

-- Connect to your database first:
--   psql -U username -d database_name
-- Then run inside your psql terminal
-- (replace the placeholders with your actual PostgreSQL credentials):
CREATE EXTENSION vector;
Additionally, for those interested in exploring other vector database options, you can consider using Chroma or Faiss. These provide alternative approaches to vector storage and retrieval that may better suit specific use cases or performance requirements.
Importing QA and Embedding Models
from Indox.QaModels import OpenAiQA
from Indox.Embeddings import OpenAiEmbedding
openai_qa = OpenAiQA(api_key=OPENAI_API_KEY,model="gpt-3.5-turbo-0125")
openai_embeddings = OpenAiEmbedding(model="text-embedding-3-small",openai_api_key=OPENAI_API_KEY)
Modifying Configuration Settings
To change a configuration setting, you can directly modify the
Indox.config
dictionary. Here is an example of how you can update a
configuration setting:
# Example of modifying a configuration setting
Indox.config["old_config"] = "new_config"
# Applying the updated configuration
Indox.update_config()
We take advantage of the unstructured
library to load
documents and split them into chunks by title. This method helps in
organizing the document into manageable sections for further
processing.
from Indox.DataLoaderSplitter import UnstructuredLoadAndSplit
docs_unstructured = UnstructuredLoadAndSplit(file_path=file_path)
Starting processing...
End Chunking process.
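The idea behind splitting by title can be pictured with a small sketch in plain Python. This is only an illustration of title-based chunking, not the unstructured library's actual implementation:

```python
def split_by_title(lines):
    """Group lines into chunks, starting a new chunk at each Markdown-style title."""
    chunks, current = [], []
    for line in lines:
        if line.startswith("#") and current:  # a title marks the start of a new section
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = ["# Intro", "Some text.", "# Methods", "Details here."]
print(split_by_title(doc))  # two chunks, one per titled section
```

Each chunk stays aligned with one logical section of the document, which keeps retrieved passages coherent.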
Storing document chunks in a vector store is crucial for enabling efficient retrieval and search operations. By converting text data into vector representations and storing them in a vector store, you can perform rapid similarity searches and other vector-based operations.
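As a toy illustration of what a similarity search over stored vectors does (not Indox's internals; the embeddings below are made up), consider:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "store": each chunk paired with a pretend embedding vector.
store = [
    ("chunk about databases", [0.9, 0.1, 0.0]),
    ("chunk about embeddings", [0.1, 0.9, 0.2]),
]
query_vec = [0.85, 0.15, 0.05]

# Retrieve the stored chunk whose vector is most similar to the query vector.
best = max(store, key=lambda item: cosine_similarity(query_vec, item[1]))
print(best[0])  # prints "chunk about databases"
```

Real vector stores such as pgvector, Chroma, and Faiss index the vectors so this nearest-neighbor lookup stays fast at scale.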
Indox.connect_to_vectorstore(collection_name="sample",embeddings=openai_embeddings)
Indox.store_in_vectorstore(chunks=docs_unstructured)
Querying
query = "your query!!??"
response_openai = Indox.answer_question(query=query,qa_model=openai_qa)
answer = response_openai[0]
context, score = response_openai[1]