This repository contains a project that implements a Retrieval-Augmented Generation (RAG) system using the LLaMA3 model. The project focuses on creating embeddings for instructions of a professional bioinformatic software to help users conduct biology research.

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: GNU General Public License (GPL)
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Software Development :: Build Tools

Project description

RAG-LLaMA3 AI Project

Directory Structure

ai/
├── data/
│   ├── raw/
│   │   ├── pdfs/
│   │   ├── htmls/
│   │   └── ...
│   ├── processed/
│   │   ├── texts/
│   │   ├── embeddings/
│   │   └── ...
│   └── example_data/
├── notebooks/
│   ├── 01_data_extraction.ipynb
│   ├── 02_data_preprocessing.ipynb
│   ├── 03_embedding_generation.ipynb
│   └── 04_rag_pipeline.ipynb
├── scripts/
│   ├── extract_text.py
│   ├── preprocess_text.py
│   ├── generate_embeddings.py
│   ├── create_rag_pipeline.py
│   └── ...
├── src/
│ ├── init.py
│ ├── main.py
│ ├── VectorDB.py
| ├── config_template.yaml
│ ├── TextExtractor.py
│ └── RAG.py
├── tests/
│   ├── test_extract.py
│   ├── test_preprocess.py
│   ├── test_generate_embeddings.py
│   ├── test_rag_pipeline.py
│   └── ...
├── requirements.txt
├── setup.py
├── README.md
├── LICENSE
├── .gitignore
└── config.yaml

Getting Started

Prerequisites

Python 3.10.12
PyTorch
CUDA (if using GPU)
LLaMA3

Installation

Clone the repository:

git clone git@github.com:DNALinux/ai.git
cd ai

Create the Environment from 'environment.yml':
```
conda env create -f environment.yml
```

Configuration

Edit the config_template.yaml file under src directory to set your project-specific parameters and paths.

Project Workflow

Data Extraction: Extract text from raw data sources.
Data Preprocessing: Preprocess the extracted text. We provide functions for preprocessing text but did not use them when creating vector embeddings.
Embedding Generation: Generate embeddings for the preprocessed text using LLaMA3 and store them in a Chroma database.
RAG Pipeline: Set up and run the RAG pipeline.

How to Use

There is a Instruction.ipynb you can use to test the code.

Configuration:
- Open ~/ai/src/config_template.yaml to set up the directory paths for storing your data and Chroma database. You do not need to create them manually; just specify where they should be, and they will be automatically created:
  - input_dir: Directory for PDF, HTML files, and URLs.
  - urls_path: Path to a file named urls.txt where you put all the urls.
  - output_dir: Directory where TextExtractor will store all .txt files (for debugging purposes).
  - chroma_db_dir: Directory where your Chroma database will be stored.
  - chroma_db_name: Collection name for your Chroma database.
- Note: The embedding model defaults to 'mxbai-embed-large'. Feel free to choose your preferred Ollama embedding model.
```
from src import RAG as rag
from src import TextExtractor as te
from src import VectorDB as vdb
# Load configuration, please insert the path to your configuration file.
config = rag.load_config('/home/tagore/repos/ai/src/config_template.yaml')

# Extract configuration values
vector_db_config = config.get('vector_db')
input_dir = vector_db_config.get('input_dir')
output_dir = vector_db_config.get('output_dir')
urls_path = vector_db_config.get('urls_path')
chroma_db_dir = vector_db_config.get('chroma_db_dir')
chroma_db_name = vector_db_config.get('chroma_db_name')
model = vector_db_config.get('model')
```
Directory Setup:
- Open a Jupyter notebook and run the following code to ensure that your directory is created:
```
test_vector_db = vdb.VectorDB(input_dir, output_dir, urls_path, chroma_db_dir, chroma_db_name)
```
- This will create an object that you can use to manipulate your Chroma vector database. It will automatically create all the directories and an empty Chroma database. If everything is already created, it will not overwrite existing files.
Add Files:
- Place all PDF and HTML files in the input directory. List all URLs in the urls.txt file, each on a new line.
Load Data:
- Use the test_vector_db object to load files into the vector database:
```
test_vector_db.load_data()
```
- Alternatively, load different types of files individually:
```
test_vector_db.load_url()
test_vector_db.load_pdf()
test_vector_db.load_html()
```
- Loading might take some time. After loading, check if the vector database has been populated successfully:
```
test_vector_db.peek()
test_vector_db.show_sources()
```
- Query data from a specific source:
```
test_vector_db.query_sources(source_name)
```
- To delete data from a source:
```
test_vector_db.delete_source(source_name)
```
- Or to clear the entire database (be cautious as this is destructive):
```
test_vector_db.clear_database()
```

Generate Answers:

To get your RAG version answer from the command line, navigate to the src path and type:
```
python3 main.py "your question goes here"
```

Or in a Jupyter notebook, use:

testRAG = rag.RAG(input, output, chroma_db, collection_name)
print(testRAG.generate_answer("Your question goes here"))

Other Class and Function Introduction

TextExtractor Class

The TextExtractor class is designed to handle text extraction from various sources such as PDF files, HTML files, and URLs. This class facilitates the transformation of raw text data into a format suitable for creating vector embeddings and RAG (Retrieval-Augmented Generation) systems. It also includes functionality for saving extracted text to files for debugging purposes.

Key Functionalities

get_pdf: Returns a list of PDF file paths from the input directory.
get_html: Returns a list of HTML file paths from the input directory.
get_urls: Retrieves URLs from a text file within the input directory.
save_text_to_file: Saves a given text to a specified file for debugging purposes.
save_a_pdf_text: Extracts text from a PDF file and saves each page's text to separate files. It ensures limited overlap between pages and avoids saving pages with only one line of text.
save_pdfs_texts: Extracts text from multiple PDF files and saves each page's text to separate files.
extract_pdf_texts: Extracts text from a PDF file and returns it as a list, ensuring limited overlap between pages and avoiding pages with only one line of text.
save_a_html_text: Extracts text from an HTML file and saves it to multiple files with specified character limits and overlap.
save_html_texts: Extracts text from multiple HTML files and saves them to multiple files.
extract_html_text: Extracts text from an HTML file and returns it as a list of text chunks with specified character limits and overlap.
is_relevant_link: Determines if a link is relevant based on specific heuristics.
extract_url_text: Extracts headings and paragraphs from a URL and returns the text along with relevant links.
save_chunks_to_files: Saves text chunks to files in the specified directory.
split_text_into_chunks: Splits text into chunks with specified overlap.
get_main_domain: Extracts the main domain from a URL.
crawl_and_save: Crawls a website recursively, extracts text, and saves it to files.
crawl_and_extract: Crawls a website recursively, extracts text, and returns it as a list of text chunks.

from src import TextExtractor as te

# Initialize TextExtractor with input and output directories
text_extractor = te.TextExtractor(input_dir='path/to/input', output_dir='path/to/output', urls_file = 'path/to/urls.txt')

# Get PDF File Paths
pdf_files = text_extractor.get_pdf()
print(pdf_files)  # List of PDF file paths

#Extract and Save PDF Texts to Files
text_extractor.save_pdfs_texts(pdf_files) 

#Extract PDF Texts to a List
pdf_texts = text_extractor.extract_pdf_texts('path/to/sample.pdf')
print(pdf_texts)  # List of extracted text chunks from the PDF

#The same process applies for HTML and URL

VectorDB Class

The VectorDB class manages a Chroma vector database, allowing for the extraction, processing, and storage of text data from various sources, including PDFs, HTML files, and URLs. This class integrates with the TextExtractor class for text extraction and uses the Ollama model for generating vector embeddings.

Key Functionalities

__init__: Initializes the VectorDB instance, sets up directories, and loads or creates the Chroma vector database.
_load_or_create_vector_db: Loads an existing Chroma vector database or creates a new one.
load_data: Loads and processes data from URLs, PDFs, and HTML files.
_get_embeddings: Gets embeddings for a list of texts from Ollama.
_process_and_add: Processes texts and adds them to the vector database.
load_url: Loads and processes data from URLs.
load_pdf: Loads and processes data from PDFs.
load_html: Loads and processes data from HTML files.
peek: Retrieves the first 10 documents from the vector database.
delete_source: Deletes all documents from a specific source.
show_sources: Shows all sources in the vector database.
query_sources: Queries the vector database with a specific source.
query: Queries the vector database with a text prompt.
clear_database: Clears the vector database.

`RAG` Class Introduction

The RAG (Retrieval-Augmented Generation) class provides a comprehensive solution for integrating a vector database with the LLaMA3 model to generate and stream answers based on user queries. It includes functionalities for setting up the vector database, generating prompts, and creating answers with both regular and streaming outputs.

Key Functionalities

Initialization:
- __init__: Initializes the RAG class, setting up the vector database with specified directories and a default embedding model.
- _setup_vector_db: Checks if the vector database exists and populates it with data if needed.
- _is_database_populated: Verifies if the vector database contains data.
Prompt Generation:
- generate_prompt: Creates a prompt for querying the model, incorporating the user's question and the retrieved context.
Answer Generation:
- generate_answer: Generates an answer using the vector database and the LLaMA3 model.
- stream_answer: Generates an answer with streaming output, providing real-time response chunks.

`preprocess_text`

The preprocess_text function performs a series of text preprocessing steps to clean and standardize raw text. It includes:

Lowercasing: Converts all text to lowercase to ensure uniformity.
Removing Non-Alphanumeric Characters: Strips out any characters that are not letters, numbers, or whitespace.
Tokenization: Splits the text into individual words or tokens.
Stopwords Removal: Filters out common words that may not contribute meaningful information to the analysis.
Lemmatization: Reduces words to their base or root form.

This function outputs a preprocessed text string that is ready for further analysis or processing.

License

This project is licensed under the GNU GENERAL PUBLIC LICENSE. See the LICENSE file for details.

Project details

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: GNU General Public License (GPL)
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Software Development :: Build Tools

Release history Release notifications | RSS feed

0.1.4.1

Aug 17, 2024

0.1.4

Aug 17, 2024

0.1.3

Aug 17, 2024

0.1.2

Aug 17, 2024

0.1.2.dev4 pre-release

Aug 17, 2024

0.1.2.dev3 pre-release

Aug 17, 2024

0.1.2.dev2 pre-release

Aug 17, 2024

0.1.2.dev1 pre-release

Aug 17, 2024

0.1.1

Aug 17, 2024

This version

0.1.0

Aug 13, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rag_llama3-0.1.0.tar.gz (18.1 kB view details)

Uploaded Aug 13, 2024 Source

Built Distribution

rag_llama3-0.1.0-py3-none-any.whl (17.8 kB view details)

Uploaded Aug 13, 2024 Python 3

File details

Details for the file rag_llama3-0.1.0.tar.gz.

File metadata

Download URL: rag_llama3-0.1.0.tar.gz
Upload date: Aug 13, 2024
Size: 18.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for rag_llama3-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`96b15cb36d5980c2c8025d0f746a330e0b7345778da4fae9c79daa72426d895b`
MD5	`5839c97d3b29bbcdfdaf1a132885afb1`
BLAKE2b-256	`6ae1ddbdc3659710d38de330cb9ada669e617b1d5ebbad6483e2c8d57dc0146f`

See more details on using hashes here.

File details

Details for the file rag_llama3-0.1.0-py3-none-any.whl.

File metadata

Download URL: rag_llama3-0.1.0-py3-none-any.whl
Upload date: Aug 13, 2024
Size: 17.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for rag_llama3-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d513c1a5562627c6f68747d7f23f539d701540a8f9a3b9eaee797bb8fad0210e`
MD5	`e8281ffba1dcd11b58b5295864d45e4a`
BLAKE2b-256	`6cfd462ec18b8d92035e00fc3975cf550b0b953261fcfaf5ffb5c09bce16c789`