# TakoLlama
A Python library for building Retrieval-Augmented Generation (RAG) systems with Ollama and ChromaDB. TakoLlama simplifies the process of extracting text from various sources, creating vector embeddings, and querying them with large language models.
## Features
- Multi-format Text Extraction: Extract text from PDFs, HTML files, and web URLs
- Vector Database Management: Built-in ChromaDB integration for efficient similarity search
- Ollama Integration: Seamless integration with Ollama for embeddings and text generation
- Web Crawling: Intelligent web crawling with relevance filtering
- Flexible Data Processing: Configurable text chunking with overlap for better context preservation
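The similarity search behind retrieval compares embedding vectors, most commonly by cosine similarity. As a rough illustration of the idea (not TakoLlama's internals), in plain Python:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score ~1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
```

In practice the embeddings come from the configured Ollama model, and ChromaDB performs this comparison at scale.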
## Installation

```bash
pip install takollama
```
## Prerequisites

- Python 3.12+
- Ollama installed and running
- Required models pulled in Ollama (e.g., `ollama pull llama3.1`, `ollama pull mxbai-embed-large`)
## Quick Start

### 1. Basic RAG Pipeline
```python
from takollama import RAG

# Initialize the RAG system
rag = RAG(
    chroma_db_dir="./my_db",
    chroma_db_name="my_collection",
    v_model="mxbai-embed-large"
)

# Load data from PDFs and HTML files
rag.vector_db.load_data(input_dir="./documents")

# Query the system
answer = rag.generate_answer("What is machine learning?", model="llama3.1")
print(answer)
```
### 2. Working with URLs
```python
from takollama import VectorDB

# Initialize vector database
vdb = VectorDB("./web_db", "web_collection")

# Load data from URLs
vdb.load_data(urls_path="urls.txt")

# Query the database
results = vdb.query("How to install software?", k=5)
print(results)
```
### 3. Text Extraction Only
```python
from takollama import TextExtractor

# Initialize text extractor
extractor = TextExtractor(
    input_dir="./documents",
    output_dir="./extracted_text",
    urls_file="urls.txt"
)

# Extract from PDFs
pdf_files = extractor.get_pdf()
for pdf in pdf_files:
    texts = extractor.extract_pdf_texts(pdf)
    print(f"Extracted {len(texts)} chunks from {pdf}")

# Extract from web URLs
urls = extractor.get_urls()
for url in urls:
    chunks = extractor.crawl_and_extract(url, max_depth=2)
    print(f"Extracted {len(chunks)} chunks from {url}")
```
## Core Components

### RAG Class

The main interface for retrieval-augmented generation:

- `generate_answer(query, k=4, model="llama3.1")`: Generate answers using retrieved context
- `generate_prompt(question, context)`: Create prompts for the LLM
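The exact template `generate_prompt` uses is internal to the library, but a retrieval-augmented prompt generally stitches the retrieved chunks in front of the question. A hypothetical sketch of the pattern (the function name and wording here are illustrative, not TakoLlama's API):

```python
def build_rag_prompt(question, context_chunks):
    """Join retrieved chunks into a context block, then append the question."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What is machine learning?",
    ["ML is a subfield of AI.", "Models learn patterns from data."],
)
print(prompt)
```

Grounding the model in retrieved context this way is what lets a general-purpose LLM answer questions about your own documents.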
### VectorDB Class

Manages the vector database operations:

- `load_data(input_dir, output_dir, urls_path)`: Load data from various sources
- `query(query_text, k=5)`: Search for similar documents
- `show_sources()`: List all data sources in the database
- `delete_source(source)`: Remove documents from a specific source

### TextExtractor Class

Handles text extraction from multiple formats:

- `extract_pdf_texts(pdf_path)`: Extract text from PDF files
- `extract_html_text(html_path)`: Extract text from HTML files
- `crawl_and_extract(url, max_depth=1)`: Crawl websites and extract content
## Configuration

Create a `config.yaml` file for your project:

```yaml
vector_db:
  input_dir: "./data/documents/"
  output_dir: "./data/processed/"
  urls_path: "./data/urls.txt"
  chroma_db_dir: "./data/vector_db/"
  chroma_db_name: "my_collection"
  model: "mxbai-embed-large"

ollama_model:
  model_name: "llama3.1"
```
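One way to consume this file is to parse it yourself and forward the values to the constructors shown in Quick Start. A minimal sketch, assuming PyYAML is installed (an inline string stands in for the file here; in practice you would read `config.yaml` with `open()`):

```python
import yaml  # PyYAML

config_text = """
vector_db:
  chroma_db_dir: "./data/vector_db/"
  chroma_db_name: "my_collection"
  model: "mxbai-embed-large"
ollama_model:
  model_name: "llama3.1"
"""

cfg = yaml.safe_load(config_text)
db = cfg["vector_db"]

# The parsed values map onto the Quick Start constructor, e.g.:
# rag = RAG(chroma_db_dir=db["chroma_db_dir"],
#           chroma_db_name=db["chroma_db_name"],
#           v_model=db["model"])
print(db["chroma_db_name"])  # my_collection
```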
## Advanced Usage

### Custom Text Chunking

```python
# Extract with custom chunk size and overlap
extractor = TextExtractor("./docs", "./output")
chunks = extractor.extract_html_text(
    "document.html",
    chars_per_file=1000,
    overlap=200
)
```
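The `chars_per_file` and `overlap` parameters follow a common sliding-window scheme: each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk. The library's exact splitting logic may differ, but the idea looks like:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of chunk_size characters, where each chunk
    starts overlap characters before the previous one ended."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Larger overlaps preserve more context across chunk boundaries at the cost of storing (and embedding) more redundant text.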
### Web Crawling with Depth Control

```python
# Crawl website with custom parameters
chunks = extractor.crawl_and_extract(
    "https://example.com",
    chunk_size=800,
    overlap_size=100,
    max_depth=3
)
```
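`max_depth` bounds how many link hops the crawler follows from the start page: depth 0 is the page itself, depth 1 its direct links, and so on. A breadth-first sketch of that behavior, using an in-memory stand-in for HTTP fetching rather than TakoLlama's actual crawler:

```python
from collections import deque

def crawl(start_url, get_links, max_depth=1):
    """Breadth-first crawl: visit the start page (depth 0) and follow
    links up to max_depth hops away. Returns URLs in visit order."""
    visited, order = {start_url}, []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth < max_depth:
            for link in get_links(url):
                if link not in visited:
                    visited.add(link)
                    queue.append((link, depth + 1))
    return order

# A tiny in-memory "site": page -> links found on that page.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": []}
print(crawl("/", lambda u: site.get(u, []), max_depth=1))  # ['/', '/a', '/b']
print(crawl("/", lambda u: site.get(u, []), max_depth=2))  # ['/', '/a', '/b', '/c']
```

A real crawler adds the relevance filtering mentioned in Features, so not every discovered link is followed.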
### Database Management

```python
# Check database status
vdb = VectorDB("./db", "collection")
print(f"Documents in database: {vdb.count_docs()}")
print(f"Available sources: {vdb.show_sources()}")

# Clear database
vdb.clear_database()
```
## Supported Models

### Embedding Models

- `mxbai-embed-large` (recommended)
- `nomic-embed-text`

### Language Models

Any model supported by Ollama, for example:

- `gpt-oss:20b`
- `deepseek-r1:8b`
- `gemma3:12b`
- `llama3.1`
- `llama3.2:3b`
- `llama3.3:70b`
- `phi4:14b`
## Examples

The package includes various example scripts in `takollama.scripts`:

- `RAG_query.py`: Command-line RAG querying (available as `takollama-query`)
- `extract_text.py`: Text extraction utility (available as `takollama-extract`)
- `create_rag_pipeline.py`: Pipeline creation examples
- `createDBlocal.py`: Local database creation
- `createDBColab.py`: Colab-specific database setup

The `notebooks/` directory contains Jupyter notebooks with detailed examples:

- `DNALinux_RAG_UV.ipynb`: Complete RAG workflow
- Various demonstration notebooks for different use cases

You can access these scripts after installation:

```python
# Access script utilities programmatically
from takollama.scripts import RAG_query, extract_text
```
## Command Line Tools

After installation, TakoLlama provides command-line tools for common tasks:

### Query RAG Database

```bash
takollama-query \
  --e_model mxbai-embed-large \
  --LLM_model llama3.1 \
  --db_dir ./my_db \
  --db_name my_collection \
  --query "Your question here" \
  --k 4
```
### Extract Text from Documents

```bash
# Extract from PDFs and HTML files
takollama-extract \
  --input_dir ./documents \
  --output_dir ./extracted_text \
  --process_pdfs \
  --process_html

# Extract from URLs
takollama-extract \
  --output_dir ./web_content \
  --urls_file urls.txt \
  --process_urls \
  --max_depth 2

# Extract from all sources
takollama-extract \
  --input_dir ./documents \
  --output_dir ./all_content \
  --urls_file urls.txt \
  --process_pdfs \
  --process_html \
  --process_urls \
  --chunk_size 800 \
  --overlap 150
```
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the GPL 3.0 as specified in the LICENSE file.

## Requirements

- `bs4>=0.0.2`
- `chromadb>=1.0.17`
- `langchain-community>=0.3.27`
- `ollama>=0.5.3`
- `pypdf>=6.0.0`

## Support

For issues and questions, please use the GitHub issue tracker.