TakoLlama

A Python library for building Retrieval-Augmented Generation (RAG) systems with Ollama and ChromaDB. TakoLlama simplifies the process of extracting text from various sources, creating vector embeddings, and querying them with large language models.

Features

  • Multi-format Text Extraction: Extract text from PDFs, HTML files, and web URLs
  • Vector Database Management: Built-in ChromaDB integration for efficient similarity search
  • Ollama Integration: Seamless integration with Ollama for embeddings and text generation
  • Web Crawling: Intelligent web crawling with relevance filtering
  • Flexible Data Processing: Configurable text chunking with overlap for better context preservation

Installation

pip install takollama

Prerequisites

  • Python 3.12+
  • Ollama installed and running
  • Required models pulled in Ollama (e.g., ollama pull llama3.1, ollama pull mxbai-embed-large)
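
Before running anything, you can sanity-check this setup. The sketch below uses the ollama Python client (already a TakoLlama dependency) to verify the server is reachable and pull any missing models; the exact fields on the list response may vary between client versions:

import ollama

# Models the Quick Start below assumes; adjust to taste.
required = ["llama3.1", "mxbai-embed-large"]

# ollama.list() raises if the Ollama server is not running.
available = [m.model for m in ollama.list().models]

for name in required:
    # Installed tags look like "llama3.1:latest", so match on the prefix.
    if not any(tag.startswith(name) for tag in available):
        print(f"Pulling {name} ...")
        ollama.pull(name)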

Quick Start

1. Basic RAG Pipeline

from takollama import RAG, VectorDB

# Initialize the RAG system
rag = RAG(
    chroma_db_dir="./my_db",
    chroma_db_name="my_collection",
    v_model="mxbai-embed-large"
)

# Load data from PDFs and HTML files
rag.vector_db.load_data(input_dir="./documents")

# Query the system
answer = rag.generate_answer("What is machine learning?", model="llama3.1")
print(answer)

2. Working with URLs

from takollama import VectorDB

# Initialize vector database
vdb = VectorDB("./web_db", "web_collection")

# Load data from URLs
vdb.load_data(urls_path="urls.txt")

# Query the database
results = vdb.query("How to install software?", k=5)
print(results)

3. Text Extraction Only

from takollama import TextExtractor

# Initialize text extractor
extractor = TextExtractor(
    input_dir="./documents",
    output_dir="./extracted_text",
    urls_file="urls.txt"
)

# Extract from PDFs
pdf_files = extractor.get_pdf()
for pdf in pdf_files:
    texts = extractor.extract_pdf_texts(pdf)
    print(f"Extracted {len(texts)} chunks from {pdf}")

# Extract from web URLs
urls = extractor.get_urls()
for url in urls:
    chunks = extractor.crawl_and_extract(url, max_depth=2)
    print(f"Extracted {len(chunks)} chunks from {url}")

Core Components

RAG Class

The main interface for retrieval-augmented generation:

  • generate_answer(query, k=4, model="llama3.1"): Generate answers using retrieved context
  • generate_prompt(question, context): Create prompts for the LLM
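
A minimal sketch of how these two methods fit together, reusing the constructor arguments from Quick Start; generate_prompt is useful when you want to inspect or customize the prompt before generation:

from takollama import RAG

rag = RAG(
    chroma_db_dir="./my_db",
    chroma_db_name="my_collection",
    v_model="mxbai-embed-large"
)

# One-call usage: retrieve the top 4 chunks and generate an answer.
answer = rag.generate_answer("What is a vector embedding?", k=4, model="llama3.1")

# Or retrieve context yourself and build the prompt explicitly.
context = rag.vector_db.query("What is a vector embedding?", k=4)
prompt = rag.generate_prompt("What is a vector embedding?", context)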

VectorDB Class

Manages the vector database operations:

  • load_data(input_dir, output_dir, urls_path): Load data from various sources
  • query(query_text, k=5): Search for similar documents
  • show_sources(): List all data sources in the database
  • delete_source(source): Remove documents from a specific source
  • count_docs(): Count the documents in the database (see Database Management below)
  • clear_database(): Remove all documents from the database
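
For example, to audit what is in the database and prune a stale source (a sketch: the source argument is assumed to be a path as reported by show_sources, and "./documents/old_manual.pdf" is a hypothetical example):

from takollama import VectorDB

vdb = VectorDB("./my_db", "my_collection")

# List every source currently indexed.
for source in vdb.show_sources():
    print(source)

# Remove all documents that came from one source.
vdb.delete_source("./documents/old_manual.pdf")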

TextExtractor Class

Handles text extraction from multiple formats:

  • extract_pdf_texts(pdf_path): Extract text from PDF files
  • extract_html_text(html_path): Extract text from HTML files
  • crawl_and_extract(url, max_depth=1): Crawl websites and extract content
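
Quick Start covers PDFs and crawling; here is a minimal sketch for local HTML files, assuming extract_html_text returns a list of text chunks like extract_pdf_texts does (files are listed with pathlib directly, since no HTML-discovery helper is documented above):

from pathlib import Path
from takollama import TextExtractor

extractor = TextExtractor(input_dir="./documents", output_dir="./extracted_text")

for html_file in Path("./documents").glob("*.html"):
    chunks = extractor.extract_html_text(str(html_file))
    print(f"{html_file.name}: {len(chunks)} chunks")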

Configuration

Create a config.yaml file for your project:

vector_db:
  input_dir: "./data/documents/"
  output_dir: "./data/processed/"
  urls_path: "./data/urls.txt"
  chroma_db_dir: "./data/vector_db/"
  chroma_db_name: "my_collection"
  model: "mxbai-embed-large"

ollama_model:
  model_name: "llama3.1"
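
TakoLlama does not document a built-in config loader, so a sketch like the following can wire the file into the API by hand. It assumes PyYAML (not in the requirements list below, so install it separately) and that the model key maps to the v_model constructor argument:

import yaml
from takollama import RAG

with open("config.yaml") as fh:
    cfg = yaml.safe_load(fh)

db = cfg["vector_db"]
rag = RAG(
    chroma_db_dir=db["chroma_db_dir"],
    chroma_db_name=db["chroma_db_name"],
    v_model=db["model"],  # assumption: "model" maps to v_model
)
rag.vector_db.load_data(
    input_dir=db["input_dir"],
    output_dir=db["output_dir"],
    urls_path=db["urls_path"],
)
answer = rag.generate_answer("Your question here", model=cfg["ollama_model"]["model_name"])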

Advanced Usage

Custom Text Chunking

# Extract with custom chunk size and overlap
extractor = TextExtractor("./docs", "./output")
chunks = extractor.extract_html_text(
    "document.html", 
    chars_per_file=1000, 
    overlap=200
)

Web Crawling with Depth Control

# Crawl website with custom parameters
chunks = extractor.crawl_and_extract(
    "https://example.com",
    chunk_size=800,
    overlap_size=100,
    max_depth=3
)

Database Management

# Check database status
vdb = VectorDB("./db", "collection")
print(f"Documents in database: {vdb.count_docs()}")
print(f"Available sources: {vdb.show_sources()}")

# Clear database
vdb.clear_database()

Supported Models

Embedding Models

  • mxbai-embed-large (recommended)
  • nomic-embed-text

Language Models

Any language model supported by Ollama works, for example:

  • gpt-oss:20b
  • deepseek-r1:8b
  • gemma3:12b
  • llama3.1
  • llama3.2:3b
  • llama3.3:70b
  • phi4:14b

Examples

The package includes various example scripts in takollama.scripts:

  • RAG_query.py: Command-line RAG querying (available as takollama-query)
  • extract_text.py: Text extraction utility (available as takollama-extract)
  • create_rag_pipeline.py: Pipeline creation examples
  • createDBlocal.py: Local database creation
  • createDBColab.py: Colab-specific database setup

The notebooks/ directory contains Jupyter notebooks with detailed examples:

  • DNALinux_RAG_UV.ipynb: Complete RAG workflow
  • Various demonstration notebooks for different use cases

You can access these scripts after installation:

# Access script utilities programmatically
from takollama.scripts import RAG_query, extract_text

Command Line Tools

After installation, TakoLlama provides command-line tools for common tasks:

Query RAG Database

takollama-query \
  --e_model mxbai-embed-large \
  --LLM_model llama3.1 \
  --db_dir ./my_db \
  --db_name my_collection \
  --query "Your question here" \
  --k 4

Extract Text from Documents

# Extract from PDFs and HTML files
takollama-extract \
  --input_dir ./documents \
  --output_dir ./extracted_text \
  --process_pdfs \
  --process_html

# Extract from URLs
takollama-extract \
  --output_dir ./web_content \
  --urls_file urls.txt \
  --process_urls \
  --max_depth 2

# Extract from all sources
takollama-extract \
  --input_dir ./documents \
  --output_dir ./all_content \
  --urls_file urls.txt \
  --process_pdfs \
  --process_html \
  --process_urls \
  --chunk_size 800 \
  --overlap 150

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under GPL-3.0, as specified in the LICENSE file.

Requirements

  • bs4>=0.0.2
  • chromadb>=1.0.17
  • langchain-community>=0.3.27
  • ollama>=0.5.3
  • pypdf>=6.0.0

Support

For issues and questions, please use the GitHub issue tracker.
