
TakoLlama

A Python library for building Retrieval-Augmented Generation (RAG) systems with Ollama and ChromaDB. TakoLlama simplifies the process of extracting text from various sources, creating vector embeddings, and querying them with large language models.

Features

  • Multi-format Text Extraction: Extract text from PDFs, HTML files, and web URLs
  • Vector Database Management: Built-in ChromaDB integration for efficient similarity search
  • Ollama Integration: Seamless integration with Ollama for embeddings and text generation
  • Web Crawling: Intelligent web crawling with relevance filtering
  • Flexible Data Processing: Configurable text chunking with overlap for better context preservation
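The chunking-with-overlap idea above can be sketched in a few lines. This is an illustrative stand-alone function, not TakoLlama's internal implementation; the name `chunk_text` and its defaults are hypothetical:

```python
def chunk_text(text, chunk_size=800, overlap=150):
    """Split text into fixed-size chunks whose ends overlap, so content
    that straddles a chunk boundary appears in both neighboring chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
# Each chunk starts 2 characters after the previous one:
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Overlap trades storage for context: a sentence split across two chunks is still retrievable whole from at least one of them.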

Installation

pip install takollama

Prerequisites

  • Python 3.12+
  • Ollama installed and running
  • Required models pulled in Ollama (e.g., ollama pull llama3.1, ollama pull mxbai-embed-large)

Quick Start

1. Basic RAG Pipeline

from takollama import RAG, VectorDB

# Initialize the RAG system
rag = RAG(
    chroma_db_dir="./my_db",
    chroma_db_name="my_collection",
    v_model="mxbai-embed-large"
)

# Load data from PDFs and HTML files
rag.vector_db.load_data(input_dir="./documents")

# Query the system
answer = rag.generate_answer("What is machine learning?", model="llama3.1")
print(answer)

2. Working with URLs

from takollama import VectorDB

# Initialize vector database
vdb = VectorDB("./web_db", "web_collection")

# Load data from URLs
vdb.load_data(urls_path="urls.txt")

# Query the database
results = vdb.query("How to install software?", k=5)
print(results)

3. Text Extraction Only

from takollama import TextExtractor

# Initialize text extractor
extractor = TextExtractor(
    input_dir="./documents",
    output_dir="./extracted_text",
    urls_file="urls.txt"
)

# Extract from PDFs
pdf_files = extractor.get_pdf()
for pdf in pdf_files:
    texts = extractor.extract_pdf_texts(pdf)
    print(f"Extracted {len(texts)} chunks from {pdf}")

# Extract from web URLs
urls = extractor.get_urls()
for url in urls:
    chunks = extractor.crawl_and_extract(url, max_depth=2)
    print(f"Extracted {len(chunks)} chunks from {url}")

Core Components

RAG Class

The main interface for retrieval-augmented generation:

  • generate_answer(query, k=4, model="llama3.1"): Generate answers using retrieved context
  • generate_prompt(question, context): Create prompts for the LLM
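Conceptually, `generate_prompt` stitches the retrieved chunks and the user's question into a single LLM prompt. A minimal sketch of such a prompt builder (the template wording and the name `build_prompt` are illustrative, not TakoLlama's actual format):

```python
def build_prompt(question, context_chunks):
    """Assemble a RAG prompt: retrieved context first, then the question.
    Illustrative template only; the library's real wording may differ."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is machine learning?",
    ["Machine learning builds models from data.", "It improves with examples."],
)
```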

VectorDB Class

Manages the vector database operations:

  • load_data(input_dir, output_dir, urls_path): Load data from various sources
  • query(query_text, k=5): Search for similar documents
  • show_sources(): List all data sources in the database
  • delete_source(source): Remove documents from a specific source
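`show_sources` and `delete_source` rely on each stored document carrying a `source` metadata field, which is how ChromaDB-style stores support filtered deletion. The toy in-memory class below illustrates that bookkeeping; it is a sketch of the idea, not TakoLlama's implementation:

```python
class ToyVectorStore:
    """In-memory stand-in: each document carries a 'source' metadata
    field, so whole sources can be listed or deleted at once (vector
    databases like ChromaDB expose this via metadata filters)."""

    def __init__(self):
        self.docs = []  # list of (text, metadata) pairs

    def add(self, text, source):
        self.docs.append((text, {"source": source}))

    def show_sources(self):
        return sorted({meta["source"] for _, meta in self.docs})

    def delete_source(self, source):
        self.docs = [d for d in self.docs if d[1]["source"] != source]

store = ToyVectorStore()
store.add("chunk A", "manual.pdf")
store.add("chunk B", "manual.pdf")
store.add("chunk C", "https://example.com")
store.delete_source("manual.pdf")
# store.show_sources() → ['https://example.com']
```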

TextExtractor Class

Handles text extraction from multiple formats:

  • extract_pdf_texts(pdf_path): Extract text from PDF files
  • extract_html_text(html_path): Extract text from HTML files
  • crawl_and_extract(url, max_depth=1): Crawl websites and extract content
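The `max_depth` parameter bounds how many link hops the crawler follows from the start URL. The breadth-first sketch below shows those semantics with a dict standing in for fetching pages and parsing their links; it is illustrative, not the library's crawler:

```python
from collections import deque

def crawl(start, links, max_depth=1):
    """Breadth-first crawl up to max_depth hops from the start page.
    'links' maps each URL to the URLs it links to (a stand-in for
    fetching a page and extracting its anchors)."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth < max_depth:
            for nxt in links.get(url, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return order

site = {"/": ["/docs", "/about"], "/docs": ["/docs/install"]}
crawl("/", site, max_depth=1)   # visits '/', '/docs', '/about'
crawl("/", site, max_depth=2)   # also reaches '/docs/install'
```

The `seen` set prevents revisiting pages that multiple pages link to, which matters on real sites with cyclic link structures.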

Configuration

Create a config.yaml file for your project:

vector_db:
  input_dir: "./data/documents/"
  output_dir: "./data/processed/"
  urls_path: "./data/urls.txt"
  chroma_db_dir: "./data/vector_db/"
  chroma_db_name: 'my_collection'
  model: "mxbai-embed-large"

ollama_model:
  model_name: "llama3.1"
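To wire such a config into your own code, you would typically parse it and pass the `vector_db` values to the constructors shown in Quick Start. The dependency-free parser below handles only this flat two-level shape and is a sketch: `partition` splits on the first `:`, so values containing a colon (e.g. URLs) would be mishandled, and a real project should use `yaml.safe_load` from PyYAML instead:

```python
def parse_flat_config(text):
    """Parse a flat, two-level YAML-like config using only the standard
    library. Sketch only; use PyYAML's yaml.safe_load in practice."""
    config, section = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        if line[0] in " \t":               # indented → key inside a section
            config[section][key.strip()] = value.strip().strip("'\"")
        else:                               # flush left → new section header
            section = key.strip()
            config[section] = {}
    return config

sample = 'vector_db:\n  model: "mxbai-embed-large"\n'
cfg = parse_flat_config(sample)
# cfg["vector_db"]["model"] == "mxbai-embed-large"
```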

Advanced Usage

Custom Text Chunking

# Extract with custom chunk size and overlap
extractor = TextExtractor("./docs", "./output")
chunks = extractor.extract_html_text(
    "document.html", 
    chars_per_file=1000, 
    overlap=200
)

Web Crawling with Depth Control

# Crawl website with custom parameters
chunks = extractor.crawl_and_extract(
    "https://example.com",
    chunk_size=800,
    overlap_size=100,
    max_depth=3
)

Database Management

# Check database status
vdb = VectorDB("./db", "collection")
print(f"Documents in database: {vdb.count_docs()}")
print(f"Available sources: {vdb.show_sources()}")

# Clear database
vdb.clear_database()

Supported Models

Embedding Models

  • mxbai-embed-large (recommended)
  • nomic-embed-text
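An embedding model maps text to a vector; retrieval then ranks stored chunks by vector similarity, typically cosine similarity. A minimal sketch of that ranking step (the two-dimensional vectors below are toy values, not real embeddings, which have hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0]
docs = {"doc_a": [0.9, 0.1], "doc_b": [0.0, 1.0]}
best = max(docs, key=lambda name: cosine(query, docs[name]))
# best == "doc_a": its vector points almost the same way as the query
```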

Language Models

Any model supported by Ollama, for example:

  • gpt-oss:20b
  • deepseek-r1:8b
  • gemma3:12b
  • llama3.1
  • llama3.2:3b
  • llama3.3:70b
  • phi4:14b

Examples

The package includes various example scripts in takollama.scripts:

  • RAG_query.py: Command-line RAG querying (available as takollama-query)
  • extract_text.py: Text extraction utility (available as takollama-extract)
  • create_rag_pipeline.py: Pipeline creation examples
  • createDBlocal.py: Local database creation
  • createDBColab.py: Colab-specific database setup

The notebooks/ directory contains Jupyter notebooks with detailed examples:

  • DNALinux_RAG_UV.ipynb: Complete RAG workflow
  • Various demonstration notebooks for different use cases

You can access these scripts after installation:

# Access script utilities programmatically
from takollama.scripts import RAG_query, extract_text

Command Line Tools

After installation, TakoLlama provides command-line tools for common tasks:

Query RAG Database

takollama-query \
  --e_model mxbai-embed-large \
  --LLM_model llama3.1 \
  --db_dir ./my_db \
  --db_name my_collection \
  --query "Your question here" \
  --k 4

Extract Text from Documents

# Extract from PDFs and HTML files
takollama-extract \
  --input_dir ./documents \
  --output_dir ./extracted_text \
  --process_pdfs \
  --process_html

# Extract from URLs
takollama-extract \
  --output_dir ./web_content \
  --urls_file urls.txt \
  --process_urls \
  --max_depth 2

# Extract from all sources
takollama-extract \
  --input_dir ./documents \
  --output_dir ./all_content \
  --urls_file urls.txt \
  --process_pdfs \
  --process_html \
  --process_urls \
  --chunk_size 800 \
  --overlap 150

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the GNU GPL v3.0, as specified in the LICENSE file.

Requirements

  • bs4>=0.0.2
  • chromadb>=1.0.17
  • langchain-community>=0.3.27
  • ollama>=0.5.3
  • pypdf>=6.0.0

Support

For issues and questions, please use the GitHub issue tracker.
