
Local-first RAG indexer for repos, docs, and PDFs


🧩🔎 fragmenter


Build powerful RAG (Retrieval-Augmented Generation) systems with multiple LLM providers and zero configuration hassle.


✨ Features

  • 🤖 Multiple LLM Providers: OpenAI, Anthropic, Ollama, and HuggingFace support out-of-the-box
  • 🔄 Smart Incremental Updates: Only processes changed files, so no wasted computation
  • 📄 Intelligent Parsing: Automatic file-type detection for Markdown, Code, PDF, and more
  • 🎨 Beautiful CLI: Rich formatting with colors and progress indicators
  • 🌐 Web Scraping: Built-in scraper to ingest content from websites
  • 💾 Vector Store Persistence: Save and reload indexes efficiently
  • 🔍 Code Extraction: Automatically extract code blocks from LLM responses
  • ⚙️ Environment-Based Config: Simple .env file configuration
  • 🚀 Zero-Code Usage: CLI tools for complete workflows without writing code
  • 📦 Library Mode: Full programmatic API for custom integrations

📦 Installation

Install as a CLI tool (recommended)

# Install globally as a tool
uv tool install 'fragmenter[openai]'

# Or run instantly without installing
uvx fragmenter init

Add as a project dependency

Install the core package plus the provider(s) you need:

# Pick one (or more) LLM provider extras:
uv add 'fragmenter[openai]'         # OpenAI (default provider)
uv add 'fragmenter[anthropic]'      # Anthropic
uv add 'fragmenter[ollama]'         # Ollama (local models)
uv add 'fragmenter[huggingface]'    # HuggingFace

# Or combine several:
uv add 'fragmenter[openai,ollama]'

# Or install everything:
uv add 'fragmenter[all-providers]'

Traditional pip install

pip install 'fragmenter[openai]'

[!NOTE] LLM provider packages are not included in the base install to keep downloads small. If you see an ImportError mentioning a missing extra, install the matching provider extra shown in the error message.


🚀 Quick Start

Prerequisites

Before you begin, ensure you have:

  • Python: 3.12 or higher ✅
  • API Keys: For your chosen LLM provider (OpenAI, Anthropic, etc.) 🔑

1. Initialize your project

# Create .env template
fragmenter init

Edit the generated .env file with your API credentials:

# .env
OPENAI_API_KEY=sk-your-actual-key-here
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o-mini
EMBED_PROVIDER=openai
EMBED_MODEL=text-embedding-3-small

[!NOTE] See the Configuration section for all available providers and models.
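
The library example further down loads these values with python-dotenv. Conceptually, a .env file is just KEY=VALUE lines with # comments; a minimal sketch of that parsing (not fragmenter's actual loader) looks like this:

```python
from pathlib import Path

def parse_env(path: str) -> dict[str, str]:
    """Minimal .env reader: KEY=VALUE lines; '#' comments and blanks are ignored."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values
```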

2. Prepare your data

# Create data directory
mkdir data

# Add your documents (markdown, code, PDFs, etc.)
cp /path/to/your/docs/* ./data/

3. Build the index

fragmenter rebuild-index \
    --data-dir ./data \
    --storage-dir ./vector_store

What happens next? 🎬

  1. 📁 Scans your data directory
  2. 🔍 Detects file types and applies appropriate parsers
  3. ✂️ Chunks documents intelligently
  4. 🧮 Generates embeddings
  5. 💾 Stores vectors for fast retrieval
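
fragmenter's chunking strategy is internal to its parsers, but step 3 can be illustrated with a basic fixed-size chunker with overlap (a sketch, not the actual implementation):

```python
def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into windows of `size` characters overlapping by `overlap` characters."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from both neighbouring chunks.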

4. Query your data

# Ask a question
fragmenter query \
    --storage-dir ./vector_store \
    --query "What is this data about?"

[!TIP] Save responses to files with --output and extract code with --code-only:

fragmenter query \
    -s ./vector_store \
    -q "Write a Python example" \
    -o output.py \
    --code-only \
    --language python

๐Ÿ› ๏ธ CLI Tools

init

Create a .env template file in your project.

fragmenter init

scrape

Scrape content from websites and save as markdown or HTML.

# Scrape as markdown (default)
fragmenter scrape \
    https://example.com \
    -o ./data

# Scrape as HTML
fragmenter scrape \
    https://example.com \
    -o ./data \
    --format html
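
fragmenter ships its own scraper; purely to illustrate the conversion step, here is a minimal HTML-to-text pass using only the standard library (a sketch, not the scraper fragmenter uses):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```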

rebuild-index

Build or update the RAG index with automatic incremental updates.

fragmenter rebuild-index \
    --data-dir ./data \
    --storage-dir ./vector_store

[!NOTE] Incremental updates mean only new or modified files are processed, saving time and compute resources.
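
fragmenter's change tracking is internal; the general idea behind incremental indexing is to keep a fingerprint per file and re-process only files whose fingerprint changed. A hedged sketch of that bookkeeping:

```python
import hashlib
import json
from pathlib import Path

def fingerprint(path: Path) -> str:
    """Content hash of a file; changes whenever the file's bytes change."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_files(data_dir: str, manifest_path: str) -> list[Path]:
    """Return files that are new or modified since the manifest was last written."""
    manifest_file = Path(manifest_path)
    old = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    current = {str(p): fingerprint(p) for p in Path(data_dir).rglob("*") if p.is_file()}
    manifest_file.write_text(json.dumps(current))
    return [Path(p) for p, digest in current.items() if old.get(p) != digest]
```

Content hashes (rather than modification times) survive copies and clock skew, at the cost of reading every file on each run.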

query

Query the index with natural language.

# Basic query
fragmenter query \
    -s ./vector_store \
    -q "Your question here"

# Query from file
fragmenter query \
    -s ./vector_store \
    -f question.txt

# Save output
fragmenter query \
    -s ./vector_store \
    -q "Generate code" \
    -o output.cpp \
    --code-only \
    --language cpp

# Use different provider
fragmenter query \
    -s ./vector_store \
    -q "Explain this" \
    --llm-provider anthropic \
    --llm-model claude-3-5-sonnet-20241022

inspect-index

View index statistics and contents.

fragmenter inspect-index \
    -s ./vector_store

โš™๏ธ Configuration

All settings can be configured via environment variables. Create a .env file or set them in your shell.

LLM Providers

Provider      Extra            Configuration
OpenAI        [openai]         LLM_PROVIDER=openai, LLM_MODEL=gpt-4o-mini
Anthropic     [anthropic]      LLM_PROVIDER=anthropic, LLM_MODEL=claude-3-5-sonnet-20241022
Ollama        [ollama]         LLM_PROVIDER=ollama, LLM_MODEL=llama3.2
HuggingFace   [huggingface]    LLM_PROVIDER=huggingface, LLM_MODEL=meta-llama/Llama-3.2-3B-Instruct

Embedding Providers

Provider      Configuration
OpenAI        EMBED_PROVIDER=openai, EMBED_MODEL=text-embedding-3-small
HuggingFace   EMBED_PROVIDER=huggingface, EMBED_MODEL=BAAI/bge-small-en-v1.5
Ollama        EMBED_PROVIDER=ollama, EMBED_MODEL=nomic-embed-text

Complete .env Example

# LLM Configuration
OPENAI_API_KEY=sk-your-key-here
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o-mini

# Embedding Configuration
EMBED_PROVIDER=openai
EMBED_MODEL=text-embedding-3-small

# Optional: Anthropic
ANTHROPIC_API_KEY=sk-ant-your-key-here

# Optional: HuggingFace
HUGGINGFACE_TOKEN=hf_your-token-here

[!CAUTION] Never commit your .env file to version control! Add it to .gitignore to protect your API keys.


💻 Using as a Library

If you need custom logic or want to integrate into your own application:

from dotenv import load_dotenv
from fragmenter.config import RAGSettings
from fragmenter.rag.ingestion import build_index
from fragmenter.rag.inference import load_index, query_index

# Load configuration
load_dotenv()
settings = RAGSettings()
settings.configure_llm_settings()

# Build index
build_index(input_dir="./data", persist_dir="./vector_store")

# Query
index = load_index("./vector_store")
response = query_index(index, "Your question")
print(response)
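
At query time, the vector store performs a nearest-neighbour search over the stored embeddings. fragmenter delegates this to its index; the core operation can be sketched as cosine-similarity top-k over chunk embeddings (an illustration, not fragmenter's internals):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k stored chunks most similar to the query embedding."""
    ranked = sorted(chunks, key=lambda cid: cosine(query, chunks[cid]), reverse=True)
    return ranked[:k]
```

The retrieved chunks are then passed to the LLM as context for answering the question.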

🌱 Usage Examples

Example 1: Documentation RAG

Build a RAG system for your project documentation:

# 1. Scrape your docs site
fragmenter scrape \
    https://docs.example.com \
    -o ./data/docs

# 2. Build the index
fragmenter rebuild-index \
    -d ./data \
    -s ./vector_store

# 3. Query
fragmenter query \
    -s ./vector_store \
    -q "How do I configure authentication?"

Example 2: Code Analysis

Analyze a codebase and generate examples:

# 1. Copy code files to data directory
cp -r /path/to/project/src ./data/

# 2. Build index
fragmenter rebuild-index -d ./data -s ./vector_store

# 3. Generate code examples
fragmenter query \
    -s ./vector_store \
    -q "Show me how to use the authentication module" \
    -o example.py \
    --code-only \
    --language python

Example 3: Research Assistant

Build a research assistant for papers and articles:

# 1. Add PDFs and markdown files to data/
# 2. Build index
fragmenter rebuild-index -d ./data -s ./vector_store

# 3. Query with different providers
fragmenter query \
    -s ./vector_store \
    -q "Summarize the key findings about neural networks" \
    --llm-provider anthropic \
    --llm-model claude-3-5-sonnet-20241022

[!TIP] See examples/waywise for a complete real-world example with custom configuration.


🔧 Troubleshooting

📦 Missing Provider Errors

[!WARNING] If you see an ImportError like "…requires the 'openai' extra":

uv add 'fragmenter[openai]'   # install the provider you need

See the LLM Providers table for all available extras.

🔐 Authentication Errors

[!WARNING] If you encounter authentication errors:

  • ✅ Verify your API key is correct and not expired
  • ✅ Check that you've set the correct provider name (openai, not OpenAI)
  • ✅ Ensure API key environment variable names match your provider
  • ✅ Run fragmenter init to generate a fresh .env template
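
A quick way to confirm the key is actually visible to the process is a check like the one below; the variable names come from the .env examples above, but the provider-to-variable mapping itself is illustrative, not fragmenter's code:

```python
import os

# Key variables from the .env examples; Ollama runs locally and needs no key.
KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "huggingface": "HUGGINGFACE_TOKEN",
}

def check_provider_key(provider: str, env=os.environ) -> None:
    """Raise early with a clear message if the provider's key variable is unset."""
    var = KEY_VARS.get(provider)
    if var and not env.get(var):
        raise RuntimeError(f"{var} is not set; did you load your .env file?")
```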

๐Ÿ“ File Parsing Issues

[!NOTE] If certain files aren't being indexed:

  • Check file extensions are supported (.md, .py, .pdf, .txt, etc.)
  • Verify files are in the --data-dir path
  • Use --log-level DEBUG to see detailed parsing information
  • Check file permissions (files must be readable)

💾 Vector Store Errors

[!TIP] If you see vector store errors:

  • Delete the ./vector_store directory and rebuild from scratch
  • Ensure you have write permissions in the storage directory
  • Check available disk space
  • Verify embedding model is properly configured

๐ŸŒ Provider-Specific Issues

Ollama:

# Ensure Ollama is running
ollama serve

# Pull the model first
ollama pull llama3.2

HuggingFace:

  • Set HUGGINGFACE_TOKEN for private models
  • Some models require acceptance of terms on HuggingFace website

๐Ÿ› ๏ธ Development

Setup

git clone https://github.com/RISE-Dependable-Transport-Systems/fragmenter.git
cd fragmenter
uv sync --all-groups

Common Tasks

just lint              # Run all linters via pre-commit
just fmt               # Auto-format code
just test              # Run unit tests
just test-cov          # Run tests with coverage
just build             # Build sdist and wheel
just check-all         # Lint + test
just all               # Full pipeline: clean → install → lint → test → build → verify → install-test

📖 Examples

  • Complete Real-World Example: See examples/waywise for a full setup with custom data, configuration, and evaluation scripts.
  • Developer Example: See examples/dev_examples/main.py for a programmatic usage demonstration of the RAG framework.

🙌 Contributing

Contributions welcome! Please ensure:

  • ✅ Code is formatted (just fmt)
  • ✅ All linters pass (just lint)
  • ✅ Tests pass (just test)
  • ✅ New features include tests and documentation
  • 🔒 No API keys or secrets in commits

📄 License

MIT License; see the LICENSE file for details.
