Local-first RAG indexer for repos, docs, and PDFs
fragmenter
Build powerful RAG (Retrieval-Augmented Generation) systems with multiple LLM providers and zero configuration hassle.
Features
- Multiple LLM Providers: OpenAI, Anthropic, Ollama, and HuggingFace support out of the box
- Smart Incremental Updates: Only processes changed files, so no wasted computation
- Intelligent Parsing: Automatic file-type detection for Markdown, code, PDF, and more
- Beautiful CLI: Rich formatting with colors and progress indicators
- Web Scraping: Built-in scraper to ingest content from websites
- Vector Store Persistence: Save and reload indexes efficiently
- Code Extraction: Automatically extract code blocks from LLM responses
- Environment-Based Config: Simple `.env` file configuration
- Zero-Code Usage: CLI tools for complete workflows without writing code
- Library Mode: Full programmatic API for custom integrations
Installation
Install as a CLI tool (recommended)
# Install globally as a tool
uv tool install 'fragmenter[openai]'
# Or run instantly without installing
uvx fragmenter init
Add as a project dependency
Install the core package plus the provider(s) you need:
# Pick one (or more) LLM provider extras:
uv add 'fragmenter[openai]' # OpenAI (default provider)
uv add 'fragmenter[anthropic]' # Anthropic
uv add 'fragmenter[ollama]' # Ollama (local models)
uv add 'fragmenter[huggingface]' # HuggingFace
# Or combine several:
uv add 'fragmenter[openai,ollama]'
# Or install everything:
uv add 'fragmenter[all-providers]'
Traditional pip install
pip install 'fragmenter[openai]'
[!NOTE] LLM provider packages are not included in the base install to keep downloads small. If you see an `ImportError` mentioning a missing extra, install the matching provider extra shown in the error message.
Quick Start
Prerequisites
Before you begin, ensure you have:
- Python: 3.12 or higher
- API Keys: For your chosen LLM provider (OpenAI, Anthropic, etc.)
1. Initialize your project
# Create .env template
fragmenter init
Edit the generated .env file with your API credentials:
# .env
OPENAI_API_KEY=sk-your-actual-key-here
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o-mini
EMBED_PROVIDER=openai
EMBED_MODEL=text-embedding-3-small
[!NOTE] See the Configuration section for all available providers and models.
2. Prepare your data
# Create data directory
mkdir data
# Add your documents (markdown, code, PDFs, etc.)
cp /path/to/your/docs/* ./data/
3. Build the index
fragmenter rebuild-index \
--data-dir ./data \
--storage-dir ./vector_store
What happens next?
- Scans your data directory
- Detects file types and applies appropriate parsers
- Chunks documents intelligently
- Generates embeddings
- Stores vectors for fast retrieval
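The chunking step can be pictured as a fixed-size splitter with overlap; the sketch below is illustrative only, and fragmenter's real chunking strategy may be more sophisticated:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks.

    Overlap preserves context across chunk boundaries, which helps
    retrieval match queries whose answer spans a boundary.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk is then embedded separately, so smaller chunks trade recall granularity against embedding cost.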
4. Query your data
# Ask a question
fragmenter query \
--storage-dir ./vector_store \
--query "What is this data about?"
[!TIP] Save responses to files with `--output` and extract code with `--code-only`:
fragmenter query \
  -s ./vector_store \
  -q "Write a Python example" \
  -o output.py \
  --code-only \
  --language python
CLI Tools
init
Create a .env template file in your project.
fragmenter init
scrape
Scrape content from websites and save as markdown or HTML.
# Scrape as markdown (default)
fragmenter scrape \
https://example.com \
-o ./data
# Scrape as HTML
fragmenter scrape \
https://example.com \
-o ./data \
--format html
rebuild-index
Build or update the RAG index with automatic incremental updates.
fragmenter rebuild-index \
--data-dir ./data \
--storage-dir ./vector_store
[!NOTE] Incremental updates mean only new or modified files are processed, saving time and compute resources.
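Conceptually, incremental updates like this can be implemented by keeping a manifest of per-file content hashes and reprocessing only files whose hash changed. A minimal sketch of the idea (an illustration, not fragmenter's actual implementation):

```python
import hashlib
import json
from pathlib import Path


def changed_files(data_dir: Path, manifest_path: Path) -> list[Path]:
    """Return files whose content hash differs from the saved manifest,
    then write the updated manifest back to disk."""
    old = {}
    if manifest_path.exists():
        old = json.loads(manifest_path.read_text())
    new, changed = {}, []
    for path in sorted(data_dir.rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        new[str(path)] = digest
        if old.get(str(path)) != digest:
            changed.append(path)
    manifest_path.write_text(json.dumps(new, indent=2))
    return changed
```

Hashing content rather than comparing modification times makes the check robust to files being touched without actually changing.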
query
Query the index with natural language.
# Basic query
fragmenter query \
-s ./vector_store \
-q "Your question here"
# Query from file
fragmenter query \
-s ./vector_store \
-f question.txt
# Save output
fragmenter query \
-s ./vector_store \
-q "Generate code" \
-o output.cpp \
--code-only \
--language cpp
# Use different provider
fragmenter query \
-s ./vector_store \
-q "Explain this" \
--llm-provider anthropic \
--llm-model claude-3-5-sonnet-20241022
inspect-index
View index statistics and contents.
fragmenter inspect-index \
-s ./vector_store
Configuration
All settings can be configured via environment variables. Create a .env file or set them in your shell.
LLM Providers
| Provider | Extra | Configuration |
|---|---|---|
| OpenAI | `[openai]` | `LLM_PROVIDER=openai` `LLM_MODEL=gpt-4o-mini` |
| Anthropic | `[anthropic]` | `LLM_PROVIDER=anthropic` `LLM_MODEL=claude-3-5-sonnet-20241022` |
| Ollama | `[ollama]` | `LLM_PROVIDER=ollama` `LLM_MODEL=llama3.2` |
| HuggingFace | `[huggingface]` | `LLM_PROVIDER=huggingface` `LLM_MODEL=meta-llama/Llama-3.2-3B-Instruct` |
Embedding Providers
| Provider | Configuration |
|---|---|
| OpenAI | `EMBED_PROVIDER=openai` `EMBED_MODEL=text-embedding-3-small` |
| HuggingFace | `EMBED_PROVIDER=huggingface` `EMBED_MODEL=BAAI/bge-small-en-v1.5` |
| Ollama | `EMBED_PROVIDER=ollama` `EMBED_MODEL=nomic-embed-text` |
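At query time, retrieval over these embeddings is essentially nearest-neighbor search by cosine similarity. A toy sketch of the idea (real vector stores use optimized indexes; this is not fragmenter's internal code):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return ids of the k stored vectors most similar to the query embedding."""
    ranked = sorted(vectors, key=lambda i: cosine_similarity(query, vectors[i]), reverse=True)
    return ranked[:k]
```

The chunks behind the top-k vectors are what gets stuffed into the LLM prompt as retrieved context.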
Complete .env Example
# LLM Configuration
OPENAI_API_KEY=sk-your-key-here
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o-mini
# Embedding Configuration
EMBED_PROVIDER=openai
EMBED_MODEL=text-embedding-3-small
# Optional: Anthropic
ANTHROPIC_API_KEY=sk-ant-your-key-here
# Optional: HuggingFace
HUGGINGFACE_TOKEN=hf_your-token-here
[!CAUTION] Never commit your `.env` file to version control! Add it to `.gitignore` to protect your API keys.
Using as a Library
If you need custom logic or want to integrate into your own application:
from dotenv import load_dotenv
from fragmenter.config import RAGSettings
from fragmenter.rag.ingestion import build_index
from fragmenter.rag.inference import load_index, query_index
# Load configuration
load_dotenv()
settings = RAGSettings()
settings.configure_llm_settings()
# Build index
build_index(input_dir="./data", persist_dir="./vector_store")
# Query
index = load_index("./vector_store")
response = query_index(index, "Your question")
print(response)
Usage Examples
Example 1: Documentation RAG
Build a RAG system for your project documentation:
# 1. Scrape your docs site
fragmenter scrape \
https://docs.example.com \
-o ./data/docs
# 2. Build the index
fragmenter rebuild-index \
-d ./data \
-s ./vector_store
# 3. Query
fragmenter query \
-s ./vector_store \
-q "How do I configure authentication?"
Example 2: Code Analysis
Analyze a codebase and generate examples:
# 1. Copy code files to data directory
cp -r /path/to/project/src ./data/
# 2. Build index
fragmenter rebuild-index -d ./data -s ./vector_store
# 3. Generate code examples
fragmenter query \
-s ./vector_store \
-q "Show me how to use the authentication module" \
-o example.py \
--code-only \
--language python
Example 3: Research Assistant
Build a research assistant for papers and articles:
# 1. Add PDFs and markdown files to data/
# 2. Build index
fragmenter rebuild-index -d ./data -s ./vector_store
# 3. Query with different providers
fragmenter query \
-s ./vector_store \
-q "Summarize the key findings about neural networks" \
--llm-provider anthropic \
--llm-model claude-3-5-sonnet-20241022
[!TIP] See examples/waywise for a complete real-world example with custom configuration.
Troubleshooting
Missing Provider Errors
[!WARNING] If you see an `ImportError` like "…requires the 'openai' extra":
uv add 'fragmenter[openai]'  # install the provider you need
See the LLM Providers table for all available extras.
Authentication Errors
[!WARNING] If you encounter authentication errors:
- Verify your API key is correct and not expired
- Check that you've set the correct provider name (`openai`, not `OpenAI`)
- Ensure API key environment variable names match your provider
- Run `fragmenter init` to generate a fresh `.env` template
File Parsing Issues
[!NOTE] If certain files aren't being indexed:
- Check that file extensions are supported (`.md`, `.py`, `.txt`, etc.)
- Verify files are in the `--data-dir` path
- Use `--log-level DEBUG` to see detailed parsing information
- Check file permissions (files must be readable)
Vector Store Errors
[!TIP] If you see vector store errors:
- Delete the `./vector_store` directory and rebuild from scratch
- Ensure you have write permissions in the storage directory
- Check available disk space
- Verify the embedding model is properly configured
Provider-Specific Issues
Ollama:
# Ensure Ollama is running
ollama serve
# Pull the model first
ollama pull llama3.2
HuggingFace:
- Set `HUGGINGFACE_TOKEN` for private models
- Some models require acceptance of terms on the HuggingFace website
Development
Setup
git clone https://github.com/RISE-Dependable-Transport-Systems/fragmenter.git
cd fragmenter
uv sync --all-groups
Common Tasks
just lint # Run all linters via pre-commit
just fmt # Auto-format code
just test # Run unit tests
just test-cov # Run tests with coverage
just build # Build sdist and wheel
just check-all # Lint + test
just all # Full pipeline: clean → install → lint → test → build → verify → install-test
Examples
- Complete Real-World Example: See examples/waywise for a full setup with custom data, configuration, and evaluation scripts.
- Developer Example: See examples/dev_examples/main.py for a programmatic usage demonstration of the RAG framework.
Contributing
Contributions welcome! Please ensure:
- Code is formatted (`just fmt`)
- All linters pass (`just lint`)
- Tests pass (`just test`)
- New features include tests and documentation
- No API keys or secrets in commits
License
MIT License: see the LICENSE file for details.