A document ingestion and RAG query system with FAISS indexing and OCR support

Development Status
- 3 - Alpha
Intended Audience
- Developers

PyRagix

A clean, typed, Pythonic pipeline for Retrieval-Augmented Generation (RAG). Ingest HTML, PDF, and image-based documents, build a FAISS vector store, and search with ease using Ollama for answer generation. Designed for developers learning RAG, vector search, and document processing in Python.

PyRagix is a lightweight, educational project to help you explore how to process diverse documents (HTML, PDF, images) and enable intelligent search using modern AI tools. It's tuned for modest hardware (e.g., 16GB RAM / 6GB VRAM) with memory optimizations, but can be customized via settings.json. This project is meant to be a practical, well-structured example for Python developers diving into RAG.

Features

Cross-Platform: Runs natively on Windows, Linux, and macOS with identical functionality. Uses pathlib for universal file handling.
Document Ingestion: Extract text from HTML, PDF, and images using PaddleOCR for OCR fallback, PyMuPDF for PDFs, and BeautifulSoup for HTML.
Vector Store: Build a FAISS index with Sentence Transformers embeddings. Supports both Flat and IVF (Inverted File) indexing for optimal performance scaling.
Console Search: Query your document collection via an interactive command-line interface, with Ollama generating human-like answers from retrieved contexts.
Web Interface: Modern, responsive web UI for searching documents with real-time status indicators, configurable options, and beautiful results presentation.
Pythonic Design: Clean, typed, idiomatic Python code with protocols, context managers, and memory cleanup for clarity and maintainability.
Memory Optimizations: Adaptive memory settings based on system RAM, tiled OCR for large pages, batch embedding with retry logic, and automatic garbage collection.
Modular Architecture: Separate classes for OCR processing and configuration management for better code organization and testing.
Advanced Indexing: Configurable FAISS indexing with IVF support for faster search on large datasets, with intelligent fallback for robust operation.
Hybrid CPU/GPU Support: Automatic detection of GPU FAISS capabilities with graceful fallback to CPU-only operation for universal compatibility.
Modern Web Interface: Complete TypeScript/FastAPI web application with professional dark theme, real-time search, and responsive design.

Project Structure

PyRagix/
├── ingest_folder.py        # Main ingestion script
├── query_rag.py           # RAG query interface (console)
├── web_server.py          # FastAPI web server
├── start_web.bat          # Web interface startup script
├── config.py              # Configuration loader and validation
├── settings.json          # User configuration file (auto-generated)
├── classes/
│   ├── ProcessingConfig.py # Data class for processing configuration
│   └── OCRProcessor.py     # OCR operations handler
├── web/                   # Web interface files
│   ├── index.html         # Main web interface
│   ├── style.css          # Modern dark theme styling
│   ├── script.ts          # TypeScript source (ES2024)
│   ├── script.js          # Compiled JavaScript
│   ├── tsconfig.json      # TypeScript configuration
│   └── dev.bat           # TypeScript development script
├── requirements.in         # Package dependencies (source)
├── requirements.txt        # Compiled dependencies
├── local_faiss.index      # Generated FAISS vector index
├── documents.pkl          # Document metadata
├── processed_files.txt    # Log of processed files
├── ingestion.log         # Processing logs
└── crash_log.txt         # Error logs (when failures occur)

Installation

Clone the Repository:

git clone https://github.com/<your-username>/PyRagix.git
cd PyRagix

Set Up a Virtual Environment (recommended):

# Linux/Mac
python -m venv venv
source venv/bin/activate

# Windows
python -m venv venv
venv\Scripts\activate.bat

Install Dependencies: PyRagix uses a requirements.in file for dependency management. Ensure you have pip and pip-tools installed, then run:
```
pip install pip-tools
pip-compile requirements.in  # Generates requirements.txt
pip install -r requirements.txt
```
Note: The dependency list includes torch, transformers, faiss-cpu, paddleocr, paddlepaddle, sentence-transformers, PyMuPDF (imported as fitz), fastapi, uvicorn, and others. Ensure you have sufficient disk space and a compatible Python version (3.8+ recommended). For GPU acceleration, install CUDA-enabled versions where applicable.

Ollama Setup (for Querying):
- Install Ollama: Follow instructions at ollama.com.
- Pull the default model: ollama pull llama3.1:8b-instruct-q4_0
- Start the Ollama server: ollama serve

To use a different Ollama model or URL, edit the Ollama settings in settings.json (query_rag.py reads them via config.py).

Usage

PyRagix has one ingestion script and two search interfaces (console and web):

ingest_folder.py: Processes a folder of documents (HTML, PDF, images) and builds a FAISS vector store.
query_rag.py: Interactive console-based search interface.
web_server.py: Modern web interface with REST API backend.

Step 1: Ingest Documents

Run the ingestion script to process a folder and create a FAISS index:

python ingest_folder.py [path/to/documents]

If no folder is provided, it uses the default from config.py (e.g., ./docs).
Supported formats: PDF, HTML/HTM, images (via OCR).
Outputs: local_faiss.index (FAISS index), documents.pkl (metadata), processed_files.txt (processed file log), ingestion.log (processing log), and crash_log.txt (errors if any).
Resumes from existing index if available; skips already processed files.
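
The resume behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in ingest_folder.py: the extension list, the helper names, and the one-path-per-line format of processed_files.txt are all assumptions.

```python
from pathlib import Path
from typing import List, Set

# Assumed extension list; the README only names "PDF, HTML/HTM, images".
SUPPORTED_SUFFIXES = {".pdf", ".html", ".htm", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def load_processed(log_path: Path) -> Set[str]:
    """Return the paths already recorded in the processed-file log."""
    if not log_path.exists():
        return set()
    lines = log_path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def pending_files(root: Path, log_path: Path) -> List[Path]:
    """List supported files under root (recursively) that are not in the log yet."""
    done = load_processed(log_path)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file()
        and path.suffix.lower() in SUPPORTED_SUFFIXES
        and str(path.resolve()) not in done
    )


def mark_processed(path: Path, log_path: Path) -> None:
    """Append one path to the log so that a later run skips it."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(f"{path.resolve()}\n")
```

Recording resolved absolute paths keeps the skip check stable even if the script is started from a different working directory.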

Customization: Edit settings.json for hardware tuning (e.g., batch size, thread counts, index type). The file is auto-generated on first run with optimal defaults for your system. IVF indexing is enabled by default for better performance scaling.

Example:

python ingest_folder.py ./my_documents

This scans ./my_documents and subfolders, extracts text (with OCR fallback for images/scans), chunks it, embeds with all-MiniLM-L6-v2, and adds to a FAISS IVF index optimized for fast retrieval.
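
To make the chunk-and-embed step concrete, here is a small sketch. The word-based splitting, the chunk size, and the overlap are assumptions chosen for illustration; PyRagix's actual chunking strategy is not described in this README. Only the model name all-MiniLM-L6-v2 comes from the text above.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping chunks of at most chunk_size words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def embed_chunks(chunks: List[str], batch_size: int = 16):
    """Embed chunks with all-MiniLM-L6-v2 (384-dimensional vectors)."""
    # Imported lazily so that chunk_text works without the model installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(chunks, batch_size=batch_size, show_progress_bar=False)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.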

Step 2: Search Documents

PyRagix offers two search interfaces:

Option A: Web Interface (Recommended)

Launch the modern web interface:

# Windows (using convenience script)
start_web.bat

# Linux/Mac/Windows (direct command)
python web_server.py

Then open your browser to:

Web Interface: http://localhost:8000/web/
API Documentation: http://localhost:8000/docs
Health Check: http://localhost:8000/health

Web Interface Features:

Modern, responsive dark theme design
Real-time server status indicator
Configurable search options (results count, sources, debug mode)
Beautiful answer presentation with source highlighting
TypeScript-powered frontend with ES2024 features
REST API backend for integration

Option B: Console Interface

Launch the interactive console-based search interface:

python query_rag.py

Loads the FAISS index and metadata.
Enter queries at the prompt; get generated answers from Ollama based on retrieved contexts.
Shows sources with scores and chunk indices.
Type 'quit' or 'exit' to stop.

Example Interaction:

Query: What is machine learning?

Answer:
===========
Machine learning is a subset of AI that focuses on building systems that learn from data...
(Generated from Ollama using retrieved contexts)
===========

Sources:
1. intro.pdf (chunk 0, score: 0.920)
2. ml_basics.html (chunk 1, score: 0.850)
...

Platform Notes:

All platforms: Core Python functionality is identical across Windows, Linux, and macOS
Windows users: Convenience .bat scripts are provided (start_web.bat, ingest.bat, query.bat)
Linux/Mac users: Run Python commands directly or adapt .bat scripts to shell scripts
TypeScript development: Requires npm install -g typescript for compilation
Ensure Ollama is running before starting queries on any platform

Configuration

settings.json: Main configuration file for hardware tuning (e.g., thread limits, batch size, CUDA settings, FAISS index type). Auto-generated with system-appropriate defaults.
classes/ProcessingConfig.py: Adaptive configuration that automatically adjusts memory settings based on available system RAM.
query_rag.py: Ollama API settings loaded from settings.json via config.py.
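
As an illustration of how an auto-generated settings file with defaults might be handled, consider the sketch below. The function name and the validation rule are hypothetical and do not reflect config.py; the default values are the ones documented in this README.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Defaults copied from the values documented in this README.
DEFAULTS: Dict[str, Any] = {
    "INDEX_TYPE": "ivf",
    "NLIST": 1024,
    "NPROBE": 16,
    "BATCH_SIZE": 16,
    "BASE_DPI": 150,
}


def load_settings(path: Path) -> Dict[str, Any]:
    """Load settings.json, creating it from DEFAULTS on first run."""
    if not path.exists():
        path.write_text(json.dumps(DEFAULTS, indent=2) + "\n", encoding="utf-8")
        return dict(DEFAULTS)
    loaded = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(loaded, dict):
        raise ValueError(f"{path} must contain a JSON object")
    # Keys set by the user win; missing keys fall back to the defaults.
    settings = {**DEFAULTS, **loaded}
    if settings["INDEX_TYPE"] not in ("ivf", "flat"):
        raise ValueError("INDEX_TYPE must be 'ivf' or 'flat'")
    return settings
```

Merging over defaults lets a settings file list only the keys the user wants to change.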

FAISS Index Types

PyRagix supports two FAISS index types via the INDEX_TYPE setting:

"ivf" (default): IVF (Inverted File) indexing for faster searches on large datasets. Configurable via NLIST (clusters, default: 1024) and NPROBE (search clusters, default: 16). Recommended for >10k documents.
"flat": Flat indexing for exact, exhaustive search. Every query is compared against every stored vector, so results are exact but search time grows linearly with index size. Recommended for smaller datasets or when maximum precision is required.

Optimal settings for modest hardware (16GB RAM, 6GB VRAM):

{
  "INDEX_TYPE": "ivf",
  "NLIST": 1024,
  "NPROBE": 16
}
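
The sketch below shows one way these settings could map onto FAISS calls. It is an assumption-laden illustration: the fallback rule (use a flat index when there are fewer vectors than NLIST), the L2 metric, and the function names are hypothetical, since this README does not spell out PyRagix's own fallback logic.

```python
def choose_index_type(n_vectors: int, index_type: str = "ivf", nlist: int = 1024) -> str:
    """Pick 'flat' when there are too few vectors to train nlist IVF clusters."""
    if index_type == "ivf" and n_vectors < nlist:
        return "flat"  # assumed fallback rule, not taken from PyRagix
    return index_type


def build_index(vectors, index_type: str = "ivf", nlist: int = 1024, nprobe: int = 16):
    """Build a FAISS index from a 2-D array of embeddings."""
    # Imported lazily so that choose_index_type works without FAISS installed.
    import faiss
    import numpy as np

    vectors = np.ascontiguousarray(vectors, dtype="float32")
    dim = vectors.shape[1]
    if choose_index_type(len(vectors), index_type, nlist) == "flat":
        index = faiss.IndexFlatL2(dim)
    else:
        quantizer = faiss.IndexFlatL2(dim)  # coarse quantizer holding the centroids
        index = faiss.IndexIVFFlat(quantizer, dim, nlist)
        index.train(vectors)  # k-means over the vectors to place the nlist centroids
        index.nprobe = nprobe  # number of clusters scanned per query
    index.add(vectors)
    return index
```

With NLIST = 1024 and NPROBE = 16, a query scans 16 of the 1024 clusters, which is where the speedup over a flat index comes from.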

GPU Acceleration

PyRagix includes intelligent GPU detection and hybrid CPU/GPU support:

Automatic Detection: Detects if GPU FAISS functions are available
Graceful Fallback: Uses CPU when GPU unavailable (default behavior)
Configurable: Enable GPU acceleration via settings.json:

{
  "GPU_ENABLED": true,
  "GPU_DEVICE": 0,
  "GPU_MEMORY_FRACTION": 0.8
}

Note: GPU FAISS requires a CUDA-capable GPU and a GPU-enabled FAISS build installed in place of faiss-cpu. CPU-only FAISS is the default and is fully supported; GPU capabilities are used when they are available.
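
A rough sketch of detection with CPU fallback is shown here. The helper names are hypothetical and this is not PyRagix's detection code; it relies on StandardGpuResources being absent from CPU-only FAISS builds, and it does not model GPU_MEMORY_FRACTION.

```python
def gpu_faiss_available() -> bool:
    """Return True only if a GPU-enabled FAISS build sees at least one GPU."""
    try:
        import faiss
    except ImportError:
        return False
    if not hasattr(faiss, "StandardGpuResources"):
        return False  # CPU-only build
    # Guarded lookup, in case a build does not expose get_num_gpus.
    return getattr(faiss, "get_num_gpus", lambda: 0)() > 0


def maybe_move_to_gpu(index, gpu_enabled: bool, device: int = 0):
    """Return a GPU copy of index when possible, otherwise index unchanged."""
    if not (gpu_enabled and gpu_faiss_available()):
        return index
    import faiss

    resources = faiss.StandardGpuResources()
    return faiss.index_cpu_to_gpu(resources, device, index)
```

Because the CPU index is returned unchanged on any failed check, callers can use the result the same way in both cases.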

For larger indexes: increase NLIST (more clusters) in the IVF settings, and raise NPROBE along with it so that each query still scans a similar share of the index.

Advanced Configuration

PyRagix provides extensive configuration options in settings.json for fine-tuning performance and behavior. Here's a breakdown of the more technical parameters:

Performance & Threading

TORCH_NUM_THREADS, OPENBLAS_NUM_THREADS, MKL_NUM_THREADS, OMP_NUM_THREADS, NUMEXPR_MAX_THREADS: Control CPU parallelism for different math libraries. Default is 6 threads. Increase for high-core CPUs, decrease for shared systems or to reduce memory usage.
BATCH_SIZE: Number of documents processed simultaneously during embedding (default: 16). Larger values use more memory but can be faster. Reduce if you encounter out-of-memory errors.
BATCH_SIZE_RETRY_DIVISOR: When batch processing fails due to memory, the batch size is divided by this value (default: 4) and retried. Higher values mean more aggressive fallback.
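
The interaction between BATCH_SIZE and BATCH_SIZE_RETRY_DIVISOR can be sketched as below. This is a simplified model, not PyRagix's code: the function name is hypothetical, the embedder is passed in as a callable, and it catches Python's built-in MemoryError, whereas GPU code would more likely need to catch PyTorch's out-of-memory error.

```python
from typing import Callable, List, Sequence


def embed_with_retry(
    texts: Sequence[str],
    embed: Callable[[Sequence[str]], List],
    batch_size: int = 16,
    retry_divisor: int = 4,
) -> List:
    """Embed texts in batches, shrinking the batch size after a memory error."""
    results: List = []
    size = batch_size
    position = 0
    while position < len(texts):
        batch = texts[position:position + size]
        try:
            results.extend(embed(batch))
        except MemoryError:
            if size == 1:
                raise  # cannot shrink further; surface the error
            size = max(1, size // retry_divisor)
            continue  # retry the same position with the smaller batch
        position += len(batch)
    return results
```

In this sketch the reduced batch size is kept for the rest of the run, which avoids hitting the same failure again on every batch.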

CUDA Memory Management

PYTORCH_CUDA_ALLOC_CONF: Advanced CUDA memory allocation settings:
- max_split_size_mb:1024: The allocator does not split cached blocks larger than this size (in MB). Limiting the splitting of large blocks can reduce fragmentation.
- garbage_collection_threshold:0.9: The allocator starts reclaiming unused cached blocks once GPU memory usage exceeds 90% of the memory available to the process. Lower values free memory more aggressively.
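
Thread-count variables and PYTORCH_CUDA_ALLOC_CONF generally need to be in the environment before the numeric libraries are imported. The following sketch shows that ordering with a hypothetical helper; how PyRagix applies these settings internally is not documented here, and TORCH_NUM_THREADS is left out because PyTorch's thread count is normally set with torch.set_num_threads rather than an environment variable.

```python
import os
from typing import MutableMapping, Optional

# Environment variables read by OpenBLAS, MKL, OpenMP, and NumExpr.
THREAD_VARS = (
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OMP_NUM_THREADS",
    "NUMEXPR_MAX_THREADS",
)


def apply_runtime_env(
    num_threads: int = 6,
    cuda_alloc_conf: str = "max_split_size_mb:1024,garbage_collection_threshold:0.9",
    env: Optional[MutableMapping[str, str]] = None,
) -> MutableMapping[str, str]:
    """Write thread and CUDA allocator settings into env (os.environ by default)."""
    target = os.environ if env is None else env
    for name in THREAD_VARS:
        target[name] = str(num_threads)
    target["PYTORCH_CUDA_ALLOC_CONF"] = cuda_alloc_conf
    return target


# Call this before importing numpy or torch so the libraries see the values:
# apply_runtime_env(num_threads=4)
```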

OCR Processing

BASE_DPI: Resolution for OCR processing (default: 150). Higher values (200-300) improve text recognition accuracy but increase processing time and memory usage. Lower values (100-120) speed up processing for simple documents.

Document Processing

SKIP_FILES: Array of file patterns to ignore during ingestion (e.g., ["*.tmp", "backup_*"]). Supports glob patterns.
INGESTION_LOG_FILE, CRASH_LOG_FILE: Customize log file names for processing events and errors.
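
Glob-style skip patterns like these can be checked with the standard library. The sketch below assumes patterns are matched against the file name only, which may differ from what PyRagix does; the function name is hypothetical.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Sequence


def should_skip(path: Path, patterns: Sequence[str]) -> bool:
    """Return True if the file name matches any glob pattern in patterns."""
    return any(fnmatch(path.name, pattern) for pattern in patterns)
```

For example, with SKIP_FILES set to ["*.tmp", "backup_*"], a file named backup_2024.pdf would be skipped while report.pdf would be ingested.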

LLM Generation Parameters

TEMPERATURE: Controls response creativity (0.0-1.0, default: 0.1). Lower values produce more focused, deterministic answers. Higher values increase creativity but may reduce accuracy.
TOP_P: Nucleus sampling parameter (default: 0.9). Controls diversity by only considering tokens comprising the top 90% probability mass. Lower values make responses more focused.
MAX_TOKENS: Maximum length of generated answers (default: 500). Increase for longer responses, decrease to save time and tokens.
DEFAULT_TOP_K: Number of document chunks retrieved for each query (default: 7). More chunks provide richer context but may include less relevant information.
REQUEST_TIMEOUT: Ollama API timeout in seconds (default: 60). Increase for complex queries or slower models.
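
To show where these parameters end up, here is a sketch of a request to Ollama's /api/generate endpoint. The prompt wording and function names are hypothetical, the use of /api/generate and the default port 11434 are assumptions about PyRagix's setup, and urllib is used only to keep the example dependency-free (the project itself lists requests).

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_prompt(question: str, contexts: List[str]) -> str:
    """Combine retrieved chunks and the question into one prompt (wording is illustrative)."""
    numbered = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(contexts, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )


def build_payload(
    prompt: str,
    model: str = "llama3.1:8b-instruct-q4_0",
    temperature: float = 0.1,
    top_p: float = 0.9,
    max_tokens: int = 500,
) -> Dict[str, Any]:
    """Map TEMPERATURE, TOP_P, and MAX_TOKENS onto Ollama's generation options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "top_p": top_p, "num_predict": max_tokens},
    }


def generate(payload: Dict[str, Any], base_url: str = "http://localhost:11434",
             timeout: float = 60.0) -> str:
    """POST the payload to a running Ollama server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

DEFAULT_TOP_K would decide how many chunks are passed as contexts, and REQUEST_TIMEOUT corresponds to the timeout argument.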

Tuning Tips

Memory-constrained systems: Reduce BATCH_SIZE to 8 or lower, decrease thread counts to 2-4, and set BASE_DPI to 100.
High-performance systems: Increase thread counts to match CPU cores, raise BATCH_SIZE to 32+, and use BASE_DPI 200-300 for better OCR.
Better answers: Increase DEFAULT_TOP_K to 10-15, raise MAX_TOKENS to 800-1000, and fine-tune TEMPERATURE (0.2-0.3 for creative but focused responses).

Requirements

PyRagix depends on a robust set of Python libraries for AI, document processing, and vector search. Key dependencies include:

torch and transformers/sentence-transformers for embedding models
faiss-cpu for vector storage and search (with optional GPU support detection)
paddleocr and paddlepaddle for OCR operations
PyMuPDF (imported as fitz) for PDF processing
beautifulsoup4 (with optional lxml) for HTML parsing
requests for Ollama API calls
fastapi and uvicorn for the web interface and REST API
psutil for system memory detection

See requirements.in for the complete dependency list and requirements.txt for pinned versions. The system automatically adapts memory settings based on available RAM (16GB+ recommended for optimal performance).
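
As a rough illustration of RAM-based adaptation, the sketch below picks a batch size from total system memory. The thresholds are hypothetical, loosely following the Tuning Tips above (8 for constrained systems, 16 as the default, 32 for larger machines), and are not the values used in ProcessingConfig.py.

```python
def pick_batch_size(total_ram_gb: float) -> int:
    """Choose an embedding batch size from total RAM (thresholds are illustrative)."""
    if total_ram_gb >= 32:
        return 32
    if total_ram_gb >= 16:
        return 16
    return 8


def detect_total_ram_gb() -> float:
    """Return total system RAM in GiB using psutil."""
    # Imported lazily so that pick_batch_size works without psutil installed.
    import psutil

    return psutil.virtual_memory().total / 1024**3
```

A call such as pick_batch_size(detect_total_ram_gb()) would then give a starting value that settings.json can still override.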

Contributing

We welcome contributions! If you’re learning RAG or want to enhance PyRagix, here’s how to get started:

Fork the repo and create a feature branch.
Follow the installation steps above.
Submit a pull request with clear descriptions of your changes.

Ideas for contributions:

Add support for more document formats (e.g., DOCX).
Extend the existing web interface.
Optimize for different hardware (e.g., high-end GPUs or cloud).
Enhance OCR handling or embedding models.

Please adhere to Python’s PEP 8 style guide and include type hints for consistency.

License

This project is licensed under the MIT License. See LICENSE for details.

Acknowledgements

Built with love for the Python and AI communities.
Thanks to the creators of faiss, sentence-transformers, paddleocr, ollama, and langchain for their amazing tools.

Happy learning, and enjoy searching your documents with PyRagix! 🚀

Release history

- 0.4.1: Nov 5, 2025
- 0.4.0: Oct 28, 2025
- 0.3.1: Sep 2, 2025
- 0.3.0: Sep 2, 2025
- 0.2.0: Aug 30, 2025
- 0.1.1: Aug 29, 2025
- 0.1.0: Aug 28, 2025 (this version)

Download files

Source Distribution

pyragix-0.1.0.tar.gz (18.7 kB)

Uploaded Aug 28, 2025 Source

Built Distribution

pyragix-0.1.0-py3-none-any.whl (12.1 kB)

Uploaded Aug 28, 2025 Python 3

File details

Details for the file pyragix-0.1.0.tar.gz.

File metadata

Download URL: pyragix-0.1.0.tar.gz
Upload date: Aug 28, 2025
Size: 18.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.6

File hashes

Hashes for pyragix-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9f02141c91119ccaf7a64c85c005b484b2ba45696bac861b0f45f4bc3a2cba2e`
MD5	`77e9a90d9ea1b9f38b70313c87392edf`
BLAKE2b-256	`18f3365abe48e32520ff19abdbf3c57b39fc4f1234b447991f8e3ef0702e0b86`

File details

Details for the file pyragix-0.1.0-py3-none-any.whl.

File metadata

Download URL: pyragix-0.1.0-py3-none-any.whl
Upload date: Aug 28, 2025
Size: 12.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.6

File hashes

Hashes for pyragix-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`08ba1bcf9e6f36ef1c4c36f0475e54889fb35d6f7dbea3366d452d73c5141363`
MD5	`bacb1d011327bc6de5e0b06654ca40df`
BLAKE2b-256	`8c528232b49de805359f0e01f4cf8cb5b17e8ce949db8fc0acb4de6347c1decf`
