A RAG-based cheat sheet generator for books and papers
ReadAnyBook
A RAG-based cheat sheet generator that transforms books and papers into structured, 12-page LaTeX cheat sheets.
Features
- Multi-format Document Support: PDF, EPUB, HTML, LaTeX, Markdown
- Intelligent Chunking: Math-aware and code-aware text splitting
- Hybrid Retrieval: Dense embeddings + BM25 with reciprocal rank fusion
- Multi-pass Generation: Separate extraction for concepts, formulas, algorithms, and models
- LaTeX Output: Professional cheat sheets compiled to PDF
- Multiple LLM Backends: HuggingFace, Ollama, vLLM, OpenAI-compatible APIs
- Vector Store Options: ChromaDB, Qdrant, Weaviate
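The hybrid retrieval feature above combines a dense ranking and a BM25 ranking with reciprocal rank fusion. The following is a minimal sketch of that fusion step, not ReadAnyBook's own implementation: the function name, the chunk IDs, and the constant `k=60` (a value commonly used in the literature) are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed constant, not a project default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["c3", "c1", "c7"]  # hypothetical chunk IDs from embedding search
bm25_hits = ["c1", "c9", "c3"]   # hypothetical chunk IDs from BM25 search
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Chunks that rank well in both lists rise to the top, while a chunk found by only one retriever still keeps a place in the fused ranking.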
Quick Start
Installation
# Basic installation
pip install readanybook
# With CLI support (quotes keep shells such as zsh from expanding the brackets)
pip install "readanybook[cli]"
# With all features
pip install "readanybook[all]"
From Source
git clone https://github.com/readanybook/readanybook.git
cd readanybook
pip install -e ".[dev]"
Usage
Command Line
# Generate a cheat sheet from a PDF
read-any-book build document.pdf -o cheatsheet.pdf
# Use a specific profile
read-any-book build document.pdf --profile math_paper
# Index a document
read-any-book index document.pdf --collection my_collection
# Search indexed documents
read-any-book search "gradient descent" --collection my_collection
Python API
from readanybook import CheatSheetPipeline, Settings
# Initialize pipeline
settings = Settings()
pipeline = CheatSheetPipeline(settings)
# Process document
pipeline.ingest("textbook.pdf")
pipeline.index(collection_name="textbook")
# Generate cheat sheet
content = pipeline.generate_content()
cheat_sheet = pipeline.build(content, "output/cheatsheet.pdf")
print(f"Generated: {cheat_sheet.pdf_path}")
REST API
# Start the API server
uvicorn readanybook.api:app --host 0.0.0.0 --port 8000
# Upload a document
curl -X POST "http://localhost:8000/upload" \
  -F "file=@document.pdf" \
  -F "collection_name=my_docs"
# Generate cheat sheet
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{"collection_name": "my_docs", "title": "My Cheat Sheet"}'
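The same `/generate` call can be made from Python. This sketch uses only the standard library and mirrors the curl command above; the helper name `build_generate_request` is hypothetical, and the shape of the server's response is not documented here, so the example builds the request without assuming anything about the reply.

```python
import json
import urllib.request


def build_generate_request(base_url, collection_name, title):
    """Build (but do not send) a POST request for the /generate endpoint."""
    payload = {"collection_name": collection_name, "title": title}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("http://localhost:8000", "my_docs", "My Cheat Sheet")

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     body = response.read()
```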
Configuration
Create a config.yaml file or use environment variables:
# Embedding model
embedding:
  model_name: "BAAI/bge-base-en-v1.5"
  device: "cuda"
# Vector store
vectordb:
  store_type: "chroma"
  persist_directory: "./data/chroma"
# LLM settings
llm:
  backend: "ollama"
  model_name: "llama3:8b"
# Retrieval
retrieval:
  mode: "hybrid"
  top_k: 15
# LaTeX output
latex:
  columns: 2
  font_size: 10
  paper_size: "a4paper"
Configuration Profiles
Use built-in profiles for different document types:
# For technical books
read-any-book build book.pdf --profile technical_book
# For math papers
read-any-book build paper.pdf --profile math_paper
# For non-technical books
read-any-book build novel.pdf --profile nontechnical_book
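One plausible way to think about a profile is as a set of overrides layered on top of the default configuration. The sketch below illustrates that idea with a recursive dictionary merge. It is an assumption about how profiles could work, not a description of ReadAnyBook's internals: `deep_merge` is a hypothetical helper, and the override values are invented for illustration (the actual contents of `math_paper` may differ).

```python
import copy


def deep_merge(base, override):
    """Return a new dict where nested keys in `override` replace those in `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Defaults taken from the example config.yaml above.
defaults = {
    "retrieval": {"mode": "hybrid", "top_k": 15},
    "latex": {"columns": 2, "font_size": 10, "paper_size": "a4paper"},
}

# Hypothetical overrides; the real math_paper profile may differ.
math_paper_overrides = {"retrieval": {"top_k": 25}, "latex": {"columns": 3}}

effective = deep_merge(defaults, math_paper_overrides)
print(effective)
```

Only the keys named in the override change; everything else, such as `font_size` and `paper_size`, is inherited from the defaults.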
Architecture
ReadAnyBook follows a modular pipeline architecture with clear separation between ingestion, retrieval, and generation layers.
System Design
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Ingestion  │───▶│  Chunking   │───▶│  Indexing   │───▶│  Retrieval  │───▶│ Generation  │───▶│    LaTeX    │
│ (PDF/EPUB)  │    │ (Math-      │    │ (Embeddings)│    │ (Hybrid     │    │ (LLM +      │    │ (Compile    │
│             │    │  aware)     │    │             │    │  Search)    │    │  RAG)       │    │  to PDF)    │
└─────────────┘    └─────────────┘    └──────┬──────┘    └──────┬──────┘    └──────┬──────┘    └─────────────┘
                                             │                  │                  │
                                             ▼                  ▼                  ▼
                                      ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
                                      │  Vector DB  │    │ BM25 Index  │    │ LLM Backend │
                                      │ (ChromaDB)  │    │             │    │ (Ollama/    │
                                      │             │    │             │    │  HF/vLLM)   │
                                      └─────────────┘    └─────────────┘    └─────────────┘
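The BM25 index in the diagram supplies the sparse half of hybrid search. As a rough illustration of what such an index computes, here is a small Okapi BM25 scorer in plain Python. It is a sketch under stated assumptions rather than the project's code: the function name, the pre-tokenized toy corpus, and the parameters `k1=1.5` and `b=0.75` (common textbook defaults) are all chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in `corpus_tokens` against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Toy corpus, already tokenized; real chunks would need a tokenizer.
corpus = [
    ["gradient", "descent", "updates", "weights"],
    ["convex", "sets", "and", "functions"],
    ["stochastic", "gradient", "methods"],
]
scores = bm25_scores(["gradient", "descent"], corpus)
print(scores)
```

The first document matches both query terms and scores highest, the third matches one term, and the second matches none and scores zero.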
Key Components
| Component | Description |
|---|---|
| Ingestion | Multi-format document parsing (PDF, EPUB, HTML, LaTeX, Markdown) |
| Chunking | Hierarchical, semantic, or fixed-size with math/code awareness |
| Indexing | BGE/E5 embeddings stored in ChromaDB/Qdrant/Weaviate |
| Retrieval | Hybrid dense+sparse search with RRF fusion and cross-encoder reranking |
| Generation | Multi-pass extraction: concepts, formulas, algorithms, models |
| Output | Jinja2 LaTeX templates compiled to 12-page PDF |
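To make "math/code awareness" in the chunking row concrete, the sketch below shows one simple way a splitter can avoid cutting through display math or fenced code: treat those spans as atomic units and pack units into chunks up to a size limit. This is an illustrative assumption, not ReadAnyBook's chunker; the function names, the `max_chars` limit, and the choice of `$$...$$` and triple-backtick fences as the protected spans are hypothetical.

```python
import re

FENCE = "`" * 3  # triple backtick, built this way to keep the example readable

# Display math ($$...$$) and fenced code blocks are treated as atomic units.
PROTECTED = re.compile(r"(\$\$.*?\$\$|" + FENCE + r".*?" + FENCE + r")", re.DOTALL)


def split_units(text):
    """Split text into paragraphs, keeping protected spans in one piece."""
    units = []
    for piece in PROTECTED.split(text):
        if PROTECTED.fullmatch(piece):
            units.append(piece)
        else:
            units.extend(p.strip() for p in re.split(r"\n\s*\n", piece) if p.strip())
    return units


def chunk_text(text, max_chars=200):
    """Greedily pack units into chunks of at most max_chars characters.

    A single unit longer than max_chars becomes its own oversized chunk
    rather than being cut in the middle.
    """
    chunks, current = [], ""
    for unit in split_units(text):
        candidate = f"{current}\n\n{unit}" if current else unit
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


math_block = "$$\na = b\n\n+ c\n$$"
sample = f"Intro paragraph.\n\n{math_block}\n\nClosing paragraph."
chunks = chunk_text(sample, max_chars=30)
print(chunks)
```

Even though the equation contains a blank line, it stays whole in a single chunk, whereas a naive paragraph splitter would have cut it in two.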
Package Structure
readanybook/
├── core/                    # Domain logic
│   ├── ingestion.py         # Document parsing
│   ├── chunking.py          # Text splitting
│   ├── indexing.py          # Embedding & indexing
│   ├── retrieval.py         # Hybrid retrieval
│   ├── models.py            # LLM clients
│   ├── prompts.py           # Prompt templates
│   └── pipeline.py          # Main orchestrator
├── generation/              # Content generation
│   ├── concepts.py          # Concept extraction
│   ├── formulas.py          # Formula extraction
│   ├── algorithms.py        # Algorithm synthesis
│   ├── models_theory.py     # Model summarization
│   └── latex_builder.py     # LaTeX generation
├── evaluation/              # Quality metrics
│   ├── rag_eval.py          # RAG evaluation
│   └── metrics.py           # Content metrics
├── infra/                   # Infrastructure
│   ├── settings.py          # Configuration
│   ├── vectordb.py          # Vector stores
│   ├── logging.py           # Logging
│   └── tracing.py           # Observability
├── api/                     # REST API
├── cli/                     # Command line interface
├── templates/               # LaTeX templates
└── config/                  # Default configs
Design Principles
- Hexagonal Architecture: Domain services isolated from external adapters
- Configuration-Driven: All behavior controlled via Pydantic settings
- Pluggable Backends: LLM, vector store, and embedding model abstractions
- Observability: Structured logging and tracing throughout
Full documentation: See docs/architecture.pdf for the complete software architecture document.
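The hexagonal and pluggable-backend principles listed above can be illustrated with a small port-and-adapter sketch. Everything named here (`LLMBackend`, `EchoBackend`, `ConceptExtractor`, and the prompt text) is hypothetical and chosen for the example; the project's actual interfaces may look different.

```python
from typing import Protocol


class LLMBackend(Protocol):
    """Port: the only thing domain code needs to know about an LLM."""

    def complete(self, prompt: str) -> str: ...


class EchoBackend:
    """Stand-in adapter for tests; a real adapter would call Ollama, vLLM, etc."""

    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"


class ConceptExtractor:
    """Domain service that depends on the port, never on a concrete backend."""

    def __init__(self, backend: LLMBackend) -> None:
        self._backend = backend

    def extract(self, passage: str) -> str:
        return self._backend.complete(f"List the key concepts in: {passage}")


extractor = ConceptExtractor(EchoBackend())
result = extractor.extract("Gradient descent minimizes a loss function.")
print(result)
```

Because the extractor only sees the protocol, swapping one backend for another is a matter of passing a different adapter, and tests can run without any model at all.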
Requirements
- Python 3.10+
- PyTorch 2.0+
- LaTeX distribution for PDF compilation: TeX Live, MiKTeX, or Tectonic
LaTeX Installation
# Ubuntu/Debian
sudo apt install texlive-full
# macOS
brew install --cask mactex
# Or use Tectonic (lightweight)
cargo install tectonic
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black readanybook tests
isort readanybook tests
# Type check
mypy readanybook
# Lint
ruff check readanybook
Examples
See the examples directory for:
- Processing academic papers
- Creating ML textbook cheat sheets
- Custom template usage
- API integration examples
License
MIT License - see LICENSE for details.
Contributing
Contributions welcome! Please read CONTRIBUTING.md first.
Acknowledgments
- Built with Hugging Face Transformers, ChromaDB, and FastAPI
- Inspired by the need for better study materials