A RAG-based cheat sheet generator for books and papers

ReadAnyBook 📚

PyPI Version | Python 3.10+ | License: MIT

A RAG-based cheat sheet generator that transforms books and papers into structured, 12-page LaTeX cheat sheets.

🚀 Try It Now

Open In Kaggle Open In Colab

Features

  • Multi-format Document Support: PDF, EPUB, HTML, LaTeX, Markdown
  • Intelligent Chunking: Math-aware and code-aware text splitting
  • Hybrid Retrieval: Dense embeddings + BM25 with reciprocal rank fusion
  • Multi-pass Generation: Separate extraction for concepts, formulas, algorithms, and models
  • LaTeX Output: Professional cheat sheets compiled to PDF
  • Multiple LLM Backends: HuggingFace, Ollama, vLLM, OpenAI-compatible APIs
  • Vector Store Options: ChromaDB, Qdrant, Weaviate
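
The hybrid retrieval feature combines a dense ranking and a BM25 ranking with reciprocal rank fusion (RRF). The sketch below illustrates the general technique only; the function name, the chunk IDs, and the constant k=60 are assumptions for illustration and are not taken from ReadAnyBook's code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is a commonly used constant and is an
    assumption here, not necessarily ReadAnyBook's default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["c3", "c1", "c7"]  # hypothetical dense-embedding results
bm25_hits = ["c1", "c9", "c3"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

RRF only needs ranks, not raw scores, so the dense and sparse retrievers do not have to produce comparable score scales.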

Quick Start

Installation

# Basic installation
pip install readanybook

# With CLI support
pip install "readanybook[cli]"

# With all features
pip install "readanybook[all]"

From Source

git clone https://github.com/readanybook/readanybook.git
cd readanybook
pip install -e ".[dev]"

Usage

Command Line

# Generate a cheat sheet from a PDF
read-any-book build document.pdf -o cheatsheet.pdf

# Use a specific profile
read-any-book build document.pdf --profile math_paper

# Index a document
read-any-book index document.pdf --collection my_collection

# Search indexed documents
read-any-book search "gradient descent" --collection my_collection

Python API

from readanybook import CheatSheetPipeline, Settings

# Initialize pipeline
settings = Settings()
pipeline = CheatSheetPipeline(settings)

# Process document
pipeline.ingest("textbook.pdf")
pipeline.index(collection_name="textbook")

# Generate cheat sheet
content = pipeline.generate_content()
cheat_sheet = pipeline.build(content, "output/cheatsheet.pdf")

print(f"Generated: {cheat_sheet.pdf_path}")

Quick Start (One-Liner)

from readanybook import build_cheatsheet

# Generate a cheat sheet with a single function call
output = build_cheatsheet(
    "textbook.pdf",
    llm_backend="huggingface",      # or "ollama" for local
    llm_model="Qwen/Qwen2.5-1.5B-Instruct",
    output_format="markdown",        # "latex", "markdown", or "both"
    in_memory=True                   # Required for Kaggle/Colab
)

REST API

# Start the API server
uvicorn readanybook.api:app --host 0.0.0.0 --port 8000

# Upload a document
curl -X POST "http://localhost:8000/upload" \
  -F "file=@document.pdf" \
  -F "collection_name=my_docs"

# Generate cheat sheet
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{"collection_name": "my_docs", "title": "My Cheat Sheet"}'
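
The same /generate call can be issued from Python. The sketch below uses only the standard library and builds the request without sending it; the helper name is hypothetical, and the response format is not documented here, so the response body is left unparsed.

```python
import json
import urllib.request


def build_generate_request(base_url, collection_name, title):
    """Build (but do not send) the POST /generate request from the curl example."""
    payload = json.dumps({"collection_name": collection_name, "title": title})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("http://localhost:8000", "my_docs", "My Cheat Sheet")

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     body = response.read()
```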

Configuration

Create a config.yaml file or use environment variables:

# Embedding model
embedding:
  model_name: "BAAI/bge-base-en-v1.5"
  device: "cuda"

# Vector store
vectordb:
  store_type: "chroma"
  persist_directory: "./data/chroma"

# LLM settings
llm:
  backend: "ollama"
  model_name: "llama3:8b"

# Retrieval
retrieval:
  mode: "hybrid"
  top_k: 15
  
# LaTeX output
latex:
  columns: 2
  font_size: 10
  paper_size: "a4paper"

Configuration Profiles

Use built-in profiles for different document types:

# For technical books
read-any-book build book.pdf --profile technical_book

# For math papers
read-any-book build paper.pdf --profile math_paper

# For non-technical books
read-any-book build novel.pdf --profile nontechnical_book

Architecture

ReadAnyBook follows a modular pipeline architecture with clear separation between ingestion, retrieval, and generation layers.

System Design

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Ingestion  │───▶│  Chunking   │───▶│  Indexing   │───▶│  Retrieval  │───▶│ Generation  │───▶│   LaTeX     │
│  (PDF/EPUB) │    │  (Math-     │    │ (Embeddings)│    │  (Hybrid    │    │  (LLM +     │    │  (Compile   │
│             │    │   aware)    │    │             │    │   Search)   │    │   RAG)      │    │   to PDF)   │
└─────────────┘    └─────────────┘    └──────┬──────┘    └──────┬──────┘    └──────┬──────┘    └─────────────┘
                                             │                  │                  │
                                             ▼                  ▼                  ▼
                                      ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
                                      │ Vector DB   │    │ BM25 Index  │    │ LLM Backend │
                                      │ (ChromaDB)  │    │             │    │  (Ollama/   │
                                      │             │    │             │    │   HF/vLLM)  │
                                      └─────────────┘    └─────────────┘    └─────────────┘
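
The BM25 index in the design above supplies the sparse half of hybrid search. As a rough illustration of what such an index computes, here is a minimal Okapi BM25 scorer. It is a sketch under assumptions: the whitespace tokenizer, the parameters k1=1.5 and b=0.75, and the smoothed IDF variant are common choices, not ReadAnyBook's confirmed implementation.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF; this variant is never negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: longer documents are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


chunks = [
    "gradient descent minimizes the loss",
    "the black scholes formula prices options",
    "stochastic gradient descent uses mini batches",
]
scores = bm25_scores("gradient descent", chunks)
```

Both matching chunks contain each query term once, so the shorter first chunk scores higher than the third, and the second chunk scores zero.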

Key Components

Component    Description
---------    -----------
Ingestion    Multi-format document parsing (PDF, EPUB, HTML, LaTeX)
Chunking     Hierarchical, semantic, or fixed-size with math/code awareness
Indexing     BGE/E5 embeddings stored in ChromaDB/Qdrant/Weaviate
Retrieval    Hybrid dense+sparse search with RRF fusion and cross-encoder reranking
Generation   Multi-pass extraction: concepts, formulas, algorithms, models
Output       Jinja2 LaTeX templates compiled to 12-page PDF
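
To make "math awareness" in the chunking step concrete, the sketch below splits text into size-limited chunks while never cutting through a display-math block. It is one simple way to get that behavior, not the project's actual chunker: the function name, the regular expression, and the character-based size limit are all assumptions.

```python
import re

# Matches display math: $$...$$ or \begin{equation}...\end{equation}.
MATH_BLOCK = re.compile(
    r"\$\$.*?\$\$|\\begin\{equation\}.*?\\end\{equation\}", re.DOTALL
)


def split_math_aware(text, max_chars=200):
    """Split text into chunks of about max_chars without cutting a math block.

    Whitespace between words is normalized to single spaces.
    """
    # Tokenize into atomic units: whole math blocks, and the words in between.
    units, pos = [], 0
    for match in MATH_BLOCK.finditer(text):
        units.extend(text[pos:match.start()].split())
        units.append(match.group())
        pos = match.end()
    units.extend(text[pos:].split())

    chunks, current = [], ""
    for unit in units:
        candidate = f"{current} {unit}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it overflows
            current = unit          # an oversized math block stays whole
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = r"The loss is defined as $$L = \sum_i (y_i - f(x_i))^2$$ and is minimized by gradient descent."
chunks = split_math_aware(sample, max_chars=40)
for chunk in chunks:
    print(repr(chunk))
```

A math block longer than max_chars is emitted as its own oversized chunk rather than being split, which trades a strict size limit for intact formulas.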

Package Structure

readanybook/
├── core/           # Domain logic
│   ├── ingestion.py    # Document parsing
│   ├── chunking.py     # Text splitting
│   ├── indexing.py     # Embedding & indexing
│   ├── retrieval.py    # Hybrid retrieval
│   ├── models.py       # LLM clients
│   ├── prompts.py      # Prompt templates
│   └── pipeline.py     # Main orchestrator
├── generation/     # Content generation
│   ├── concepts.py     # Concept extraction
│   ├── formulas.py     # Formula extraction
│   ├── algorithms.py   # Algorithm synthesis
│   ├── models_theory.py # Model summarization
│   └── latex_builder.py # LaTeX generation
├── evaluation/     # Quality metrics
│   ├── rag_eval.py     # RAG evaluation
│   └── metrics.py      # Content metrics
├── infra/          # Infrastructure
│   ├── settings.py     # Configuration
│   ├── vectordb.py     # Vector stores
│   ├── logging.py      # Logging
│   └── tracing.py      # Observability
├── api/            # REST API
├── cli/            # Command line interface
├── templates/      # LaTeX templates
└── config/         # Default configs

Design Principles

  • Hexagonal Architecture: Domain services isolated from external adapters
  • Configuration-Driven: All behavior controlled via Pydantic settings
  • Pluggable Backends: LLM, vector store, and embedding model abstractions
  • Observability: Structured logging and tracing throughout
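
The hexagonal and pluggable-backend principles can be illustrated with a small port-and-adapter sketch. All class and method names below are hypothetical and chosen for illustration; they are not ReadAnyBook's actual interfaces.

```python
from typing import Protocol


class LLMBackend(Protocol):
    """Port: the only interface the domain code depends on (hypothetical)."""

    def generate(self, prompt: str) -> str: ...


class EchoBackend:
    """Adapter for tests; a real adapter would call Ollama, vLLM, and so on."""

    def generate(self, prompt: str) -> str:
        return f"[echo] {prompt}"


class ConceptExtractor:
    """Domain service: it does not know which backend it was given."""

    def __init__(self, llm: LLMBackend) -> None:
        self.llm = llm

    def extract(self, passage: str) -> str:
        return self.llm.generate(f"List the key concepts in: {passage}")


extractor = ConceptExtractor(EchoBackend())
result = extractor.extract("Gradient descent updates weights iteratively.")
print(result)
```

Because the domain service depends only on the protocol, swapping the LLM backend becomes a configuration change rather than a code change in the extraction logic.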

📄 Full documentation: See docs/architecture.pdf for the complete software architecture document.

Requirements

  • Python 3.10+
  • PyTorch 2.0+
  • LaTeX distribution (for PDF compilation)
    • TeX Live, MiKTeX, or Tectonic

LaTeX Installation

# Ubuntu/Debian
sudo apt install texlive-full

# macOS
brew install --cask mactex

# Or use Tectonic (lightweight)
cargo install tectonic

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black readanybook tests
isort readanybook tests

# Type check
mypy readanybook

# Lint
ruff check readanybook

📓 Kaggle Experiment

We tested ReadAnyBook on Kaggle with Hull's "Options, Futures, and Other Derivatives" (902 pages) using a T4 GPU:

Model                              Backend       Time      Notes
-----                              -------       ----      -----
microsoft/phi-2                    HuggingFace   ~15 min   Fast, basic quality
Qwen/Qwen2.5-1.5B-Instruct         HuggingFace   ~20 min   Better reasoning
meta-llama/Llama-3.2-3B-Instruct   HuggingFace   ~30 min   Best quality (needs HF token)

Note: Use latex_only=True on Kaggle/Colab since pdflatex is not available.

Examples

See the examples directory for:

  • 📓 readanybook_demo.ipynb - Interactive notebook tutorial
  • Processing academic papers
  • Creating ML textbook cheat sheets
  • Custom template usage
  • API integration examples

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Please read CONTRIBUTING.md first.

Acknowledgments

  • Built with 🤗 Transformers, ChromaDB, and FastAPI
  • Inspired by the need for better study materials
