RuneExtract

One extraction API for every document type.

RuneExtract is a universal document extraction library that provides a single, consistent API for extracting content from every supported file type. Whether the source is a PDF, a DOCX file, an HTML page, an image, a YouTube video, or a Notion export, RuneExtract returns the same structured output every time.

Vision

from runeextract import extract

data = extract("report.pdf")

print(data.text)
print(data.tables)
print(data.images)
print(data.metadata)
print(data.chunks())

One API. Any file. Same output schema.

Installation

pip install runeextract

For OCR support:

pip install "runeextract[ocr]"

For AI features:

pip install "runeextract[ai]"

Quick Start

Basic Usage

from runeextract import extract

# Extract from any file type
doc = extract("report.pdf")

# Access extracted content
print(doc.text)           # Full text content
print(doc.tables)         # List of tables
print(doc.images)         # List of images
print(doc.metadata)       # Document metadata
print(doc.chunks())       # Chunked content for RAG

With Options

doc = extract(
    "report.pdf",
    ocr=True,
    tables=True,
    chunking_strategy="semantic"
)

Batch Processing

from runeextract import extract_many

docs = extract_many([
    "a.pdf",
    "b.docx",
    "c.html"
])
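
The README does not say how extract_many behaves when one file in the batch fails. As a rough sketch of one possible policy (collect results and record errors per file instead of aborting), the helper below wraps any single-file extraction function. The names extract_each and fake_extract are hypothetical and are not part of RuneExtract.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def extract_each(
    paths: Iterable[str],
    extract_fn: Callable[[str], object],
) -> Tuple[List[object], Dict[str, str]]:
    """Run extract_fn on every path, isolating failures per file."""
    docs: List[object] = []
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            docs.append(extract_fn(path))
        except Exception as exc:  # keep going; record what failed and why
            errors[path] = f"{type(exc).__name__}: {exc}"
    return docs, errors


def fake_extract(path: str) -> str:
    # Stand-in for runeextract.extract so the sketch runs on its own.
    if path.endswith(".bad"):
        raise ValueError("unsupported file type")
    return f"document from {path}"


docs, errors = extract_each(["a.pdf", "b.docx", "c.bad"], fake_extract)
```

In real use, extract_fn would be the library's own extract function; the stand-in only keeps the example self-contained.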

Universal Schema

All extractors return the same Document schema:

from typing import List

class Document:
    text: str                    # Full text content
    tables: List[Table]          # Extracted tables
    images: List[Image]          # Extracted images
    metadata: dict               # Document metadata
    source_type: str             # File type identifier

    # Chunked content; a method, called as doc.chunks(strategy="semantic", size=1000)
    def chunks(self, **options) -> List[Chunk]: ...

Supported File Types

Format          Status     Extracted Content
PDF             ✅ MVP     text, tables, images, metadata
DOCX            ✅ MVP     paragraphs, tables, images, headers, footers
PPTX            ✅ MVP     slides, speaker notes, images
XLSX            ✅ MVP     worksheets, tables, formulas
HTML            ✅ MVP     headings, paragraphs, tables, links
Markdown        ✅ MVP     headings, lists, code blocks, tables
Images          ⏳ v0.2    text (OCR), bounding boxes, confidence
Scanned PDFs    ⏳ v0.2    text via OCR
YouTube         ⏳ v0.3    transcript, timestamps, chapters, metadata
Notion          ⏳ v0.3    pages, databases, content

Features

Phase 1: Core Extractors (MVP)

  • PDF: Extract text, tables, images, and metadata using PyMuPDF and pdfplumber
  • DOCX: Extract paragraphs, tables, images, headers, and footers
  • PPTX: Extract slides, text, tables, and images using python-pptx
  • XLSX: Extract worksheets, tables, and metadata using openpyxl
  • HTML: Parse headings, paragraphs, tables, and links with BeautifulSoup
  • Markdown: Extract headings, lists, code blocks, and tables

Phase 2: OCR Support

Extract text from images and scanned documents:

doc = extract("invoice.jpg", ocr=True)
# Returns: text, bounding boxes, confidence scores

Supports:

  • Images (JPG, PNG, etc.)
  • Scanned PDFs (automatic detection and OCR processing)
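
The README mentions automatic detection of scanned PDFs but does not describe the method. One common heuristic, shown here purely as an assumption, is to check how much text the pages already yield and fall back to OCR when most pages are nearly empty. The function name needs_ocr and both thresholds are hypothetical.

```python
from typing import List


def needs_ocr(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF is scanned from the text its pages already yield."""
    if not page_texts:
        return False
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    # Treat the document as scanned when most pages have almost no text layer.
    return sparse / len(page_texts) > 0.5


digital = ["Quarterly revenue grew by four percent.", "Costs were flat year over year."]
scanned = ["", "  ", "3"]
```

Here needs_ocr(digital) is False and needs_ocr(scanned) is True. A real detector would probably also look at embedded images per page.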

Phase 3: Advanced Table Extraction

Unified table extraction across formats:

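from typing import List

import pandas as pd

class Table:
    rows: List[List[str]]        # Cell values, one inner list per row
    columns: List[str]           # Column names
    dataframe: pd.DataFrame      # Same data as a pandas DataFrame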

Supported for: PDF, DOCX, HTML, XLSX
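
To illustrate how raw tables from different formats might be brought into one shape, the sketch below treats the first row as the header, converts cells to strings, and pads ragged rows. TableSketch and its methods are hypothetical names for this example and do not reflect RuneExtract's internals. The dataframe field is left out so the example needs only the standard library.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TableSketch:
    columns: List[str]
    rows: List[List[str]]

    @classmethod
    def from_raw(cls, raw: List[List[object]]) -> "TableSketch":
        """Treat the first raw row as the header and even out the other rows."""
        if not raw:
            return cls(columns=[], rows=[])
        columns = [str(cell).strip() for cell in raw[0]]
        rows = []
        for raw_row in raw[1:]:
            cells = ["" if cell is None else str(cell).strip() for cell in raw_row]
            cells += [""] * (len(columns) - len(cells))  # pad short rows
            rows.append(cells[: len(columns)])           # trim overlong rows
        return cls(columns=columns, rows=rows)

    def records(self) -> List[Dict[str, str]]:
        """Rows as dicts, a shape that pandas.DataFrame also accepts."""
        return [dict(zip(self.columns, row)) for row in self.rows]


table = TableSketch.from_raw([["Item", "Qty"], ["Widget", 3], ["Gadget", None], ["Bolt"]])
```

With pandas installed, pd.DataFrame(table.records()) would give the DataFrame view.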

Phase 4: Intelligent Chunking

Optimize content for RAG applications:

chunks = doc.chunks(
    strategy="semantic",  # by_page, by_heading, semantic, fixed_size
    size=1000
)

Chunking strategies:

  • by_page: Split by document pages
  • by_heading: Split by document structure
  • semantic: AI-powered semantic chunking
  • fixed_size: Fixed-length chunks
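
Two of these strategies are simple enough to sketch in plain Python. The functions below are illustrative only: their names are hypothetical, fixed_size is assumed to count characters, and by_heading is assumed to split on Markdown-style heading lines. The library may define both differently.

```python
import re
from typing import List


def chunk_fixed_size(text: str, size: int = 1000) -> List[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i : i + size] for i in range(0, len(text), size)]


def chunk_by_heading(text: str) -> List[str]:
    """Start a new chunk at every Markdown-style heading line."""
    chunks: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


sample = "# Intro\nHello.\n\n## Details\nMore text."
by_heading = chunk_by_heading(sample)
fixed = chunk_fixed_size("abcdefghij", size=4)
```

Fixed-size chunks can cut a sentence in half; heading-based chunks keep sections together but vary widely in length, which is the usual trade-off between the two.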

Phase 5: Automatic Metadata

Extract rich metadata automatically:

{
    "title": "",
    "author": "",
    "created_at": "",
    "language": "",
    "keywords": []
}
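
Source formats name their metadata fields differently, so some normalization step is needed to reach the keys above. The sketch below shows one way this could work. The function name normalize_metadata and the alias keys it checks are assumptions made for this example.

```python
from typing import Any, Dict


def normalize_metadata(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map format-specific metadata keys onto the shared schema keys."""
    lowered = {str(key).lower(): value for key, value in raw.items()}
    result: Dict[str, Any] = {
        "title": "", "author": "", "created_at": "", "language": "", "keywords": [],
    }
    for key in ("title", "author", "language"):
        if lowered.get(key):
            result[key] = str(lowered[key]).strip()
    # "creationdate" and "created" are assumed aliases, not a documented list.
    created = lowered.get("created_at") or lowered.get("creationdate") or lowered.get("created")
    if created:
        result["created_at"] = str(created)
    keywords = lowered.get("keywords") or []
    if isinstance(keywords, str):  # some formats store keywords as one delimited string
        keywords = keywords.replace(";", ",").split(",")
    result["keywords"] = [str(k).strip() for k in keywords if str(k).strip()]
    return result


meta = normalize_metadata(
    {"Title": " Annual Report ", "Author": "A. Writer", "Keywords": "finance; 2024, audit"}
)
```

Missing fields stay as empty strings or an empty list, so callers can rely on every key being present.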

Phase 6: YouTube Integration

Extract video content:

doc = extract("https://youtube.com/watch?v=...")
# Returns: transcript, timestamps, chapters, metadata
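
Before a transcript can be fetched, the video ID has to be pulled out of the URL. The helper below is a minimal sketch using only the standard library. The name youtube_video_id is hypothetical, and only the watch?v= and youtu.be URL forms are handled.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube URL, or None if it is not one."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    if host in ("youtube.com", "m.youtube.com") and parsed.path == "/watch":
        # parse_qs maps each query key to a list of values
        return parse_qs(parsed.query).get("v", [None])[0]
    return None


video_id = youtube_video_id("https://www.youtube.com/watch?v=abc123XYZ_-&t=10s")
```

A router could use a None result to decide that the input is a local file or some other kind of URL.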

Phase 7: Notion Import

Import Notion exports:

doc = extract("notion_export.zip")
# Returns: pages, databases, content
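
As a sketch of a first step for this importer, the function below opens an export archive and groups its members. It assumes a Markdown-and-CSV style export in which pages are .md files and databases are .csv files. The name list_notion_export is hypothetical.

```python
import io
import zipfile
from typing import Dict, List


def list_notion_export(zip_source) -> Dict[str, List[str]]:
    """Group archive members into pages (.md), databases (.csv), and the rest."""
    groups: Dict[str, List[str]] = {"pages": [], "databases": [], "other": []}
    with zipfile.ZipFile(zip_source) as archive:
        for name in sorted(archive.namelist()):
            if name.endswith("/"):
                continue  # directory entry, not a file
            lower = name.lower()
            if lower.endswith(".md"):
                groups["pages"].append(name)
            elif lower.endswith(".csv"):
                groups["databases"].append(name)
            else:
                groups["other"].append(name)
    return groups


# Build a tiny in-memory archive so the sketch runs without a real export.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Roadmap.md", "# Roadmap")
    archive.writestr("Tasks.csv", "Name,Status\nShip,Done\n")
    archive.writestr("assets/logo.png", b"\x89PNG")
buffer.seek(0)
listing = list_notion_export(buffer)
```

The same function accepts a path such as "notion_export.zip", since zipfile.ZipFile takes either a filename or a file-like object.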

Phase 8: CLI Tool

Command-line interface for quick extraction:

# Basic extraction
runeextract file.pdf

# Advanced options
runeextract file.pdf --chunks --ocr --tables --output document.json
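
For readers curious how such a command line could be wired up, here is a minimal argparse sketch that accepts the flags shown above. It is not the project's actual cli/main.py. The help strings and the assumption that --output defaults to standard output are illustrative.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="runeextract", description="Extract content from a file."
    )
    parser.add_argument("source", help="path or URL to extract from")
    parser.add_argument("--chunks", action="store_true", help="include chunked content")
    parser.add_argument("--ocr", action="store_true", help="run OCR on images and scans")
    parser.add_argument("--tables", action="store_true", help="include extracted tables")
    parser.add_argument("--output", metavar="FILE", default=None,
                        help="write JSON to FILE (assumed default: standard output)")
    return parser


def parse_cli(argv: Optional[List[str]] = None) -> argparse.Namespace:
    # argv=None makes argparse read sys.argv[1:], as a real entry point would.
    return build_parser().parse_args(argv)


args = parse_cli(["file.pdf", "--chunks", "--ocr", "--tables", "--output", "document.json"])
```

Passing an explicit argument list, as in the last line, makes the parser easy to exercise in tests without touching sys.argv.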

Phase 9: Async Processing

For large files and batch processing:

from runeextract import extract_async

# await must run inside a coroutine (for example, one started with asyncio.run())
doc = await extract_async("large.pdf")
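
One common way to offer an async API on top of blocking extraction code is to run each call in a worker thread. The sketch below shows that pattern with a stand-in function. It is an assumption about how such a feature could be built, not a description of extract_async itself, and every name in it is hypothetical. It needs Python 3.9 or later for asyncio.to_thread.

```python
import asyncio
import time
from typing import List


def blocking_extract(path: str) -> str:
    # Stand-in for a slow, synchronous extraction call.
    time.sleep(0.05)
    return f"text of {path}"


async def extract_in_thread(path: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_extract, path)


async def extract_concurrently(paths: List[str]) -> List[str]:
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(extract_in_thread(p) for p in paths))


results = asyncio.run(extract_concurrently(["a.pdf", "b.docx", "c.html"]))
```

Threads help most when the work is I/O-bound or releases the GIL; CPU-heavy parsing may call for a process pool instead.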

Phase 10: AI Features (Optional)

Enhanced analysis with AI:

pip install "runeextract[ai]"

from runeextract import extract

doc = extract("report.pdf")
print(doc.summary())
print(doc.keywords())
print(doc.entities())
print(doc.questions())
print(doc.flashcards())

Plugin System

Extend RuneExtract with custom extractors:

from runeextract.core.registry import register_extractor

@register_extractor(".epub")
class EPUBExtractor:
    def extract(self, file_path):
        # Your extraction logic
        return Document(...)

Then use it seamlessly:

extract("book.epub")  # Works automatically

Project Structure

runeextract/
├── core/
│   ├── extractor.py      # Base extractor class
│   ├── registry.py       # Plugin registry
│   ├── router.py         # File type routing
│   └── schemas.py        # Data models
│
├── extractors/
│   ├── pdf/              # PDF extraction
│   ├── docx/             # DOCX extraction
│   ├── pptx/             # PPTX extraction
│   ├── xlsx/             # XLSX extraction
│   ├── html/             # HTML extraction
│   ├── markdown/         # Markdown extraction
│   ├── image/            # Image/OCR extraction
│   ├── audio/            # Audio extraction
│   ├── video/            # Video extraction
│   ├── youtube/          # YouTube extraction
│   └── notion/           # Notion extraction
│
├── processors/
│   ├── ocr.py            # OCR processing
│   ├── tables.py         # Table extraction
│   ├── chunking.py       # Content chunking
│   ├── metadata.py       # Metadata extraction
│   └── cleaning.py       # Text cleaning
│
├── models/
│   ├── document.py       # Document model
│   ├── table.py          # Table model
│   ├── image.py          # Image model
│   └── chunk.py          # Chunk model
│
├── cli/
│   └── main.py           # CLI interface
│
└── tests/

Architecture

File
 ↓
Router (detects file type)
 ↓
Appropriate Extractor
 ↓
Normalization Layer
 ↓
Document Object (unified schema)
 ↓
Return
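
The flow above can be sketched end to end in a few functions. Everything here is a simplified assumption for illustration: the routing rules, the stand-in extractor, and the DocumentSketch class are hypothetical and much smaller than the real Document schema.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict


@dataclass
class DocumentSketch:
    text: str
    source_type: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def route(source: str) -> str:
    """Router: pick a source type from a URL host or a file extension."""
    if source.startswith(("http://", "https://")):
        return "youtube" if "youtube.com" in source or "youtu.be" in source else "html"
    return Path(source).suffix.lower().lstrip(".") or "unknown"


def extract_raw(source: str, source_type: str) -> Dict[str, Any]:
    """Extractor stand-in: returns format-specific raw output."""
    return {"body": f"  raw {source_type} content of {source}\r\n", "info": {"Title": "Demo"}}


def normalize(raw: Dict[str, Any], source_type: str) -> DocumentSketch:
    """Normalization layer: clean the text and lower-case metadata keys."""
    text = raw["body"].replace("\r\n", "\n").strip()
    metadata = {key.lower(): value for key, value in raw.get("info", {}).items()}
    return DocumentSketch(text=text, source_type=source_type, metadata=metadata)


def run_pipeline(source: str) -> DocumentSketch:
    source_type = route(source)
    return normalize(extract_raw(source, source_type), source_type)


doc = run_pipeline("report.PDF")
```

Keeping normalization as its own step is what lets every extractor return whatever is natural for its format while callers still see one schema.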

Dependencies

Core

  • pymupdf - PDF processing
  • pdfplumber - Advanced PDF table extraction
  • python-docx - DOCX processing
  • python-pptx - PPTX processing
  • openpyxl - XLSX processing
  • pandas - Data manipulation
  • beautifulsoup4 - HTML parsing
  • lxml - XML/HTML parsing
  • markdown-it-py - Markdown parsing

OCR (optional)

  • easyocr or rapidocr - Text recognition

YouTube (optional)

  • youtube-transcript-api - Transcript extraction
  • yt-dlp - Video metadata

AI Features (optional)

  • openai or similar - AI-powered analysis

Development Roadmap

v0.1 (MVP) — ✅ Current Release

  • ✅ PDF extraction
  • ✅ DOCX extraction
  • ✅ PPTX extraction
  • ✅ XLSX extraction
  • ✅ HTML extraction
  • ✅ Markdown extraction
  • ✅ CLI interface
  • ✅ Chunking strategies
  • ✅ Plugin system

v0.2 (Planned)

  • ⏳ OCR support (images and scanned PDFs)
  • ⏳ YouTube integration
  • ⏳ Notion import
  • ⏳ Async processing
  • ⏳ AI features

Contributing

Contributions are welcome! Please see our Contributing Guidelines for details.

Development Setup

# Clone the repository
git clone https://github.com/Rohithdgrr/RUNEEXTRACT-PACKAGE.git
cd RUNEEXTRACT-PACKAGE

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev,ocr,ai]"

# Run tests
pytest

# Run linting
black runeextract/
flake8 runeextract/

License

MIT License - see LICENSE for details.

Why RuneExtract?

The current ecosystem requires different libraries for different file types:

PyPDF            PDF
python-docx      DOCX
BeautifulSoup    HTML
EasyOCR          Images

RuneExtract unifies all of this:

extract(anything)

That simplicity is the entire product.

Acknowledgments

Built with inspiration from the document processing community and the need for a unified extraction API.

Contact


RuneExtract - One API to extract them all.
