RuneExtract

One extraction API for every document type.

RuneExtract is a universal document extraction library that provides a single, consistent API for extracting content from any file type. Whether the input is a PDF, a DOCX file, an HTML page, an image, a YouTube video, or a Notion export, RuneExtract returns the same structured output every time.

Vision

from runeextract import extract

data = extract("report.pdf")

print(data.text)
print(data.tables)
print(data.images)
print(data.metadata)
print(data.chunks())

One API. Any file. Same output schema.

Installation

pip install runeextract

For OCR support:

pip install "runeextract[ocr]"

For AI features:

pip install "runeextract[ai]"

Quick Start

Basic Usage

from runeextract import extract

# Extract from any file type
doc = extract("report.pdf")

# Access extracted content
print(doc.text)           # Full text content
print(doc.tables)         # List of tables
print(doc.images)         # List of images
print(doc.metadata)       # Document metadata
print(doc.chunks())       # Chunked content for RAG

With Options

doc = extract(
    "report.pdf",
    ocr=True,
    tables=True,
    chunking_strategy="semantic"
)

Batch Processing

from runeextract import extract_many

docs = extract_many([
    "a.pdf",
    "b.docx",
    "c.html"
])
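When a batch mixes many formats, one unreadable file should not discard the rest. The following is a minimal sketch of that idea using only the standard library. `extract_many_safe` and `fake_extract` are hypothetical names used for illustration; this is not how `extract_many` is implemented, and its error behavior is not documented here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple


def extract_many_safe(
    paths: List[str],
    extract_one: Callable[[str], str],
    max_workers: int = 4,
) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Run extract_one on every path and keep failures separate from results."""
    results: Dict[str, str] = {}
    errors: Dict[str, Exception] = {}

    def run(path: str) -> Tuple[str, Optional[str], Optional[Exception]]:
        # Catch per-file errors so one bad input does not stop the batch.
        try:
            return path, extract_one(path), None
        except Exception as exc:
            return path, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields outcomes in the same order as the input paths.
        for path, value, error in pool.map(run, paths):
            if error is None:
                results[path] = value
            else:
                errors[path] = error
    return results, errors


def fake_extract(path: str) -> str:
    """Stand-in extractor (hypothetical) that rejects HTML files."""
    if path.endswith(".html"):
        raise ValueError("unsupported in this demo")
    return f"text from {path}"


results, errors = extract_many_safe(["a.pdf", "b.docx", "c.html"], fake_extract)
```

Here `results` holds the two successful extractions and `errors` maps `"c.html"` to the exception it raised.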

Universal Schema

All extractors return the same Document schema:

class Document:
    text: str                    # Full text content
    tables: List[Table]          # Extracted tables
    images: List[Image]          # Extracted images
    metadata: dict               # Document metadata
    source_type: str             # File type identifier

    def chunks(self, strategy: str = ..., size: int = ...) -> List[Chunk]: ...   # Chunked content
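A minimal sketch of how such a schema could be modeled with dataclasses. The field names follow the listing above. The `Chunk` fields, the fixed-size splitting inside `chunks()`, and the default `size` are assumptions made for illustration, not RuneExtract's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Chunk:
    text: str   # chunk content
    index: int  # position of the chunk in the document (assumed field)


@dataclass
class Document:
    text: str = ""
    tables: List[Any] = field(default_factory=list)
    images: List[Any] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)
    source_type: str = "unknown"

    def chunks(self, size: int = 1000) -> List[Chunk]:
        """Split the text into fixed-size pieces (illustrative strategy only)."""
        if size <= 0:
            raise ValueError("size must be positive")
        return [
            Chunk(text=self.text[start:start + size], index=i)
            for i, start in enumerate(range(0, len(self.text), size))
        ]


doc = Document(text="abcdefghij", source_type="pdf")
pieces = doc.chunks(size=4)
```

Because every field has a default, an extractor that finds no tables or images still returns a complete object, which is the property a unified schema depends on.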

Supported File Types

Format	Status	Extracted Content
PDF	✅ MVP	text, tables, images, metadata
DOCX	✅ MVP	paragraphs, tables, images, headers, footers
PPTX	✅ MVP	slides, speaker notes, images
XLSX	✅ MVP	worksheets, tables, formulas
HTML	✅ MVP	headings, paragraphs, tables, links
Markdown	✅ MVP	headings, lists, code blocks, tables
Images	✅ v0.2	text (OCR), bounding boxes, confidence
Scanned PDFs	✅ v0.2	text via OCR
YouTube	✅ v0.3	transcript, timestamps, chapters, metadata
Notion	✅ v0.3	pages, databases, content

Features

Phase 1: Core Extractors (MVP)

PDF: Extract text, tables, images, and metadata using PyMuPDF and pdfplumber
DOCX: Extract paragraphs, tables, images, headers, and footers
PPTX: Extract slides, text, tables, and images using python-pptx
XLSX: Extract worksheets, tables, and metadata using openpyxl
HTML: Parse headings, paragraphs, tables, and links with BeautifulSoup
Markdown: Extract headings, lists, code blocks, and tables

Phase 2: OCR Support

Extract text from images and scanned documents:

doc = extract("invoice.jpg", ocr=True)
# Returns: text, bounding boxes, confidence scores

Supports:

Images (JPG, PNG, etc.)
Scanned PDFs (automatic detection and OCR processing)
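How the automatic detection works is not described here. One common heuristic, sketched below as an assumption, is to look at the text layer of each page and treat pages with almost no selectable text as scanned. The function names and the thresholds (`min_chars`, `threshold`) are hypothetical.

```python
from typing import List


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return indexes of pages whose text layer is empty or nearly empty."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


def looks_scanned(
    page_texts: List[str], min_chars: int = 20, threshold: float = 0.5
) -> bool:
    """Treat a document as scanned when enough of its pages lack text."""
    if not page_texts:
        return False
    sparse = len(pages_needing_ocr(page_texts, min_chars))
    return sparse / len(page_texts) >= threshold


# Text layer per page, as a PDF library would report it (sample data).
pages = ["A full page of selectable text content.", "", "  \n"]
ocr_pages = pages_needing_ocr(pages)
scanned = looks_scanned(pages)
```

Checking per page rather than per document lets a mixed file run OCR only on the pages that need it.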

Phase 3: Advanced Table Extraction

Unified table extraction across formats:

class Table:
    rows: List[List[str]]
    columns: List[str]
    dataframe: pd.DataFrame

Supported for: PDF, DOCX, HTML, XLSX
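To show how tables from different formats could land in one shape, here is a small sketch that builds a `Table` from a raw grid of cells. `from_grid` and `records` are hypothetical helpers, and the `dataframe` field is left out so the example needs no third-party packages.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Table:
    columns: List[str]
    rows: List[List[str]]

    @classmethod
    def from_grid(cls, grid: List[List[str]]) -> "Table":
        """Treat the first row of a raw cell grid as the header."""
        if not grid:
            return cls(columns=[], rows=[])
        header, *body = grid
        width = len(header)
        # Pad short rows and trim long ones so every row matches the header.
        rows = [(row + [""] * width)[:width] for row in body]
        return cls(columns=list(header), rows=rows)

    def records(self) -> List[Dict[str, str]]:
        """Return one dict per row, keyed by column name."""
        return [dict(zip(self.columns, row)) for row in self.rows]


table = Table.from_grid([["item", "qty"], ["apple", "3"], ["pear"]])
```

The list returned by `records()` is a shape that `pd.DataFrame(...)` accepts directly, which is one way a `dataframe` view could be derived.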

Phase 4: Intelligent Chunking

Optimize content for RAG applications:

chunks = doc.chunks(
    strategy="semantic",  # by_page, by_heading, semantic, fixed_size
    size=1000
)

Chunking strategies:

by_page: Split by document pages
by_heading: Split by document structure
semantic: AI-powered semantic chunking
fixed_size: Fixed-length chunks
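The strategies above can be sketched as plain functions behind a dispatch table. This is an illustration under stated assumptions, not RuneExtract's implementation: pages are assumed to be separated by form-feed characters, headings are assumed to be Markdown-style, and `semantic` is omitted because it needs a model.

```python
import re
from typing import Callable, Dict, List


def by_page(text: str, size: int) -> List[str]:
    # Assumes pages are separated by form-feed characters; size is unused.
    return [page for page in text.split("\f") if page.strip()]


def by_heading(text: str, size: int) -> List[str]:
    # Start a new chunk at every Markdown-style heading line; size is unused.
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [part.strip() for part in parts if part.strip()]


def fixed_size(text: str, size: int) -> List[str]:
    # Cut the text every `size` characters, ignoring structure.
    return [text[i:i + size] for i in range(0, len(text), size)]


STRATEGIES: Dict[str, Callable[[str, int], List[str]]] = {
    "by_page": by_page,
    "by_heading": by_heading,
    "fixed_size": fixed_size,
}


def chunk(text: str, strategy: str = "fixed_size", size: int = 1000) -> List[str]:
    """Look up the strategy by name and apply it to the text."""
    if size <= 0:
        raise ValueError("size must be positive")
    try:
        splitter = STRATEGIES[strategy]
    except KeyError:
        raise ValueError(f"unknown chunking strategy: {strategy!r}") from None
    return splitter(text, size)


sections = chunk("# Intro\nhello\n## Details\nworld", strategy="by_heading")
```

Keeping each strategy behind the same signature means a new one can be added by registering a single function.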

Phase 5: Automatic Metadata

Extract rich metadata automatically:

{
    "title": "",
    "author": "",
    "created_at": "",
    "language": "",
    "keywords": []
}
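Source formats report metadata under different keys and types, so a normalization step is useful. The sketch below fills the five keys shown above with defaults and coerces values. `normalize_metadata` is a hypothetical helper, and accepting a comma-separated keyword string is an assumption.

```python
from typing import Any, Dict, Mapping


def normalize_metadata(raw: Mapping[str, Any]) -> Dict[str, Any]:
    """Return metadata with exactly the documented keys and default values."""
    meta: Dict[str, Any] = {
        "title": "",
        "author": "",
        "created_at": "",
        "language": "",
        "keywords": [],
    }
    for key in meta:
        value = raw.get(key)
        if value is None:
            continue  # keep the default for missing fields
        if key == "keywords":
            # Accept either a list or a comma-separated string.
            items = value.split(",") if isinstance(value, str) else value
            meta[key] = [str(item).strip() for item in items if str(item).strip()]
        else:
            meta[key] = str(value).strip()
    return meta


meta = normalize_metadata({"title": " Q3 Report ", "keywords": "sales, q3,,emea", "producer": "x"})
```

Keys outside the schema, such as `producer` in the sample input, are dropped so every document exposes the same metadata shape.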

Phase 6: YouTube Integration

Extract video content:

doc = extract("https://youtube.com/watch?v=...")
# Returns: transcript, timestamps, chapters, metadata

Phase 7: Notion Import

Import Notion exports:

doc = extract("notion_export.zip")
# Returns: pages, databases, content

Phase 8: CLI Tool

Command-line interface for quick extraction:

# Basic extraction
runeextract file.pdf

# Advanced options
runeextract file.pdf --chunks --ocr --tables --output document.json
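For readers curious how a command line like this maps onto options, here is a sketch using `argparse`. The flag names come from the example above. The placeholder result and the `main` function are assumptions, and a real CLI would call the extraction API where the comment indicates.

```python
import argparse
import json
from typing import Any, Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="runeextract")
    parser.add_argument("file", help="path or URL to extract")
    parser.add_argument("--chunks", action="store_true", help="include chunks")
    parser.add_argument("--ocr", action="store_true", help="enable OCR")
    parser.add_argument("--tables", action="store_true", help="extract tables")
    parser.add_argument("--output", metavar="PATH", default=None,
                        help="write JSON to PATH instead of stdout")
    return parser


def main(argv: Optional[List[str]] = None) -> Dict[str, Any]:
    args = build_parser().parse_args(argv)
    # Placeholder result: a real CLI would run the extraction here.
    result = {
        "source": args.file,
        "ocr": args.ocr,
        "tables": args.tables,
        "chunks": args.chunks,
    }
    payload = json.dumps(result, indent=2)
    if args.output:
        with open(args.output, "w", encoding="utf-8") as handle:
            handle.write(payload)
    else:
        print(payload)
    return result


options = main(["file.pdf", "--ocr", "--tables"])
```

Passing `argv` explicitly keeps the parser testable without touching `sys.argv`.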

Phase 9: Async Processing

For large files and batch processing:

from runeextract import extract_async

# await is only valid inside a coroutine, for example one started with asyncio.run()
doc = await extract_async("large.pdf")
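One way an async entry point can wrap blocking extraction code is to push the work onto a worker thread. The sketch below shows that pattern with the standard library only. `blocking_extract` is a hypothetical stand-in, and nothing here describes how `extract_async` is actually built. It needs Python 3.9 or later for `asyncio.to_thread`.

```python
import asyncio
import time
from typing import List


def blocking_extract(path: str) -> str:
    """Stand-in for a slow, synchronous extractor (hypothetical)."""
    time.sleep(0.05)
    return f"text from {path}"


async def extract_in_thread(path: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_extract, path)


async def extract_all(paths: List[str]) -> List[str]:
    # gather() returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(extract_in_thread(p) for p in paths))


results = asyncio.run(extract_all(["a.pdf", "b.docx", "c.html"]))
```

The three files are processed concurrently, so the total wait is close to one extraction rather than three.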

Phase 10: AI Features (Optional)

Enhanced analysis with AI:

pip install "runeextract[ai]"

doc = extract("report.pdf")
print(doc.summary())
print(doc.keywords())
print(doc.entities())
print(doc.questions())
print(doc.flashcards())

Plugin System

Extend RuneExtract with custom extractors:

from runeextract.core.registry import register_extractor

@register_extractor(".epub")
class EPUBExtractor:
    def extract(self, file_path):
        # Your extraction logic
        return Document(...)

Then use it seamlessly:

extract("book.epub")  # Works automatically

Project Structure

runeextract/
├── core/
│   ├── extractor.py      # Base extractor class
│   ├── registry.py       # Plugin registry
│   ├── router.py         # File type routing
│   └── schemas.py        # Data models
│
├── extractors/
│   ├── pdf/              # PDF extraction
│   ├── docx/             # DOCX extraction
│   ├── pptx/             # PPTX extraction
│   ├── xlsx/             # XLSX extraction
│   ├── html/             # HTML extraction
│   ├── markdown/         # Markdown extraction
│   ├── image/            # Image/OCR extraction
│   ├── audio/            # Audio extraction
│   ├── video/            # Video extraction
│   ├── youtube/          # YouTube extraction
│   └── notion/           # Notion extraction
│
├── processors/
│   ├── ocr.py            # OCR processing
│   ├── tables.py         # Table extraction
│   ├── chunking.py       # Content chunking
│   ├── metadata.py       # Metadata extraction
│   └── cleaning.py       # Text cleaning
│
├── models/
│   ├── document.py       # Document model
│   ├── table.py          # Table model
│   ├── image.py          # Image model
│   └── chunk.py          # Chunk model
│
├── cli/
│   └── main.py           # CLI interface
│
└── tests/

Architecture

File
 ↓
Router (detects file type)
 ↓
Appropriate Extractor
 ↓
Normalization Layer
 ↓
Document Object (unified schema)
 ↓
Return
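The flow above can be sketched end to end with stub extractors. Everything in this example is illustrative: the host list, the raw key names (`body`, `content`, `info`), and the stub outputs are assumptions chosen to show why a normalization layer is needed.

```python
from pathlib import Path
from typing import Any, Callable, Dict
from urllib.parse import urlparse

YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def route(source: str) -> str:
    """Detect a source type from a URL host or a file extension."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        return "youtube" if parsed.netloc.lower() in YOUTUBE_HOSTS else "html"
    return Path(source).suffix.lower().lstrip(".") or "unknown"


# Stub extractors that return differently shaped raw results on purpose.
EXTRACTORS: Dict[str, Callable[[str], Dict[str, Any]]] = {
    "pdf": lambda src: {"body": " PDF text ", "info": {"title": "Report"}},
    "md": lambda src: {"content": "# Notes"},
}


def normalize(raw: Dict[str, Any], source_type: str) -> Dict[str, Any]:
    """Map a raw extractor result onto the unified schema."""
    return {
        "text": (raw.get("body") or raw.get("content") or "").strip(),
        "tables": raw.get("tables", []),
        "images": raw.get("images", []),
        "metadata": dict(raw.get("info", {})),
        "source_type": source_type,
    }


def extract(source: str) -> Dict[str, Any]:
    source_type = route(source)
    extractor = EXTRACTORS.get(source_type)
    if extractor is None:
        raise ValueError(f"no extractor for source type {source_type!r}")
    return normalize(extractor(source), source_type)


document = extract("report.pdf")
```

Callers only ever see the normalized dict, so adding a format means adding one extractor and, if needed, one mapping rule.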

Dependencies

Core

pymupdf - PDF processing
pdfplumber - Advanced PDF table extraction
python-docx - DOCX processing
python-pptx - PPTX processing
openpyxl - XLSX processing
pandas - Data manipulation
beautifulsoup4 - HTML parsing
lxml - XML/HTML parsing
markdown-it-py - Markdown parsing

OCR (optional)

easyocr or rapidocr - Text recognition

YouTube (optional)

youtube-transcript-api - Transcript extraction
yt-dlp - Video metadata

AI Features (optional)

openai or similar - AI-powered analysis

Development Roadmap

v0.1 (MVP) — ✅ Current Release

✅ PDF extraction
✅ DOCX extraction
✅ PPTX extraction
✅ XLSX extraction
✅ HTML extraction
✅ Markdown extraction
✅ CLI interface
✅ Chunking strategies
✅ Plugin system

v0.2 (Planned)

⏳ OCR support (images and scanned PDFs)
⏳ YouTube integration
⏳ Notion import
⏳ Async processing
⏳ AI features

Contributing

Contributions are welcome! Please see our Contributing Guidelines for details.

Development Setup

# Clone the repository
git clone https://github.com/Rohithdgrr/RUNEEXTRACT-PACKAGE.git
cd RUNEEXTRACT-PACKAGE

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev,ocr,ai]"

# Run tests
pytest

# Run linting
black runeextract/
flake8 runeextract/

License

MIT License - see LICENSE for details.

Why RuneExtract?

The current ecosystem requires different libraries for different file types:

PyPDF           → PDF
python-docx     → DOCX
BeautifulSoup   → HTML
EasyOCR         → Images

RuneExtract unifies all of this:

extract(anything)

That simplicity is the entire product.

Acknowledgments

Built with inspiration from the document processing community and the need for a unified extraction API.

Contact

RuneExtract - One API to extract them all.

Release history

0.3.0 (Jun 14, 2026)
0.1.0 (Jun 14, 2026, this version)

Download files

Source Distribution

runeextract-0.1.0.tar.gz (39.7 kB)

Uploaded Jun 14, 2026 Source

Built Distribution

runeextract-0.1.0-py3-none-any.whl (32.1 kB)

Uploaded Jun 14, 2026 Python 3

File details

Details for the file runeextract-0.1.0.tar.gz.

File metadata

Download URL: runeextract-0.1.0.tar.gz
Upload date: Jun 14, 2026
Size: 39.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for runeextract-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`28522e39b64747923ca240b76473ab4f93666eed12658cade83331c717bc46b8`
MD5	`d252a1f9d18806247c9e97241488540e`
BLAKE2b-256	`d2a75f7b5c4fce28af70f41752029823b4a6f80ae716539b53664134b4907e42`

File details

Details for the file runeextract-0.1.0-py3-none-any.whl.

File metadata

Download URL: runeextract-0.1.0-py3-none-any.whl
Upload date: Jun 14, 2026
Size: 32.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for runeextract-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`db50ef0247307273e0dcc96e6ee6fea3d5c270cb576fa6e0b9e0e9d3f382d35e`
MD5	`ee8b48eeeba34b8989111db151660950`
BLAKE2b-256	`7d20eb3edd4c1c8616a694c1cd7dc0c344d0cebfe813a89ce5c95994e0cb388b`
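A downloaded file can be checked against its published SHA256 digest with the standard library. The sketch below is one way to do that. `sha256_of` and `matches` are hypothetical helper names, and the demo runs on a temporary file so it works without a download.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, block_size: int = 65536) -> str:
    """Hash a file in blocks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a temporary file; replace with a real download path in practice.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
demo_ok = matches(demo_path, hashlib.sha256(b"hello").hexdigest())
demo_bad = matches(demo_path, "0" * 64)
os.remove(demo_path)
```

To verify a real download, pass the file's path and the SHA256 value listed for that file.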
