One extraction API for every document type

These details have not been verified by PyPI

Project links

Project description

RuneExtract

One extraction API for every document type.

RuneExtract is a universal document extraction library that provides a single, consistent API for extracting content from any file type. Whether it's PDF, DOCX, HTML, images, YouTube videos, or Notion exports — RuneExtract returns the same structured output every time.

from runeextract import extract

doc = extract("report.pdf")

print(doc.text)
print(doc.tables)
print(doc.images)
print(doc.metadata)
print(doc.chunks())

One API. Any file. Same output schema.

Installation

pip install runeextract

Optional extras:

Extra	Packages	Feature
`[ocr]`	easyocr, Pillow	OCR for images and scanned PDFs
`[ai]`	openai	AI summarization, keywords, entities, Q&A, flashcards
`[youtube]`	youtube-transcript-api, yt-dlp	YouTube transcript and metadata
`[notion]`	requests, aiohttp	Notion page and database extraction
`[epub]`	EbookLib, beautifulsoup4	EPUB e-book extraction
`[async]`	aiohttp	Async HTTP support for URL extractors
`[all]`	All of the above	Everything

Quick Start

from runeextract import extract

doc = extract("report.pdf", ocr=True, tables=True, chunking_strategy="semantic")
print(doc.text)
print(doc.tables[0].to_dataframe())

Batch & Async

from runeextract import extract_many, extract_many_async

docs = extract_many(["a.pdf", "b.docx", "c.html"])

docs = await extract_many_async(["a.pdf", "b.docx", "c.html"], max_concurrency=4)

CLI

runeextract file.pdf --ocr --chunking semantic --tree
runeextract https://youtube.com/watch?v=... --youtube-format transcript
runeextract scanned.pdf --ocr --output-dir ./output
runeextract ./docs --watch         # Watch directory for new files

Supported Formats

Format	Status	Content Extracted
PDF	✅ v0.2.0	text, tables, images, metadata, scanned-page OCR
DOCX	✅ v0.2.0	paragraphs, tables, images, metadata, image OCR
PPTX	✅ v0.2.0	slides, tables, images, metadata, speaker notes, image OCR
XLSX	✅ v0.2.0	worksheets, tables, metadata, multiple sheets
HTML	✅ v0.2.0	headings, paragraphs, tables, links, meta tags
Markdown	✅ v0.2.0	headings, lists, code blocks, tables, frontmatter
CSV	✅ v0.2.0	tables, text, row/column metadata
JSON	✅ v0.2.0	pretty-print, table from list-of-dicts
Images (PNG/JPG/TIFF/BMP/WebP)	✅ v0.2.0	metadata (width, height, format), OCR text
EPUB	✅ v0.2.0	text, tables, images, metadata (title, author, etc.)
YouTube	✅ v0.2.0	transcript, timestamps, chapters, metadata
Notion	✅ v0.2.0	pages, databases, 14 block types, async

Features

Intelligent Chunking

chunks = doc.chunks(
    strategy="semantic",  # by_page, by_heading, semantic, fixed_size
    size=1000
)

AI Processing (optional)

doc = extract("report.pdf")
print(doc.summary())
print(doc.keywords())
print(doc.entities())
print(doc.questions())
print(doc.flashcards())

OCR (optional)

doc = extract("invoice.jpg", ocr=True, ocr_lang="en,fr")

Streaming

from runeextract import extract_stream

async for page_doc in extract_stream("large.pdf"):
    process(page_doc)

Plugin System

from runeextract.core.registry import register_extractor

@register_extractor(".txt")
class TxtExtractor(BaseExtractor):
    def extract(self, file_path):
        return Document(text=open(file_path).read(), source_type="txt")

Configuration

export RUNEEXTRACT_OCR=true
export RUNEEXTRACT_MAX_FILE_SIZE=999999999

Or create ~/.runeextract.json:

{"ocr": true, "tables": false, "max_file_size": 1000000}

Caching

from runeextract.core.cache import ExtractionCache
cache = ExtractionCache(ttl=3600)

Project Structure

runeextract/
├── __init__.py          # Public API (extract, extract_many, extract_async, etc.)
├── config.py            # Configuration system (env, JSON, pyproject.toml)
├── exceptions.py        # Custom exception hierarchy
├── cli/main.py          # 14 CLI flags
├── core/
│   ├── extractor.py     # BaseExtractor, StreamingExtractor
│   ├── router.py        # ExtractorRouter (19 file extensions)
│   ├── registry.py      # Plugin registry with entry-point discovery
│   ├── cache.py         # diskcache/JSON cache layer
│   ├── schemas.py       # ExtractionOptions, ExtractionResult
│   └── streaming.py     # get_streaming_extractor, wrapped fallback
├── extractors/
│   ├── pdf/             # PDF (PyMuPDF + pdfplumber)
│   ├── docx/            # DOCX (python-docx)
│   ├── pptx/            # PPTX (python-pptx)
│   ├── xlsx/            # XLSX (openpyxl)
│   ├── html/            # HTML (BeautifulSoup)
│   ├── markdown/        # Markdown (markdown-it-py)
│   ├── csv/             # CSV (stdlib csv)
│   ├── json/            # JSON (stdlib json)
│   ├── image/           # Image (Pillow + easyocr)
│   ├── epub/            # EPUB (EbookLib)
│   ├── youtube/         # YouTube (youtube-transcript-api + yt-dlp)
│   └── notion/          # Notion (REST API + aiohttp)
├── processors/
│   ├── ocr.py           # easyocr-based OCR
│   └── ai.py            # OpenAI / local transformers AI processor
├── models/
│   └── document.py      # Document, Table, Image, Chunk, ChunkingStrategy
└── tests/
    ├── 103 tests across 18 files
    └── benchmarks/

Architecture

File/URL → Router (detects type) → Extractor → Document (unified schema)

Development

pip install -e ".[dev]"
pytest                     # 103 tests

License

MIT — see LICENSE.

Why RuneExtract?

Instead of learning PyMuPDF, python-docx, BeautifulSoup, EasyOCR, etc.:

extract(anything)

That simplicity is the entire product.

RuneExtract v0.2.0 — One API to extract them all.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.3.0

Jun 14, 2026

0.1.0

Jun 14, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

runeextract-0.3.0.tar.gz (70.8 kB view details)

Uploaded Jun 14, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

runeextract-0.3.0-py3-none-any.whl (83.8 kB view details)

Uploaded Jun 14, 2026 Python 3

File details

Details for the file runeextract-0.3.0.tar.gz.

File metadata

Download URL: runeextract-0.3.0.tar.gz
Upload date: Jun 14, 2026
Size: 70.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for runeextract-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`322775363a6c7a835058d9e0d8303d419b8ae4375f103600be6328df1416de53`
MD5	`88e46af752365c10bac3b84c3985f78d`
BLAKE2b-256	`81f39d5c925615602245f162211205f3e0678c28836f0252cd19f6b1f90a8fe3`

See more details on using hashes here.

File details

Details for the file runeextract-0.3.0-py3-none-any.whl.

File metadata

Download URL: runeextract-0.3.0-py3-none-any.whl
Upload date: Jun 14, 2026
Size: 83.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for runeextract-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9b94af489cf1b5d9a98a53938a4755afb892c3914173f83b49474d9969a2910d`
MD5	`9635584df7c5bc30a3f004084a027bbc`
BLAKE2b-256	`c1d65bfc3037f2485c286e9e4f2b65090e656458066e75f7836ba278d5bcd734`

See more details on using hashes here.

runeextract 0.3.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

RuneExtract

Installation

Quick Start

Batch & Async

CLI

Supported Formats

Features

Intelligent Chunking

AI Processing (optional)

OCR (optional)

Streaming

Plugin System

Configuration

Caching

Project Structure

Architecture

Development

License

Why RuneExtract?

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes