Skip to main content

One extraction API for every document type

Project description

RuneExtract

One extraction API for every document type.

RuneExtract is a universal document extraction library that provides a single, consistent API for extracting content from any file type. Whether it's PDF, DOCX, HTML, images, YouTube videos, or Notion exports — RuneExtract returns the same structured output every time.

from runeextract import extract

doc = extract("report.pdf")

print(doc.text)
print(doc.tables)
print(doc.images)
print(doc.metadata)
print(doc.chunks())

One API. Any file. Same output schema.

Installation

pip install runeextract

Optional extras:

Extra Packages Feature
[ocr] easyocr, Pillow OCR for images and scanned PDFs
[ai] openai AI summarization, keywords, entities, Q&A, flashcards
[youtube] youtube-transcript-api, yt-dlp YouTube transcript and metadata
[notion] requests, aiohttp Notion page and database extraction
[epub] EbookLib, beautifulsoup4 EPUB e-book extraction
[async] aiohttp Async HTTP support for URL extractors
[all] All of the above Everything

Quick Start

from runeextract import extract

doc = extract("report.pdf", ocr=True, tables=True, chunking_strategy="semantic")
print(doc.text)
print(doc.tables[0].to_dataframe())

Batch & Async

from runeextract import extract_many, extract_many_async

docs = extract_many(["a.pdf", "b.docx", "c.html"])

docs = await extract_many_async(["a.pdf", "b.docx", "c.html"], max_concurrency=4)

CLI

runeextract file.pdf --ocr --chunking semantic --tree
runeextract https://youtube.com/watch?v=... --youtube-format transcript
runeextract scanned.pdf --ocr --output-dir ./output
runeextract ./docs --watch         # Watch directory for new files

Supported Formats

Format Status Content Extracted
PDF ✅ v0.2.0 text, tables, images, metadata, scanned-page OCR
DOCX ✅ v0.2.0 paragraphs, tables, images, metadata, image OCR
PPTX ✅ v0.2.0 slides, tables, images, metadata, speaker notes, image OCR
XLSX ✅ v0.2.0 worksheets, tables, metadata, multiple sheets
HTML ✅ v0.2.0 headings, paragraphs, tables, links, meta tags
Markdown ✅ v0.2.0 headings, lists, code blocks, tables, frontmatter
CSV ✅ v0.2.0 tables, text, row/column metadata
JSON ✅ v0.2.0 pretty-print, table from list-of-dicts
Images (PNG/JPG/TIFF/BMP/WebP) ✅ v0.2.0 metadata (width, height, format), OCR text
EPUB ✅ v0.2.0 text, tables, images, metadata (title, author, etc.)
YouTube ✅ v0.2.0 transcript, timestamps, chapters, metadata
Notion ✅ v0.2.0 pages, databases, 14 block types, async

Features

Intelligent Chunking

chunks = doc.chunks(
    strategy="semantic",  # by_page, by_heading, semantic, fixed_size
    size=1000
)

AI Processing (optional)

doc = extract("report.pdf")
print(doc.summary())
print(doc.keywords())
print(doc.entities())
print(doc.questions())
print(doc.flashcards())

OCR (optional)

doc = extract("invoice.jpg", ocr=True, ocr_lang="en,fr")

Streaming

from runeextract import extract_stream

async for page_doc in extract_stream("large.pdf"):
    process(page_doc)

Plugin System

from runeextract.core.registry import register_extractor

@register_extractor(".txt")
class TxtExtractor(BaseExtractor):
    def extract(self, file_path):
        return Document(text=open(file_path).read(), source_type="txt")

Configuration

export RUNEEXTRACT_OCR=true
export RUNEEXTRACT_MAX_FILE_SIZE=999999999

Or create ~/.runeextract.json:

{"ocr": true, "tables": false, "max_file_size": 1000000}

Caching

from runeextract.core.cache import ExtractionCache
cache = ExtractionCache(ttl=3600)

Project Structure

runeextract/
├── __init__.py          # Public API (extract, extract_many, extract_async, etc.)
├── config.py            # Configuration system (env, JSON, pyproject.toml)
├── exceptions.py        # Custom exception hierarchy
├── cli/main.py          # 14 CLI flags
├── core/
│   ├── extractor.py     # BaseExtractor, StreamingExtractor
│   ├── router.py        # ExtractorRouter (19 file extensions)
│   ├── registry.py      # Plugin registry with entry-point discovery
│   ├── cache.py         # diskcache/JSON cache layer
│   ├── schemas.py       # ExtractionOptions, ExtractionResult
│   └── streaming.py     # get_streaming_extractor, wrapped fallback
├── extractors/
│   ├── pdf/             # PDF (PyMuPDF + pdfplumber)
│   ├── docx/            # DOCX (python-docx)
│   ├── pptx/            # PPTX (python-pptx)
│   ├── xlsx/            # XLSX (openpyxl)
│   ├── html/            # HTML (BeautifulSoup)
│   ├── markdown/        # Markdown (markdown-it-py)
│   ├── csv/             # CSV (stdlib csv)
│   ├── json/            # JSON (stdlib json)
│   ├── image/           # Image (Pillow + easyocr)
│   ├── epub/            # EPUB (EbookLib)
│   ├── youtube/         # YouTube (youtube-transcript-api + yt-dlp)
│   └── notion/          # Notion (REST API + aiohttp)
├── processors/
│   ├── ocr.py           # easyocr-based OCR
│   └── ai.py            # OpenAI / local transformers AI processor
├── models/
│   └── document.py      # Document, Table, Image, Chunk, ChunkingStrategy
└── tests/
    ├── 103 tests across 18 files
    └── benchmarks/

Architecture

File/URL → Router (detects type) → Extractor → Document (unified schema)

Development

pip install -e ".[dev]"
pytest                     # 103 tests

License

MIT — see LICENSE.

Why RuneExtract?

Instead of learning PyMuPDF, python-docx, BeautifulSoup, EasyOCR, etc.:

extract(anything)

That simplicity is the entire product.


RuneExtract v0.2.0 — One API to extract them all.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

runeextract-0.3.0.tar.gz (70.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

runeextract-0.3.0-py3-none-any.whl (83.8 kB view details)

Uploaded Python 3

File details

Details for the file runeextract-0.3.0.tar.gz.

File metadata

  • Download URL: runeextract-0.3.0.tar.gz
  • Upload date:
  • Size: 70.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for runeextract-0.3.0.tar.gz
Algorithm Hash digest
SHA256 322775363a6c7a835058d9e0d8303d419b8ae4375f103600be6328df1416de53
MD5 88e46af752365c10bac3b84c3985f78d
BLAKE2b-256 81f39d5c925615602245f162211205f3e0678c28836f0252cd19f6b1f90a8fe3

See more details on using hashes here.

File details

Details for the file runeextract-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: runeextract-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 83.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for runeextract-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9b94af489cf1b5d9a98a53938a4755afb892c3914173f83b49474d9969a2910d
MD5 9635584df7c5bc30a3f004084a027bbc
BLAKE2b-256 c1d65bfc3037f2485c286e9e4f2b65090e656458066e75f7836ba278d5bcd734

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page