RuneExtract
One extraction API for every document type.
RuneExtract is a universal document extraction library that provides a single, consistent API for extracting content from every supported source. Whether the input is a PDF, DOCX, HTML, an image, a YouTube video, or a Notion export, RuneExtract returns the same structured output every time.
from runeextract import extract
doc = extract("report.pdf")
print(doc.text)
print(doc.tables)
print(doc.images)
print(doc.metadata)
print(doc.chunks())
One API. Any file. Same output schema.
Installation
pip install runeextract
Optional extras:
| Extra | Packages | Feature |
|---|---|---|
| [ocr] | easyocr, Pillow | OCR for images and scanned PDFs |
| [ai] | openai | AI summarization, keywords, entities, Q&A, flashcards |
| [youtube] | youtube-transcript-api, yt-dlp | YouTube transcript and metadata |
| [notion] | requests, aiohttp | Notion page and database extraction |
| [epub] | EbookLib, beautifulsoup4 | EPUB e-book extraction |
| [async] | aiohttp | Async HTTP support for URL extractors |
| [all] | All of the above | Everything |
Quick Start
from runeextract import extract
doc = extract("report.pdf", ocr=True, tables=True, chunking_strategy="semantic")
print(doc.text)
print(doc.tables[0].to_dataframe())
Batch & Async
from runeextract import extract_many, extract_many_async
docs = extract_many(["a.pdf", "b.docx", "c.html"])
# Inside an async function (or a REPL that supports top-level await):
docs = await extract_many_async(["a.pdf", "b.docx", "c.html"], max_concurrency=4)
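To illustrate what a `max_concurrency` limit typically means, here is a minimal sketch using only the standard library. It is not RuneExtract's implementation: `fake_extract` and `extract_all` are hypothetical stand-ins, and the real `extract_many_async` may schedule work differently.

```python
import asyncio


async def fake_extract(path: str) -> str:
    # Hypothetical stand-in for a single-file extraction call.
    await asyncio.sleep(0.01)
    return f"text of {path}"


async def extract_all(paths, max_concurrency=4):
    # The semaphore lets at most `max_concurrency` extractions run at once.
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(path):
        async with sem:
            return await fake_extract(path)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(worker(p) for p in paths))


results = asyncio.run(extract_all(["a.pdf", "b.docx", "c.html"], max_concurrency=2))
```

A semaphore keeps memory and open file handles bounded while still overlapping I/O across files.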
CLI
runeextract file.pdf --ocr --chunking semantic --tree
runeextract "https://youtube.com/watch?v=..." --youtube-format transcript
runeextract scanned.pdf --ocr --output-dir ./output
runeextract ./docs --watch # Watch directory for new files
Supported Formats
| Format | Status | Content Extracted |
|---|---|---|
| PDF | ✅ v0.2.0 | text, tables, images, metadata, scanned-page OCR |
| DOCX | ✅ v0.2.0 | paragraphs, tables, images, metadata, image OCR |
| PPTX | ✅ v0.2.0 | slides, tables, images, metadata, speaker notes, image OCR |
| XLSX | ✅ v0.2.0 | worksheets, tables, metadata, multiple sheets |
| HTML | ✅ v0.2.0 | headings, paragraphs, tables, links, meta tags |
| Markdown | ✅ v0.2.0 | headings, lists, code blocks, tables, frontmatter |
| CSV | ✅ v0.2.0 | tables, text, row/column metadata |
| JSON | ✅ v0.2.0 | pretty-print, table from list-of-dicts |
| Images (PNG/JPG/TIFF/BMP/WebP) | ✅ v0.2.0 | metadata (width, height, format), OCR text |
| EPUB | ✅ v0.2.0 | text, tables, images, metadata (title, author, etc.) |
| YouTube | ✅ v0.2.0 | transcript, timestamps, chapters, metadata |
| Notion | ✅ v0.2.0 | pages, databases, 14 block types, async |
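The JSON row above mentions building a table from a list of dicts. One way that conversion could work is sketched below; `records_to_table` is a hypothetical helper written for illustration, and the column ordering and handling of missing keys are assumptions rather than documented RuneExtract behavior.

```python
import json


def records_to_table(records):
    # Collect column names in first-seen order across all records.
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)
    # Missing keys become None so every row has the same width.
    rows = [[rec.get(col) for col in columns] for rec in records]
    return columns, rows


data = json.loads('[{"id": 1, "name": "a"}, {"id": 2, "tag": "x"}]')
columns, rows = records_to_table(data)
```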
Features
Intelligent Chunking
chunks = doc.chunks(
    strategy="semantic",  # by_page, by_heading, semantic, fixed_size
    size=1000
)
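As a rough picture of the `fixed_size` strategy, the sketch below packs whole words into chunks of at most `size` characters. This is an assumption-laden illustration, not the library's chunker: `fixed_size_chunks` is a hypothetical name, and RuneExtract may measure `size` in tokens or split on different boundaries.

```python
def fixed_size_chunks(text: str, size: int = 1000):
    if size <= 0:
        raise ValueError("size must be positive")
    chunks, current, length = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk is non-empty.
        extra = len(word) + (1 if current else 0)
        if current and length + extra > size:
            # The word does not fit: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current, length = [], 0
            extra = len(word)
        # A single word longer than `size` is kept whole in its own chunk.
        current.append(word)
        length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = fixed_size_chunks("alpha beta gamma delta", size=11)
```

Splitting on word boundaries avoids cutting tokens in half, at the cost of chunks that are slightly shorter than `size`.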
AI Processing (optional)
doc = extract("report.pdf")
print(doc.summary())
print(doc.keywords())
print(doc.entities())
print(doc.questions())
print(doc.flashcards())
OCR (optional)
doc = extract("invoice.jpg", ocr=True, ocr_lang="en,fr")
Streaming
from runeextract import extract_stream
async for page_doc in extract_stream("large.pdf"):
    process(page_doc)
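The streaming pattern can be pictured as an async generator that yields one page at a time, so the caller never holds the whole document in memory. The sketch below uses hypothetical stand-ins (`stream_pages`, plain dicts instead of `Document` objects) and does not reflect how `extract_stream` is actually implemented.

```python
import asyncio


async def stream_pages(pages):
    # Hypothetical stand-in for a streaming extractor: yields one page at a time.
    for number, text in enumerate(pages, start=1):
        await asyncio.sleep(0)  # hand control back to the event loop between pages
        yield {"page": number, "text": text}


async def total_characters(pages):
    total = 0
    # Each page is processed and discarded before the next one is produced.
    async for page_doc in stream_pages(pages):
        total += len(page_doc["text"])
    return total


total = asyncio.run(total_characters(["first page", "second"]))
```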
Plugin System
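from runeextract.core.extractor import BaseExtractor
from runeextract.core.registry import register_extractor
from runeextract.models.document import Document

@register_extractor(".txt")
class TxtExtractor(BaseExtractor):
    def extract(self, file_path):
        # Close the file deterministically instead of leaking the handle.
        with open(file_path, encoding="utf-8") as f:
            return Document(text=f.read(), source_type="txt")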
Configuration
export RUNEEXTRACT_OCR=true
export RUNEEXTRACT_MAX_FILE_SIZE=999999999
Or create ~/.runeextract.json:
{"ocr": true, "tables": false, "max_file_size": 1000000}
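The README does not say which source wins when both the environment and the JSON file set the same key. The sketch below assumes environment variables override the file, which overrides built-in defaults. `DEFAULTS`, `load_config`, and the default values are hypothetical, so check the library's `config.py` for the actual precedence.

```python
import json
import os

# Hypothetical defaults, chosen only for this example.
DEFAULTS = {"ocr": False, "tables": True, "max_file_size": 100_000_000}


def _parse_env(value, default):
    # Convert the environment string to the type of the default value.
    # bool is checked first because bool is a subclass of int.
    if isinstance(default, bool):
        return value.strip().lower() in {"1", "true", "yes", "on"}
    return type(default)(value)


def load_config(json_text=None, environ=os.environ, prefix="RUNEEXTRACT_"):
    config = dict(DEFAULTS)
    if json_text:
        config.update(json.loads(json_text))  # file values override defaults
    for key, default in DEFAULTS.items():
        env_name = prefix + key.upper()
        if env_name in environ:  # assumed: environment overrides the file
            config[key] = _parse_env(environ[env_name], default)
    return config


config = load_config(
    json_text='{"ocr": true, "tables": false, "max_file_size": 1000000}',
    environ={"RUNEEXTRACT_MAX_FILE_SIZE": "999999999"},
)
```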
Caching
from runeextract.core.cache import ExtractionCache
cache = ExtractionCache(ttl=3600)
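A time-to-live cache for extraction results can be sketched as follows. This is not `ExtractionCache` itself: `TTLCache` and `key_for` are hypothetical, the sketch is in-memory only, and it assumes `ttl` is measured in seconds and that results are keyed by file content plus options.

```python
import hashlib
import time


class TTLCache:
    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    @staticmethod
    def key_for(content: bytes, options: dict) -> str:
        # Hash file bytes together with options so different settings do not collide.
        digest = hashlib.sha256(content)
        digest.update(repr(sorted(options.items())).encode("utf-8"))
        return digest.hexdigest()

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self.ttl:
            del self._entries[key]  # entry is stale: evict and report a miss
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (self._clock(), value)


now = [0.0]  # fake clock controlled by the example
cache = TTLCache(ttl=3600, clock=lambda: now[0])
key = TTLCache.key_for(b"file bytes", {"ocr": True})
cache.set(key, "extracted text")
fresh = cache.get(key)
now[0] = 3601.0
expired = cache.get(key)
```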
Project Structure
runeextract/
├── __init__.py # Public API (extract, extract_many, extract_async, etc.)
├── config.py # Configuration system (env, JSON, pyproject.toml)
├── exceptions.py # Custom exception hierarchy
├── cli/main.py # 14 CLI flags
├── core/
│ ├── extractor.py # BaseExtractor, StreamingExtractor
│ ├── router.py # ExtractorRouter (19 file extensions)
│ ├── registry.py # Plugin registry with entry-point discovery
│ ├── cache.py # diskcache/JSON cache layer
│ ├── schemas.py # ExtractionOptions, ExtractionResult
│ └── streaming.py # get_streaming_extractor, wrapped fallback
├── extractors/
│ ├── pdf/ # PDF (PyMuPDF + pdfplumber)
│ ├── docx/ # DOCX (python-docx)
│ ├── pptx/ # PPTX (python-pptx)
│ ├── xlsx/ # XLSX (openpyxl)
│ ├── html/ # HTML (BeautifulSoup)
│ ├── markdown/ # Markdown (markdown-it-py)
│ ├── csv/ # CSV (stdlib csv)
│ ├── json/ # JSON (stdlib json)
│ ├── image/ # Image (Pillow + easyocr)
│ ├── epub/ # EPUB (EbookLib)
│ ├── youtube/ # YouTube (youtube-transcript-api + yt-dlp)
│ └── notion/ # Notion (REST API + aiohttp)
├── processors/
│ ├── ocr.py # easyocr-based OCR
│ └── ai.py # OpenAI / local transformers AI processor
├── models/
│ └── document.py # Document, Table, Image, Chunk, ChunkingStrategy
└── tests/
├── 103 tests across 18 files
└── benchmarks/
Architecture
File/URL → Router (detects type) → Extractor → Document (unified schema)
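The "detects type" step in the pipeline above could look something like the sketch below, which checks the URL host first and falls back to the file extension. The host and extension tables are illustrative assumptions, and `detect_type` is a hypothetical name; the real `ExtractorRouter` handles 19 extensions and may use different rules.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Illustrative subsets only, not the library's actual tables.
URL_HOSTS = {"youtube.com": "youtube", "youtu.be": "youtube", "notion.so": "notion"}
EXTENSIONS = {".pdf": "pdf", ".docx": "docx", ".html": "html", ".md": "markdown",
              ".csv": "csv", ".json": "json"}


def detect_type(source: str) -> str:
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # hostname is already lowercased by urlparse; drop a leading "www.".
        host = (parsed.hostname or "").removeprefix("www.")
        if host in URL_HOSTS:
            return URL_HOSTS[host]
        path = parsed.path  # other URLs: fall back to the path's extension
    else:
        path = source
    suffix = PurePosixPath(path).suffix.lower()
    if suffix not in EXTENSIONS:
        raise ValueError(f"unsupported source: {source!r}")
    return EXTENSIONS[suffix]


kinds = [
    detect_type("report.PDF"),
    detect_type("https://www.youtube.com/watch?v=abc"),
    detect_type("https://example.com/files/a.docx"),
]
```

Once the type is known, the router can hand the source to the matching extractor, which returns the unified `Document`.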
Development
pip install -e ".[dev]"
pytest # 103 tests
License
MIT — see LICENSE.
Why RuneExtract?
Instead of learning PyMuPDF, python-docx, BeautifulSoup, EasyOCR, etc.:
extract(anything)
That simplicity is the entire product.
RuneExtract v0.2.0 — One API to extract them all.