RuneExtract
One extraction API for every document type.
RuneExtract is a universal document extraction library that provides a single, consistent API for extracting content from every supported source. Whether the input is a PDF, a DOCX file, an HTML page, an image, a YouTube video, or a Notion export, RuneExtract returns the same structured output every time.
Vision
from runeextract import extract
data = extract("report.pdf")
print(data.text)
print(data.tables)
print(data.images)
print(data.metadata)
print(data.chunks())
One API. Any file. Same output schema.
Installation
pip install runeextract
For OCR support:
pip install "runeextract[ocr]"
For AI features:
pip install "runeextract[ai]"
Quick Start
Basic Usage
from runeextract import extract
# Extract from any file type
doc = extract("report.pdf")
# Access extracted content
print(doc.text) # Full text content
print(doc.tables) # List of tables
print(doc.images) # List of images
print(doc.metadata) # Document metadata
print(doc.chunks()) # Chunked content for RAG
With Options
doc = extract(
    "report.pdf",
    ocr=True,
    tables=True,
    chunking_strategy="semantic"
)
Batch Processing
from runeextract import extract_many
docs = extract_many([
    "a.pdf",
    "b.docx",
    "c.html"
])
Universal Schema
All extractors return the same Document schema:
class Document:
    text: str            # Full text content
    tables: List[Table]  # Extracted tables
    images: List[Image]  # Extracted images
    metadata: dict       # Document metadata
    chunks: List[Chunk]  # Chunked content, as returned by doc.chunks()
    source_type: str     # File type identifier
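As a minimal sketch, the schema could be written with Python dataclasses as shown here. The field names come from the schema above; the `Chunk` fields, the default values, and the use of `Any` in place of `Table` and `Image` are assumptions for illustration, not RuneExtract's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Chunk:
    # Hypothetical minimal chunk: the text plus its position in the document.
    text: str
    index: int


@dataclass
class Document:
    # Field names follow the schema above. Table and Image are typed as Any
    # because their definitions are outside the scope of this sketch.
    text: str = ""
    tables: List[Any] = field(default_factory=list)
    images: List[Any] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)
    chunks: List[Chunk] = field(default_factory=list)
    source_type: str = "unknown"


# Two documents from different sources expose the same attributes.
pdf_doc = Document(text="Quarterly report", source_type="pdf")
html_doc = Document(text="Landing page copy", source_type="html")
same_shape = vars(pdf_doc).keys() == vars(html_doc).keys()
```

Using `default_factory` gives every document its own empty lists, so appending a table to one document never leaks into another.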
Supported File Types
| Format | Status | Extracted Content |
|---|---|---|
| PDF | ✅ MVP | text, tables, images, metadata |
| DOCX | ✅ MVP | paragraphs, tables, images, headers, footers |
| PPTX | ✅ MVP | slides, speaker notes, images |
| XLSX | ✅ MVP | worksheets, tables, formulas |
| HTML | ✅ MVP | headings, paragraphs, tables, links |
| Markdown | ✅ MVP | headings, lists, code blocks, tables |
| Images | ⏳ v0.2 | text (OCR), bounding boxes, confidence |
| Scanned PDFs | ⏳ v0.2 | text via OCR |
| YouTube | ⏳ v0.3 | transcript, timestamps, chapters, metadata |
| Notion | ⏳ v0.3 | pages, databases, content |
Features
Phase 1: Core Extractors (MVP)
- PDF: Extract text, tables, images, and metadata using PyMuPDF and pdfplumber
- DOCX: Extract paragraphs, tables, images, headers, and footers
- PPTX: Extract slides, text, tables, and images using python-pptx
- XLSX: Extract worksheets, tables, and metadata using openpyxl
- HTML: Parse headings, paragraphs, tables, and links with BeautifulSoup
- Markdown: Extract headings, lists, code blocks, and tables
Phase 2: OCR Support
Extract text from images and scanned documents:
doc = extract("invoice.jpg", ocr=True)
# Returns: text, bounding boxes, confidence scores
Supports:
- Images (JPG, PNG, etc.)
- Scanned PDFs (automatic detection and OCR processing)
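How scanned PDFs are detected is not described here. One common heuristic, sketched below, is to look at how much text the regular text layer yields per page and fall back to OCR when most pages are nearly empty. The function name and both thresholds are assumptions for illustration.

```python
from typing import List


def looks_scanned(page_texts: List[str], min_chars: int = 20, ratio: float = 0.8) -> bool:
    """Guess whether a PDF is scanned from its per-page text layer.

    A page counts as empty when it has fewer than min_chars non-whitespace
    characters. The document is treated as scanned when at least `ratio` of
    its pages are empty. Both thresholds are illustrative defaults.
    """
    if not page_texts:
        return False
    empty = sum(1 for t in page_texts if len("".join(t.split())) < min_chars)
    return empty / len(page_texts) >= ratio
```

In practice `page_texts` would come from whatever PDF library produced the text layer, one string per page.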
Phase 3: Advanced Table Extraction
Unified table extraction across formats:
class Table:
    rows: List[List[str]]
    columns: List[str]
    dataframe: pd.DataFrame
Supported for: PDF, DOCX, HTML, XLSX
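To illustrate how raw cells from different formats might be brought into this shape, here is a small sketch. The helper `table_from_raw`, the choice to treat the first row as the header, and modelling `dataframe` as a lazily computed property are all assumptions, not the library's confirmed behavior.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Table:
    rows: List[List[str]]
    columns: List[str]

    @property
    def dataframe(self):
        # pandas is imported lazily so the class works without it installed.
        import pandas as pd
        return pd.DataFrame(self.rows, columns=self.columns)


def table_from_raw(raw: Sequence[Sequence[object]]) -> Table:
    """Build a Table from raw cell rows, treating the first row as the header.

    Short rows are padded with empty strings and long rows are truncated so
    every row has exactly one cell per column.
    """
    if not raw:
        return Table(rows=[], columns=[])
    columns = ["" if c is None else str(c) for c in raw[0]]
    width = len(columns)
    rows = []
    for r in raw[1:]:
        cells = ["" if c is None else str(c) for c in r][:width]
        rows.append(cells + [""] * (width - len(cells)))
    return Table(rows=rows, columns=columns)


table = table_from_raw([["item", "qty"], ["apples", 3], ["pears"]])
```

Padding ragged rows up front means the `dataframe` property can build a DataFrame without shape errors.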
Phase 4: Intelligent Chunking
Optimize content for RAG applications:
chunks = doc.chunks(
    strategy="semantic",  # by_page, by_heading, semantic, fixed_size
    size=1000
)
Chunking strategies:
- by_page: Split by document pages
- by_heading: Split by document structure
- semantic: AI-powered semantic chunking
- fixed_size: Fixed-length chunks
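To make two of these strategies concrete, the sketch below shows one plausible way `fixed_size` and `by_heading` chunking could work on plain text. The function names and the assumption that headings are Markdown-style `#` lines are illustrative; RuneExtract's own chunker may behave differently.

```python
import re
from typing import List


def chunk_fixed_size(text: str, size: int = 1000) -> List[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_by_heading(markdown_text: str) -> List[str]:
    """Split Markdown-style text so each chunk starts at a heading line."""
    chunks: List[str] = []
    current: List[str] = []
    for line in markdown_text.splitlines():
        # A new heading closes the chunk collected so far.
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


sample = "# Intro\nHello.\n\n## Details\nMore text."
heading_chunks = chunk_by_heading(sample)
fixed_chunks = chunk_fixed_size("abcdefghij", size=4)
```

Fixed-size chunks are predictable in length but can cut sentences in half, while heading-based chunks keep each section together at the cost of uneven sizes.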
Phase 5: Automatic Metadata
Extract rich metadata automatically:
{
    "title": "",
    "author": "",
    "created_at": "",
    "language": "",
    "keywords": []
}
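Different formats report metadata under different key names and types, so some normalization step is needed to arrive at the shape above. The following is a sketch under stated assumptions: the function `normalize_metadata`, the case-insensitive key matching, and the comma-splitting of keywords are hypothetical choices, not documented library behavior.

```python
from typing import Any, Dict, Mapping

METADATA_DEFAULTS: Dict[str, Any] = {
    "title": "",
    "author": "",
    "created_at": "",
    "language": "",
    "keywords": [],
}


def normalize_metadata(raw: Mapping[str, Any]) -> Dict[str, Any]:
    """Return metadata with exactly the keys shown above.

    Keys are matched case-insensitively, missing keys get empty defaults, and
    a comma-separated keyword string is split into a list.
    """
    lowered = {str(k).lower(): v for k, v in raw.items()}
    result: Dict[str, Any] = {}
    for key, default in METADATA_DEFAULTS.items():
        value = lowered.get(key, default)
        if key == "keywords":
            if isinstance(value, str):
                value = [k.strip() for k in value.split(",") if k.strip()]
            else:
                value = list(value or [])
        else:
            value = "" if value is None else str(value)
        result[key] = value
    return result


meta = normalize_metadata({"Title": "Report", "Keywords": "a, b,,c"})
```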
Phase 6: YouTube Integration
Extract video content:
doc = extract("https://youtube.com/watch?v=...")
# Returns: transcript, timestamps, chapters, metadata
Phase 7: Notion Import
Import Notion exports:
doc = extract("notion_export.zip")
# Returns: pages, databases, content
Phase 8: CLI Tool
Command-line interface for quick extraction:
# Basic extraction
runeextract file.pdf
# Advanced options
runeextract file.pdf --chunks --ocr --tables --output document.json
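For readers curious how such a command line could be wired up, here is a minimal `argparse` sketch that accepts the flags shown above. The help texts and the `build_parser` and `parse_args` helpers are assumptions; the actual CLI in `cli/main.py` may be built differently.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="runeextract")
    parser.add_argument("file", help="path or URL to extract from")
    parser.add_argument("--chunks", action="store_true", help="include chunks in the output")
    parser.add_argument("--ocr", action="store_true", help="enable OCR")
    parser.add_argument("--tables", action="store_true", help="include tables")
    parser.add_argument("--output", default=None, help="write JSON here instead of stdout")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    # Passing None makes argparse read sys.argv, which is what a real CLI wants.
    return build_parser().parse_args(argv)


args = parse_args(["file.pdf", "--chunks", "--ocr", "--output", "document.json"])
```

Each boolean flag defaults to `False`, so a bare `runeextract file.pdf` would map to a plain extraction with no optional processing.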
Phase 9: Async Processing
For large files and batch processing:
import asyncio
from runeextract import extract_async
# Outside an async function, run the coroutine with asyncio.run
doc = asyncio.run(extract_async("large.pdf"))
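For batch work, an async extractor is usually combined with a concurrency limit so that many large files are not opened at once. The sketch below shows that pattern with `asyncio.gather` and a semaphore. `extract_all` and `fake_extract` are hypothetical names; `fake_extract` stands in for `extract_async` so the example runs without RuneExtract installed.

```python
import asyncio
from typing import Awaitable, Callable, List


async def extract_all(
    paths: List[str],
    extract_one: Callable[[str], Awaitable[object]],
    max_concurrent: int = 4,
) -> List[object]:
    """Run extract_one over paths with at most max_concurrent in flight.

    Results come back in the same order as paths. A failure for one file is
    returned as the exception object instead of cancelling the whole batch.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(path: str) -> object:
        async with semaphore:
            return await extract_one(path)

    return await asyncio.gather(*(guarded(p) for p in paths), return_exceptions=True)


async def fake_extract(path: str) -> str:
    # Stand-in for extract_async so the sketch runs without RuneExtract.
    await asyncio.sleep(0)
    if path.endswith(".bad"):
        raise ValueError(f"cannot read {path}")
    return f"text of {path}"


results = asyncio.run(extract_all(["a.pdf", "b.bad", "c.html"], fake_extract))
```

With `return_exceptions=True`, the caller can inspect each entry and decide whether to retry or skip a failed file.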
Phase 10: AI Features (Optional)
Enhanced analysis with AI:
pip install "runeextract[ai]"
doc = extract("report.pdf")
print(doc.summary())
print(doc.keywords())
print(doc.entities())
print(doc.questions())
print(doc.flashcards())
Plugin System
Extend RuneExtract with custom extractors:
from runeextract.core.registry import register_extractor
from runeextract.models.document import Document  # import path assumed from the project structure

@register_extractor(".epub")
class EPUBExtractor:
    def extract(self, file_path):
        # Your extraction logic
        return Document(...)
Then use it seamlessly:
extract("book.epub") # Works automatically
Project Structure
runeextract/
├── core/
│   ├── extractor.py     # Base extractor class
│   ├── registry.py      # Plugin registry
│   ├── router.py        # File type routing
│   └── schemas.py       # Data models
│
├── extractors/
│   ├── pdf/             # PDF extraction
│   ├── docx/            # DOCX extraction
│   ├── pptx/            # PPTX extraction
│   ├── xlsx/            # XLSX extraction
│   ├── html/            # HTML extraction
│   ├── markdown/        # Markdown extraction
│   ├── image/           # Image/OCR extraction
│   ├── audio/           # Audio extraction
│   ├── video/           # Video extraction
│   ├── youtube/         # YouTube extraction
│   └── notion/          # Notion extraction
│
├── processors/
│   ├── ocr.py           # OCR processing
│   ├── tables.py        # Table extraction
│   ├── chunking.py      # Content chunking
│   ├── metadata.py      # Metadata extraction
│   └── cleaning.py      # Text cleaning
│
├── models/
│   ├── document.py      # Document model
│   ├── table.py         # Table model
│   ├── image.py         # Image model
│   └── chunk.py         # Chunk model
│
├── cli/
│   └── main.py          # CLI interface
│
└── tests/
Architecture
File
↓
Router (detects file type)
↓
Appropriate Extractor
↓
Normalization Layer
↓
Document Object (unified schema)
↓
Return
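The flow above can be sketched in a few functions: detect the source type, call the matching extractor, then normalize the result. Everything named here (`detect_source_type`, `normalize`, `run_pipeline`, the suffix table, and the rule that other HTTP URLs are treated as HTML) is an assumption for illustration, and plain dictionaries stand in for the Document object.

```python
from pathlib import Path
from typing import Callable, Dict
from urllib.parse import urlparse

SUFFIX_TYPES = {".pdf": "pdf", ".docx": "docx", ".html": "html", ".htm": "html", ".md": "markdown"}
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be"}


def detect_source_type(source: str) -> str:
    """Router step: classify a source by URL host or file suffix."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        return "youtube" if parsed.hostname in YOUTUBE_HOSTS else "html"
    return SUFFIX_TYPES.get(Path(source).suffix.lower(), "unknown")


def normalize(raw: dict, source_type: str) -> dict:
    """Normalization step: fill in every schema key, whatever the extractor returned."""
    return {
        "text": raw.get("text", "").strip(),
        "tables": raw.get("tables", []),
        "images": raw.get("images", []),
        "metadata": raw.get("metadata", {}),
        "source_type": source_type,
    }


def run_pipeline(source: str, extractors: Dict[str, Callable[[str], dict]]) -> dict:
    source_type = detect_source_type(source)
    if source_type not in extractors:
        raise ValueError(f"unsupported source type: {source_type}")
    return normalize(extractors[source_type](source), source_type)


# Stub extractors standing in for the real ones.
stubs = {"pdf": lambda s: {"text": "  pdf text \n"}, "youtube": lambda s: {"text": "transcript"}}
pdf_result = run_pipeline("report.PDF", stubs)
video_result = run_pipeline("https://youtube.com/watch?v=abc123", stubs)
```

Keeping normalization as its own step is what lets each extractor return only what it found while callers still see one consistent shape.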
Dependencies
Core
- pymupdf - PDF processing
- pdfplumber - Advanced PDF table extraction
- python-docx - DOCX processing
- python-pptx - PPTX processing
- openpyxl - XLSX processing
- pandas - Data manipulation
- beautifulsoup4 - HTML parsing
- lxml - XML/HTML parsing
- markdown-it-py - Markdown parsing
OCR (optional)
- easyocr or rapidocr - Text recognition
YouTube (optional)
- youtube-transcript-api - Transcript extraction
- yt-dlp - Video metadata
AI Features (optional)
- openai or similar - AI-powered analysis
Development Roadmap
v0.1 (MVP) — ✅ Current Release
- ✅ PDF extraction
- ✅ DOCX extraction
- ✅ PPTX extraction
- ✅ XLSX extraction
- ✅ HTML extraction
- ✅ Markdown extraction
- ✅ CLI interface
- ✅ Chunking strategies
- ✅ Plugin system
v0.2 (Planned)
- ⏳ OCR support (images and scanned PDFs)
- ⏳ YouTube integration
- ⏳ Notion import
- ⏳ Async processing
- ⏳ AI features
Contributing
Contributions are welcome! Please see our Contributing Guidelines for details.
Development Setup
# Clone the repository
git clone https://github.com/Rohithdgrr/RUNEEXTRACT-PACKAGE.git
cd RUNEEXTRACT-PACKAGE
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev,ocr,ai]"
# Run tests
pytest
# Run linting
black runeextract/
flake8 runeextract/
License
MIT License - see LICENSE for details.
Why RuneExtract?
The current ecosystem requires different libraries for different file types:
PyPDF → PDF
python-docx → DOCX
BeautifulSoup → HTML
EasyOCR → Images
RuneExtract unifies all of this:
extract(anything)
That simplicity is the entire product.
Acknowledgments
Built with inspiration from the document processing community and the need for a unified extraction API.
Contact
- GitHub: Rohithdgrr/RUNEEXTRACT-PACKAGE
- Issues: GitHub Issues
- Discussions: GitHub Discussions
RuneExtract - One API to extract them all.