extracta
Agentic Document Extraction CLI -- layout-aware, context-threaded extraction for PDF, PPTX, DOCX, and HTML.
What is Extracta?
Extracta is a CLI tool that sends your documents to the Extracta server for agentic extraction. Beyond pulling raw text, it:
- Detects layout per page (single column, multi-column, mixed, table-heavy, image-heavy)
- Determines the correct reading order (H-Major or V-Major) using projection profile analysis and Recursive XY-Cut
- Threads context between segments using an LLM agent (ADE -- Agentic Document Extraction)
- Outputs structured JSON with full context metadata per region
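The projection-profile idea mentioned above can be illustrated with a small sketch. This is not Extracta's implementation: the function names, the integer page grid, and the `min_gap` threshold are assumptions made for illustration. It projects text boxes onto the x-axis and counts the occupied runs separated by empty gaps; more than one run hints at a multi-column page.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def x_projection(boxes: List[Box], page_width: int) -> List[int]:
    """Count how many boxes cover each integer x position on the page."""
    profile = [0] * page_width
    for x0, _, x1, _ in boxes:
        for x in range(max(0, int(x0)), min(page_width, int(x1))):
            profile[x] += 1
    return profile

def count_columns(profile: List[int], min_gap: int = 10) -> int:
    """Count occupied runs separated by empty gaps of at least min_gap units."""
    columns = 0
    gap = min_gap  # treat the left page edge as a wide gap
    for value in profile:
        if value > 0:
            if gap >= min_gap:
                columns += 1  # a new run starts after a wide enough gap
            gap = 0
        else:
            gap += 1
    return columns

# Two hypothetical text columns on a 600-unit-wide page.
two_columns = [(50, 40, 280, 700), (310, 40, 540, 700)]
print(count_columns(x_projection(two_columns, 600)))
```

A real detector would also need the y-axis profile and some handling of full-width titles, which this sketch leaves out.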
Installation
pip install extracta
Usage
extracta-extract path/to/your/document.pdf
Output is saved in the same directory as the input file, named after it with an _extracted.json suffix (here, document_extracted.json).
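If a script needs to locate the output file, the naming rule can be sketched as follows. The rule is inferred from the two examples in this README (document.pdf and report.pdf); `extracted_path` is a hypothetical helper, not part of the extracta package.

```python
from pathlib import Path

def extracted_path(input_path: str) -> Path:
    """Return <input stem>_extracted.json in the same directory as the input."""
    source = Path(input_path)
    return source.with_name(f"{source.stem}_extracted.json")

print(extracted_path("reports/report.pdf"))
```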
Supported Formats
| Format | Extension |
|---|---|
| PDF | .pdf |
| PowerPoint | .pptx |
| Word | .docx |
| HTML | .html / .htm |
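A script that batches files may want to skip unsupported inputs before calling the CLI. The sketch below maps the extensions in the table to a format label. Only the "pdf" label appears in this README's example output; the other labels and the `detect_format` helper are assumptions.

```python
from pathlib import Path
from typing import Optional

# Extension -> format label. "pdf" matches the "format" field in the example
# output; the remaining labels are assumed for illustration.
SUPPORTED = {
    ".pdf": "pdf",
    ".pptx": "pptx",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
}

def detect_format(path: str) -> Optional[str]:
    """Return the format label for a supported file, or None otherwise."""
    return SUPPORTED.get(Path(path).suffix.lower())

print(detect_format("slides.pptx"))
```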
Example Output
{
  "file": "report.pdf",
  "format": "pdf",
  "total_pages": 3,
  "pages": [
    {
      "page_number": 1,
      "layout_type": "multi_col",
      "strategy": "v_major",
      "regions": [
        {
          "region_id": "p1_r1",
          "type": "title",
          "text": "Efficacy in Treatment-Naive Patients",
          "bbox": { "x0": 50.0, "y0": 40.0, "x1": 540.0, "y1": 65.0 },
          "sequence": 1,
          "context_thread_id": "thread_001",
          "context_role": "heading",
          "continues_on_page": null,
          "references_region": null
        }
      ],
      "full_text": "Efficacy in Treatment-Naive Patients\n\nIn clinical trials..."
    }
  ]
}
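One way to consume this output is to group region text by context thread. The sketch below relies only on field names shown in the example (`pages`, `regions`, `sequence`, `context_thread_id`, `text`); the `regions_by_thread` helper and the second region in the sample data are made up for illustration.

```python
import json
from collections import defaultdict

def regions_by_thread(result: dict) -> dict:
    """Group region texts by context_thread_id, in page and sequence order."""
    threads = defaultdict(list)
    for page in result["pages"]:
        for region in sorted(page["regions"], key=lambda r: r["sequence"]):
            threads[region["context_thread_id"]].append(region["text"])
    return dict(threads)

# Trimmed-down sample using the same field names as the example output.
sample = json.loads("""
{
  "pages": [
    {"page_number": 1, "regions": [
      {"sequence": 2, "context_thread_id": "thread_001",
       "text": "In clinical trials..."},
      {"sequence": 1, "context_thread_id": "thread_001",
       "text": "Efficacy in Treatment-Naive Patients"}
    ]}
  ]
}
""")
print(regions_by_thread(sample))
```

For a real run, replace the sample with `json.load` on the generated _extracted.json file.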
Terminal Output
╭──────────────────────────────────────────╮
│ Extracta -- Agentic Document Extraction │
╰──────────────────────────────────────────╯
File : report.pdf
Server : http://localhost:8000
Analysing layout...
Page   Layout       Strategy   Regions
1      multi_col    V-Major    12
2      single_col   H-Major    8
3      mixed        V-Major    15
Running ADE context threading...
Done -- 3 pages | 35 regions
Output : report_extracted.json
How It Works
File Uploaded
↓
[DETECT] -- scan all pages, determine H-Major or V-Major per page
↓
[EXTRACT] -- extract blocks in natural reading order using Recursive XY-Cut
↓
[ADE] -- LLM agent threads context, links segments, assigns roles
↓
JSON Output
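The EXTRACT step can be pictured with a simplified Recursive XY-Cut over bounding boxes. This is a sketch under stated assumptions, not the server's code: the function names, the `min_gap` value, and the "h_major" label are hypothetical ("v_major" appears in the example output). V-Major tries to cut between columns first, H-Major between rows first; whichever axis yields an empty gap peels off the first group, and the rest is processed recursively.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def split_along(boxes: List[Box], axis: int, min_gap: float = 5.0) -> List[List[Box]]:
    """Split boxes into groups separated by empty gaps along one axis.

    axis=0 cuts between columns (x), axis=1 cuts between rows (y).
    """
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # furthest end coordinate seen so far
    for box in ordered[1:]:
        if box[axis] - reach >= min_gap:
            groups.append([box])
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups

def xy_cut(boxes: List[Box], strategy: str = "v_major") -> List[Box]:
    """Return boxes in reading order using a simplified recursive XY-cut."""
    if len(boxes) <= 1:
        return list(boxes)
    axes = (0, 1) if strategy == "v_major" else (1, 0)
    for axis in axes:
        groups = split_along(boxes, axis)
        if len(groups) > 1:
            head = groups[0]
            tail = [b for group in groups[1:] for b in group]
            # Peel off the first group, then retry the preferred axis on the rest.
            return xy_cut(head, strategy) + xy_cut(tail, strategy)
    # No clean cut on either axis: fall back to top-to-bottom, left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))

# A full-width title above a 2x2 grid of blocks (hypothetical coordinates).
title = (50, 40, 540, 65)
a, b = (50, 100, 280, 300), (310, 100, 540, 300)   # top-left, top-right
c, d = (50, 320, 280, 500), (310, 320, 540, 500)   # bottom-left, bottom-right
print(xy_cut([d, b, title, c, a], "v_major") == [title, a, c, b, d])
print(xy_cut([d, b, title, c, a], "h_major") == [title, a, b, c, d])
```

On this input, V-Major reads down the left column before the right one, while H-Major reads each row left to right. Layouts with full-width blocks below aligned columns can still defeat a cut this simple.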
Layout Types
| Layout Type | Description |
|---|---|
| single_col | Simple single column document |
| multi_col | Two or more columns (e.g. academic papers) |
| mixed | Complex irregular layout (e.g. pharma slides) |
| table_heavy | Majority of content is tabular |
| image_heavy | Majority of content is images |
Reading Strategies
| Strategy | Description |
|---|---|
| V-Major | Vertical-first -- top to bottom within each column |
| H-Major | Horizontal-first -- left to right across each row |
Context Roles
| Role | Description |
|---|---|
| heading | Section title or heading |
| body | Main body paragraph |
| callout | Sidebar, highlighted box, callout |
| caption | Image or table caption |
| footnote | Footer or footnote text |
| continuation | Continues directly from a previous block |
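When post-processing the JSON, the roles in the table can be modelled as an enum so that unexpected values fail loudly. The `ContextRole` class and the `body_text` helper are hypothetical, and the choice of which roles count as running text (heading, body, continuation) is an assumption for this example.

```python
from enum import Enum

class ContextRole(str, Enum):
    HEADING = "heading"
    BODY = "body"
    CALLOUT = "callout"
    CAPTION = "caption"
    FOOTNOTE = "footnote"
    CONTINUATION = "continuation"

def body_text(regions: list) -> str:
    """Join the text of running-content regions in sequence order.

    Raises ValueError if a region carries a role outside the table above.
    """
    keep = {ContextRole.HEADING, ContextRole.BODY, ContextRole.CONTINUATION}
    ordered = sorted(regions, key=lambda r: r["sequence"])
    return "\n\n".join(
        r["text"] for r in ordered if ContextRole(r["context_role"]) in keep
    )
```

Passing the `regions` list of a page returns its headings and paragraphs while dropping callouts, captions, and footnotes.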
Project Structure
extracta-client/
├── extracta/
│ ├── __init__.py
│ ├── cli.py -- entry point
│ ├── client.py -- HTTP calls to extracta-server
│ └── display.py -- rich terminal output
├── pyproject.toml
└── README.md
Server
This CLI requires a running instance of extracta-server. By default it connects to http://localhost:8000.
To use a deployed server, update SERVER_URL in extracta/client.py.
Publishing to PyPI
pip install build twine
python -m build
twine upload dist/*
License
MIT -- see LICENSE
Author
Swapnil Bhattacharya -- NorthCommits
File details
Details for the file extracta_ade-0.1.0-py3-none-any.whl.
File metadata
- Download URL: extracta_ade-0.1.0-py3-none-any.whl
- Upload date:
- Size: 8.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ba48edf4cda162909c57d9d2f978b8acfaa8919c77aac12f791304a58c9c83b7 |
| MD5 | efc3d9fe82586309c406070b6494fccb |
| BLAKE2b-256 | 97e12b1f440909bf3099f42404e9f0b09ff05070814529b68f2e2e845b272aff |