extracta

Agentic Document Extraction CLI -- layout-aware, context-threaded extraction for PDF, PPTX, DOCX, and HTML.


What is Extracta?

Extracta is a CLI tool that sends your documents to the Extracta server for intelligent, agentic extraction. It does not just pull text -- it:

  • Detects layout per page (single column, multi-column, mixed, table-heavy, image-heavy)
  • Determines the correct reading order (H-Major or V-Major) using projection profile analysis and Recursive XY-Cut
  • Threads context between segments using an LLM agent (ADE -- Agentic Document Extraction)
  • Outputs structured JSON with full context metadata per region
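
As a rough illustration of what projection profile analysis means here, the sketch below counts columns on a page by projecting text boxes onto the x axis and looking for wide empty gaps. It is a generic textbook version, not Extracta's server-side code; the function names, the integer-pixel profile, and the min_gap threshold are all assumptions made for the example.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def vertical_projection(boxes: List[BBox], page_width: int) -> List[int]:
    """Count how many boxes cover each integer x position on the page."""
    profile = [0] * page_width
    for x0, _, x1, _ in boxes:
        for x in range(max(0, int(x0)), min(page_width, int(x1))):
            profile[x] += 1
    return profile


def find_gaps(profile: List[int], min_gap: int) -> List[Tuple[int, int]]:
    """Return (start, end) runs of empty positions at least min_gap wide.

    Runs touching the left or right page edge are margins, not column
    gutters, so they are ignored.
    """
    gaps = []
    start = None
    for x, count in enumerate(profile):
        if count == 0 and start is None:
            start = x
        elif count != 0 and start is not None:
            if start > 0 and x - start >= min_gap:
                gaps.append((start, x))
            start = None
    return gaps  # a run still open at the end is the right margin


def count_columns(boxes: List[BBox], page_width: int, min_gap: int = 15) -> int:
    """Each interior gap separates two columns."""
    return len(find_gaps(vertical_projection(boxes, page_width), min_gap)) + 1


two_columns = [(50, 40, 280, 700), (310, 40, 540, 700)]
print(count_columns(two_columns, page_width=600))
```

A page whose profile has no interior gap would be treated as single column under this scheme; a real detector would also need to handle headers that span both columns.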

Installation

pip install extracta-ade

Usage

extracta-extract path/to/your/document.pdf

Output is saved in the same directory as the input file, named after the input with an _extracted.json suffix. For the command above, that is document_extracted.json.
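
If you call the CLI from a script, you may want to compute where the result will land. The helper below is a small sketch that assumes the naming rule implied by the examples in this README (document.pdf gives document_extracted.json, report.pdf gives report_extracted.json); expected_output_path is a hypothetical name and not part of the package.

```python
from pathlib import Path


def expected_output_path(input_path: str) -> Path:
    """Return <same directory>/<input stem>_extracted.json (assumed naming rule)."""
    src = Path(input_path)
    return src.with_name(f"{src.stem}_extracted.json")


print(expected_output_path("path/to/your/document.pdf"))
```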

Supported Formats

Format        Extension
PDF           .pdf
PowerPoint    .pptx
Word          .docx
HTML          .html / .htm
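
To check a file before sending it, the table above can be expressed as a lookup on the file extension. This is only a sketch: the "pdf" label matches the "format" field in the example output, while the other labels and the detect_format helper are assumptions for illustration.

```python
from pathlib import Path
from typing import Optional

# Extension -> format label. Only "pdf" is confirmed by the example output;
# the remaining labels are assumed by analogy.
SUPPORTED = {
    ".pdf": "pdf",
    ".pptx": "pptx",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
}


def detect_format(path: str) -> Optional[str]:
    """Return the format label for a supported file, or None otherwise."""
    return SUPPORTED.get(Path(path).suffix.lower())


for name in ("report.PDF", "slides.pptx", "notes.txt"):
    print(name, "->", detect_format(name))
```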

Example Output

{
  "file": "report.pdf",
  "format": "pdf",
  "total_pages": 3,
  "pages": [
    {
      "page_number": 1,
      "layout_type": "multi_col",
      "strategy": "v_major",
      "regions": [
        {
          "region_id": "p1_r1",
          "type": "title",
          "text": "Efficacy in Treatment-Naive Patients",
          "bbox": { "x0": 50.0, "y0": 40.0, "x1": 540.0, "y1": 65.0 },
          "sequence": 1,
          "context_thread_id": "thread_001",
          "context_role": "heading",
          "continues_on_page": null,
          "references_region": null
        }
      ],
      "full_text": "Efficacy in Treatment-Naive Patients\n\nIn clinical trials..."
    }
  ]
}
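
A short sketch of consuming this output: it groups region ids by context_thread_id, in sequence order, using only the fields shown in the example above. The inline sample is a trimmed copy of that example; regions_by_thread is a hypothetical helper name.

```python
import json
from collections import defaultdict


def regions_by_thread(result: dict) -> dict:
    """Map each context_thread_id to its region ids, ordered by sequence."""
    threads = defaultdict(list)
    for page in result["pages"]:
        for region in sorted(page["regions"], key=lambda r: r["sequence"]):
            threads[region["context_thread_id"]].append(region["region_id"])
    return dict(threads)


# Trimmed copy of the example output. In practice, load the real file:
#   with open("report_extracted.json", encoding="utf-8") as fh:
#       result = json.load(fh)
sample = json.loads("""
{
  "file": "report.pdf",
  "pages": [
    {
      "page_number": 1,
      "regions": [
        {
          "region_id": "p1_r1",
          "sequence": 1,
          "context_thread_id": "thread_001",
          "context_role": "heading",
          "continues_on_page": null
        }
      ]
    }
  ]
}
""")

print(regions_by_thread(sample))
```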

Terminal Output

╭──────────────────────────────────────────╮
│  Extracta -- Agentic Document Extraction  │
╰──────────────────────────────────────────╯
  File   : report.pdf
  Server : http://localhost:8000

  Analysing layout...

  Page   Layout        Strategy    Regions
  1      multi_col     V-Major     12
  2      single_col    H-Major     8
  3      mixed         V-Major     15

  Running ADE context threading...

  Done -- 3 pages | 35 regions

  Output : report_extracted.json

How It Works

File Uploaded
     ↓
[DETECT]   -- scan all pages, determine H-Major or V-Major per page
     ↓
[EXTRACT]  -- extract blocks in natural reading order using Recursive XY-Cut
     ↓
[ADE]      -- LLM agent threads context, links segments, assigns roles
     ↓
JSON Output
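
The EXTRACT step names Recursive XY-Cut. The sketch below shows the general idea in its simplest form: cut the page at whitespace gaps along one axis, then recurse into each piece along the other axis. It is a generic version written for this README, not the server's implementation; the min_gap value and the fallback sort are assumptions.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _split(boxes: List[BBox], axis: int, min_gap: float) -> List[List[BBox]]:
    """Group boxes separated by whitespace of at least min_gap along one axis.

    axis=0 finds gaps between columns, axis=1 finds gaps between rows.
    """
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # furthest extent covered so far
    for box in ordered[1:]:
        if box[axis] - reach >= min_gap:
            groups.append([box])
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups


def xy_cut(boxes: List[BBox], min_gap: float = 10.0, axis: int = 1) -> List[BBox]:
    """Return boxes in reading order by alternating row and column cuts."""
    if len(boxes) <= 1:
        return list(boxes)
    groups = _split(boxes, axis, min_gap)
    if len(groups) == 1:
        # No cut on this axis, so try the other one.
        axis = 1 - axis
        groups = _split(boxes, axis, min_gap)
        if len(groups) == 1:
            # No clean cut at all: fall back to top-to-bottom, left-to-right.
            return sorted(boxes, key=lambda b: (b[1], b[0]))
    ordered: List[BBox] = []
    for group in groups:
        ordered.extend(xy_cut(group, min_gap, 1 - axis))
    return ordered


title = (50, 40, 540, 65)
left_1, left_2 = (50, 80, 280, 300), (50, 310, 280, 600)
right_1, right_2 = (310, 80, 540, 350), (310, 360, 540, 600)

for box in xy_cut([right_2, left_1, title, right_1, left_2]):
    print(box)
```

On this two-column page with a full-width title, the title comes first, then the whole left column, then the right column, which is the order a reader would follow.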

Layout Types

Layout Type    Description
single_col     Simple single column document
multi_col      Two or more columns (e.g. academic papers)
mixed          Complex irregular layout (e.g. pharma slides)
table_heavy    Majority of content is tabular
image_heavy    Majority of content is images

Reading Strategies

Strategy    Description
V-Major     Vertical-first -- top to bottom within each column
H-Major     Horizontal-first -- left to right across each row
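
To make the difference concrete, the sketch below orders the same four regions both ways using their bbox fields. It is a simplification under stated assumptions: columns and rows are found by bucketing x0 or y0 with a fixed tolerance, "h_major" is assumed to be the JSON spelling by analogy with the "v_major" value in the example output, and order_regions is a hypothetical helper.

```python
from typing import Dict, List


def order_regions(regions: List[Dict], strategy: str, tolerance: float = 20.0) -> List[Dict]:
    """Sort regions by reading strategy using coarse position buckets."""

    def bucket(value: float) -> int:
        # Positions within the same tolerance-wide band share a bucket.
        return int(value // tolerance)

    if strategy == "v_major":
        # Column first (left to right), then top to bottom inside the column.
        key = lambda r: (bucket(r["bbox"]["x0"]), r["bbox"]["y0"])
    elif strategy == "h_major":
        # Row first (top to bottom), then left to right inside the row.
        key = lambda r: (bucket(r["bbox"]["y0"]), r["bbox"]["x0"])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(regions, key=key)


def region(region_id: str, x0: float, y0: float) -> Dict:
    return {"region_id": region_id, "bbox": {"x0": x0, "y0": y0}}


grid = [region("D", 310, 300), region("B", 310, 40), region("C", 50, 300), region("A", 50, 40)]

print([r["region_id"] for r in order_regions(grid, "v_major")])
print([r["region_id"] for r in order_regions(grid, "h_major")])
```

For a two-by-two grid, V-Major reads down the left column before moving right, while H-Major reads across the top row before moving down.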

Context Roles

Role            Description
heading         Section title or heading
body            Main body paragraph
callout         Sidebar, highlighted box, callout
caption         Image or table caption
footnote        Footer or footnote text
continuation    Continues directly from a previous block
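
One way these roles can be used downstream is to keep only the main reading flow of a page and drop side material. The sketch below is an example of that, not a feature of the CLI; which roles count as main flow is an assumption, and main_text is a hypothetical helper.

```python
from typing import Dict

# Roles treated as the main reading flow (an assumption for this example).
MAIN_FLOW_ROLES = {"heading", "body", "continuation"}


def main_text(page: Dict) -> str:
    """Join the text of main-flow regions in sequence order."""
    parts = [
        r["text"]
        for r in sorted(page["regions"], key=lambda r: r["sequence"])
        if r["context_role"] in MAIN_FLOW_ROLES
    ]
    return "\n\n".join(parts)


page = {
    "page_number": 1,
    "regions": [
        {"sequence": 3, "context_role": "footnote", "text": "Illustrative footnote."},
        {"sequence": 2, "context_role": "body", "text": "In clinical trials..."},
        {"sequence": 1, "context_role": "heading", "text": "Efficacy in Treatment-Naive Patients"},
    ],
}

print(main_text(page))
```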

Project Structure

extracta-client/
├── extracta/
│   ├── __init__.py
│   ├── cli.py          -- entry point
│   ├── client.py       -- HTTP calls to extracta-server
│   └── display.py      -- rich terminal output
├── pyproject.toml
└── README.md

Server

This CLI requires a running instance of extracta-server. By default it connects to http://localhost:8000.

To use a deployed server, update SERVER_URL in extracta/client.py.


Publishing to PyPI

pip install build twine
python -m build
twine upload dist/*

License

MIT -- see LICENSE


Author

Swapnil Bhattacharya -- NorthCommits
