extracta

Agentic Document Extraction CLI -- layout-aware, context-threaded extraction for PDF, PPTX, DOCX, and HTML.

What is Extracta?

Extracta is a CLI tool that sends your documents to a running extracta-server instance for agentic extraction. It does more than pull text. It:

Detects layout per page (single column, multi-column, mixed, table-heavy, image-heavy)
Determines the correct reading order (H-Major or V-Major) using projection profile analysis and Recursive XY-Cut
Threads context between segments using an LLM agent (ADE -- Agentic Document Extraction)
Outputs structured JSON with full context metadata per region

Installation

pip install extracta-ade

Usage

extracta-extract path/to/your/document.pdf

Output is saved in the same directory as the input file, named after the input with an _extracted.json suffix (for example, document.pdf produces document_extracted.json).
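For scripting around the CLI, the output location can be derived from the input path. The following is a minimal sketch, assuming the suffix replaces the input file's extension as in the example; the helper name `expected_output_path` is hypothetical and not part of the package.

```python
from pathlib import Path


def expected_output_path(input_path: str) -> Path:
    """Return the path where the CLI is described as saving its JSON output.

    Hypothetical helper: it only mirrors the naming rule from this README.
    """
    source = Path(input_path)
    # Same directory as the input; the extension is replaced by "_extracted.json".
    return source.with_name(f"{source.stem}_extracted.json")


print(expected_output_path("reports/report.pdf"))
```

A wrapper script could call this after running `extracta-extract` to locate and load the result.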

Supported Formats

Format	Extension
PDF	`.pdf`
PowerPoint	`.pptx`
Word	`.docx`
HTML	`.html` / `.htm`

Example Output

{
  "file": "report.pdf",
  "format": "pdf",
  "total_pages": 3,
  "pages": [
    {
      "page_number": 1,
      "layout_type": "multi_col",
      "strategy": "v_major",
      "regions": [
        {
          "region_id": "p1_r1",
          "type": "title",
          "text": "Efficacy in Treatment-Naive Patients",
          "bbox": { "x0": 50.0, "y0": 40.0, "x1": 540.0, "y1": 65.0 },
          "sequence": 1,
          "context_thread_id": "thread_001",
          "context_role": "heading",
          "continues_on_page": null,
          "references_region": null
        }
      ],
      "full_text": "Efficacy in Treatment-Naive Patients\n\nIn clinical trials..."
    }
  ]
}
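To show how the output might be consumed, here is a short sketch that walks regions in reading order and groups them by context thread. It assumes `sequence` is the reading-order index within a page, which the example suggests but the README does not state. The second region in the sample is hypothetical, and the function names are illustrative only.

```python
import json
from collections import defaultdict

# Trimmed sample in the shape shown above. Region "p1_r2" is hypothetical,
# added only so the sort and grouping have something to do.
SAMPLE = json.loads("""
{
  "file": "report.pdf",
  "pages": [
    {
      "page_number": 1,
      "regions": [
        {"region_id": "p1_r2", "sequence": 2,
         "context_thread_id": "thread_001", "context_role": "body"},
        {"region_id": "p1_r1", "sequence": 1,
         "context_thread_id": "thread_001", "context_role": "heading"}
      ]
    }
  ]
}
""")


def regions_in_reading_order(doc):
    """Yield (page_number, region) pairs, sorted by `sequence` within each page."""
    for page in doc["pages"]:
        for region in sorted(page["regions"], key=lambda r: r["sequence"]):
            yield page["page_number"], region


def group_by_thread(doc):
    """Map each context_thread_id to its region_ids in reading order."""
    threads = defaultdict(list)
    for _, region in regions_in_reading_order(doc):
        threads[region["context_thread_id"]].append(region["region_id"])
    return dict(threads)


print(group_by_thread(SAMPLE))
```

In practice the dictionary would come from `json.load` on the `_extracted.json` file rather than from an inline sample.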

Terminal Output

╭──────────────────────────────────────────╮
│  Extracta -- Agentic Document Extraction  │
╰──────────────────────────────────────────╯
  File   : report.pdf
  Server : http://localhost:8000

  Analysing layout...

  Page   Layout        Strategy    Regions
  1      multi_col     V-Major     12
  2      single_col    H-Major     8
  3      mixed         V-Major     15

  Running ADE context threading...

  Done -- 3 pages | 35 regions

  Output : report_extracted.json

How It Works

File Uploaded
     ↓
[DETECT]   -- scan all pages, determine H-Major or V-Major per page
     ↓
[EXTRACT]  -- extract blocks in natural reading order using Recursive XY-Cut
     ↓
[ADE]      -- LLM agent threads context, links segments, assigns roles
     ↓
JSON Output
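The EXTRACT step names Recursive XY-Cut. The sketch below is a simplified textbook-style version that works on block bounding boxes, not the server's actual implementation, which is not shown here. It tries horizontal cuts first, then vertical cuts, and recurses into each piece; the `min_gap` threshold and all function names are assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def _split_on_gaps(boxes: List[Box], lo: int, hi: int, min_gap: float) -> List[List[Box]]:
    """Group boxes separated by empty gaps along one axis.

    `lo` and `hi` are the tuple indices of the start and end coordinate
    on that axis: (0, 2) for x, (1, 3) for y.
    """
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # furthest extent covered by the current group
    for box in ordered[1:]:
        if box[lo] - reach >= min_gap:
            groups.append([box])  # whitespace gap found: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> List[Box]:
    """Return boxes in reading order by recursively cutting on whitespace gaps."""
    if len(boxes) <= 1:
        return list(boxes)
    # Horizontal cuts: split into bands stacked top to bottom.
    rows = _split_on_gaps(boxes, 1, 3, min_gap)
    if len(rows) > 1:
        return [b for row in rows for b in xy_cut(row, min_gap)]
    # Vertical cuts: split into columns ordered left to right.
    cols = _split_on_gaps(boxes, 0, 2, min_gap)
    if len(cols) > 1:
        return [b for col in cols for b in xy_cut(col, min_gap)]
    # No clean cut exists: fall back to top-to-bottom, then left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))


# A full-width title above two columns of two paragraphs each.
title = (50.0, 40.0, 540.0, 65.0)
left1, left2 = (50.0, 80.0, 280.0, 200.0), (50.0, 215.0, 280.0, 400.0)
right1, right2 = (310.0, 80.0, 540.0, 180.0), (310.0, 195.0, 540.0, 400.0)

ordered = xy_cut([right2, left1, title, right1, left2])
```

For this page, the title is cut off first, then the remaining blocks are split into a left and a right column, so the left column is read fully before the right one.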

Layout Types

Layout Type	Description
single_col	Simple single column document
multi_col	Two or more columns (e.g. academic papers)
mixed	Complex irregular layout (e.g. pharma slides)
table_heavy	Majority of content is tabular
image_heavy	Majority of content is images

Reading Strategies

Strategy	Description
V-Major	Vertical-first -- top to bottom within each column
H-Major	Horizontal-first -- left to right across each row
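The difference between the two strategies can be illustrated with a simple sort. This sketch assumes blocks sit on a clean grid, so a block's left edge identifies its column and its top edge identifies its row; real pages need column and row detection first. The string `"v_major"` appears in the example output, while `"h_major"` is assumed by analogy.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def order_blocks(boxes: List[Box], strategy: str) -> List[Box]:
    """Order grid-aligned blocks by the given reading strategy."""
    if strategy == "v_major":
        # Vertical-first: finish each column top to bottom before moving right.
        return sorted(boxes, key=lambda b: (b[0], b[1]))
    if strategy == "h_major":
        # Horizontal-first: finish each row left to right before moving down.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    raise ValueError(f"unknown strategy: {strategy!r}")


# A 2x2 grid: a and b on the top row, c and d on the bottom row.
a, b = (0.0, 0.0, 100.0, 50.0), (120.0, 0.0, 220.0, 50.0)
c, d = (0.0, 60.0, 100.0, 110.0), (120.0, 60.0, 220.0, 110.0)

v_order = order_blocks([d, b, c, a], "v_major")
h_order = order_blocks([d, b, c, a], "h_major")
```

Here V-Major yields a, c, b, d and H-Major yields a, b, c, d.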

Context Roles

Role	Description
heading	Section title or heading
body	Main body paragraph
callout	Sidebar, highlighted box, callout
caption	Image or table caption
footnote	Footer or footnote text
continuation	Continues directly from a previous block
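When consuming the output in typed code, the roles in the table can be modelled as an enumeration so that unexpected values fail loudly. This is a sketch for client-side validation only; the class name `ContextRole` is hypothetical and the package is not known to ship such a type.

```python
from enum import Enum


class ContextRole(str, Enum):
    """Role values listed in the Context Roles table."""

    HEADING = "heading"
    BODY = "body"
    CALLOUT = "callout"
    CAPTION = "caption"
    FOOTNOTE = "footnote"
    CONTINUATION = "continuation"


def parse_role(value: str) -> ContextRole:
    """Convert a raw `context_role` string; raises ValueError if it is unknown."""
    return ContextRole(value)


role = parse_role("heading")
```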

Project Structure

extracta-client/
├── extracta/
│   ├── __init__.py
│   ├── cli.py          -- entry point
│   ├── client.py       -- HTTP calls to extracta-server
│   └── display.py      -- rich terminal output
├── pyproject.toml
└── README.md

Server

This CLI requires a running instance of extracta-server. By default it connects to http://localhost:8000.

To use a deployed server, update SERVER_URL in extracta/client.py.
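If editing the source is inconvenient, one common pattern is to resolve the URL from an environment variable with the documented default as a fallback. The sketch below is hypothetical: the variable name `EXTRACTA_SERVER_URL` and the function `resolve_server_url` are assumptions, and the package as described only reads SERVER_URL from extracta/client.py.

```python
import os
from typing import Mapping, Optional

DEFAULT_SERVER_URL = "http://localhost:8000"  # default stated in this README


def resolve_server_url(environ: Optional[Mapping[str, str]] = None) -> str:
    """Pick the server URL from the environment, falling back to the default.

    EXTRACTA_SERVER_URL is a hypothetical variable name used for illustration.
    """
    env = os.environ if environ is None else environ
    url = env.get("EXTRACTA_SERVER_URL", DEFAULT_SERVER_URL)
    # Strip trailing slashes so paths can be appended consistently.
    return url.rstrip("/")


SERVER_URL = resolve_server_url()
```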

Publishing to PyPI

pip install build twine
python -m build
twine upload dist/*

License

MIT -- see LICENSE

Author

Swapnil Bhattacharya -- NorthCommits


Release

Version 0.1.0, released May 3, 2026

Download files

Source distribution: none available for this release.

Built distribution: extracta_ade-0.1.0-py3-none-any.whl (8.0 kB), uploaded May 3, 2026, Python 3

File metadata

File: extracta_ade-0.1.0-py3-none-any.whl
Upload date: May 3, 2026
Size: 8.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for extracta_ade-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ba48edf4cda162909c57d9d2f978b8acfaa8919c77aac12f791304a58c9c83b7`
MD5	`efc3d9fe82586309c406070b6494fccb`
BLAKE2b-256	`97e12b1f440909bf3099f42404e9f0b09ff05070814529b68f2e2e845b272aff`
