Easily extract data from PDFs.

Project description

Extraxt

Extraxt is a Python-based MuPDF library that enables parsing and extracting data from Healthlink PDF documents.

Core Functionality

  • Nested JSON Output: Constructs nested JSON objects reflecting the document's content.
  • Subtitle and Field Matching: Define subtitles and corresponding data fields in snake case (e.g. first_name, address_line_one, income_(secondary)).
  • Sensitive Data Configuration: Enables sensitive data controls and configuration via the API (Coming soon).

Extraxt streamlines the extraction process, converting PDF content into structured JSON for easy data manipulation and integration.

Installation

Install Extraxt

pip install extraxt

Upgrade to the latest version of Extraxt

pip install --upgrade extraxt

Using Conda with Extraxt

conda create --name [YOUR_ENV] python=3.11 -y
conda activate [YOUR_ENV]
pip install extraxt

Usage

Extraxt can consume either an asynchronous byte stream or a buffer read directly from disk.

Before you begin:

  • As of 0.0.17, matching a label such as Phone (Secondary) -> phone_(secondary) requires the parentheses to be kept in the field name. This will soon be opt-in; by default, the parentheses will be removed.
  • As of 0.0.17, sensitive data is not configurable via the API, and instead "Date of birth" is parsed as "age" only.
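
To illustrate the naming convention described above, here is a minimal sketch of how a PDF label could be turned into a matching field name. The helper to_field_name is hypothetical and is not part of Extraxt; the library's own normalisation may differ in details.

```python
import re


def to_field_name(label: str) -> str:
    """Hypothetical helper: turn a PDF label into a snake_case field name.

    Parentheses are kept, as required for matching as of 0.0.17.
    """
    # Lowercase, trim, and collapse each run of whitespace into one underscore.
    return re.sub(r"\s+", "_", label.strip().lower())


print(to_field_name("Phone (Secondary)"))  # phone_(secondary)
print(to_field_name("Address Line One"))   # address_line_one
```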

Read file from disk

To read from a buffer, open the file in binary mode with the standard Python with open(...) pattern, call .read() on the file object, and pass the resulting bytes to Extraxt together with your fields specification. fields accepts an object whose keys are user-defined subtitles and whose values are a series of snake_cased field names, each matching the exact text content of the corresponding label in your PDF document.

from extraxt import Extraxt

from .config import FIELDS

extraxt = Extraxt()


def main():
    with open("file.pdf", "rb") as buffer:
        stream = buffer.read()
        output = extraxt.read(stream, FIELDS)
        print(output)


if __name__ == "__main__":
    main()
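
The FIELDS object imported from config in the example above is not shown. A minimal sketch of what it might contain follows, assuming the values are plain lists of snake_cased field names; the subtitle keys and most field names here are illustrative, so replace them with the labels that appear in your own document.

```python
# config.py (hypothetical contents, for illustration only)
# Keys are user-chosen subtitles; values list the snake_cased fields
# expected under each subtitle in the PDF.
FIELDS = {
    "personal_details": ["first_name", "last_name"],
    "contact": ["address_line_one", "phone_(secondary)"],
    "financial": ["income_(secondary)"],
}

# Quick sanity check: every field name should already be lowercase
# and contain no spaces.
for subtitle, names in FIELDS.items():
    for name in names:
        assert name == name.lower() and " " not in name, (subtitle, name)
```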

Read file in asynchronous API

FastAPI

Extraxt is a synchronous package, so calling it directly inside a FastAPI handler will block the main thread (and with it the event loop). To perform non-blocking extraction, use asyncio and Futures to run the call off the event loop.

import traceback
import json

from fastapi import HTTPException, UploadFile
from fastapi.responses import JSONResponse
from extraxt import Extraxt

from .util import event_loop
from .config import FIELDS

extraxt = Extraxt()


async def process_file(file: UploadFile):
    try:
        # UploadFile.read() is a coroutine, so it must be awaited.
        content = await file.read()
        if not content:
            raise HTTPException(500, "Failed to read file.")
        # Run the blocking extraction off the event loop.
        content = await event_loop(extraxt.read, content, FIELDS)

    except HTTPException:
        # Re-raise deliberate HTTP errors unchanged.
        raise
    except Exception:
        tb = traceback.format_exc()
        raise HTTPException(500, f"Failed to triage file {tb}")

    return JSONResponse({
        "content": json.loads(content),
    })
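
The event_loop helper imported from .util above is user code and is not shown. One possible sketch, assuming it simply hands a blocking callable to asyncio's default thread pool, is given below; slow_upper is a hypothetical stand-in for extraxt.read.

```python
import asyncio
from functools import partial


async def event_loop(func, *args, **kwargs):
    """Run a blocking callable in the default thread pool and await its result."""
    loop = asyncio.get_running_loop()
    # run_in_executor only forwards positional arguments,
    # so bind everything up front with partial.
    return await loop.run_in_executor(None, partial(func, *args, **kwargs))


def slow_upper(text):
    # Stand-in for a blocking call such as extraxt.read(content, FIELDS).
    return text.upper()


print(asyncio.run(event_loop(slow_upper, "pdf")))  # PDF
```

Passing None as the executor uses the loop's default ThreadPoolExecutor, which is usually enough for occasional uploads; a dedicated executor may be preferable under heavy load.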

Download files

Source Distribution

extraxt-0.0.18.tar.gz (6.0 kB)

Uploaded Source

Built Distribution

extraxt-0.0.18-py3-none-any.whl (7.2 kB)

Uploaded Python 3

File details

Details for the file extraxt-0.0.18.tar.gz.

File metadata

  • Download URL: extraxt-0.0.18.tar.gz
  • Upload date:
  • Size: 6.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for extraxt-0.0.18.tar.gz
Algorithm Hash digest
SHA256 d7ae69c00095fced6935b9b84fb463846f5829586b3e6dd5a5c52003eed6cabb
MD5 8967e7331d57595b508c3992afdbef79
BLAKE2b-256 aa2db3a6de8457b88c06c2dba28f1c986e61586a8cf5902bf3c5a59a4f25bd9c

File details

Details for the file extraxt-0.0.18-py3-none-any.whl.

File metadata

  • Download URL: extraxt-0.0.18-py3-none-any.whl
  • Upload date:
  • Size: 7.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for extraxt-0.0.18-py3-none-any.whl
Algorithm Hash digest
SHA256 cd5afa37039ac0c16b91bdabb5c115e231f8d876857c4e5de041ec489232070f
MD5 0fc04e140d4f4a5c4f8b4d499055dbb1
BLAKE2b-256 d0f0c96d0b1ba62f02c14a969e1c3b5ecac663495fe0143333fb8bf2c0492d93
