
Smart text extraction from PDF documents



EDS-PDF

EDS-PDF provides a modular framework to extract text information from PDF documents.

You can use it out-of-the-box, or extend it to fit your specific use case. We provide a pipeline system and various utilities for visualizing and processing PDFs, as well as multiple components to build complex models.

Visit the documentation for more information!

Getting started

Installation

Install the library with pip:

pip install edspdf
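To confirm that the install succeeded, one option is to query the installed distribution's version from Python. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical and not part of EDS-PDF.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string if edspdf is installed, otherwise None.
print(installed_version("edspdf"))
```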

Extracting text

Let's build a simple PDF extractor that uses a rule-based classifier. There are two ways to do this, either by using the configuration system or by using the pipeline API.

Create a configuration file:

config.cfg
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"

and load it from Python:

import edspdf
from pathlib import Path

model = edspdf.load("config.cfg")
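Before loading, it can help to check that every name listed under `[pipeline]` has a matching `[components.<name>]` section. The sketch below is an illustration only: it reads the configuration shown above with the standard-library `configparser` and `json` modules, which is an assumption made for this example and not necessarily how `edspdf.load` parses the file. The helper name `summarize_config` is hypothetical.

```python
import json
from configparser import ConfigParser

CONFIG_TEXT = """
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"
"""


def summarize_config(text: str) -> dict:
    """Map each pipeline component name to its decoded settings.

    Raises KeyError if a component listed in [pipeline] has no section.
    """
    parser = ConfigParser(interpolation=None)
    parser.read_string(text)
    # Values in this file happen to be valid JSON literals (assumption).
    names = json.loads(parser["pipeline"]["pipeline"])
    summary = {}
    for name in names:
        section = f"components.{name}"
        if section not in parser:
            raise KeyError(f"missing section [{section}]")
        summary[name] = {key: json.loads(value) for key, value in parser[section].items()}
    return summary


for name, settings in summarize_config(CONFIG_TEXT).items():
    print(name, settings)
```

A missing or misspelled section name surfaces here as a `KeyError` that names the offending section.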

Or create a pipeline directly from Python:

from edspdf import Pipeline

model = Pipeline()
model.add_pipe("pdfminer-extractor")
model.add_pipe(
    "mask-classifier",
    config=dict(
        x0=0.2,
        x1=0.9,
        y0=0.3,
        y1=0.6,
        threshold=0.1,
    ),
)
model.add_pipe("simple-aggregator")

This pipeline can then be applied (for instance with this PDF):

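from pathlib import Path

# Read a PDF as raw bytes (replace the path with your own file)
pdf = Path("/Users/perceval/Development/edspdf/tests/resources/letter.pdf").read_bytes()

# Run the pipeline on the bytes
pdf = model(pdf)

# Retrieve the aggregated text labelled "body"
body = pdf.aggregated_texts["body"]

# Extracted text and its associated style properties
text, style = body.text, body.properties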
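To apply the same pipeline to several files, the calls above can be wrapped in a small loop. This is a minimal sketch that relies only on what is shown in this section (`model(bytes)`, `aggregated_texts["body"]` and `.text`); the helper name `extract_bodies` and the folder layout are hypothetical, and EDS-PDF may offer its own batch facilities.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_bodies(model: Callable, folder: Path) -> Dict[str, str]:
    """Run `model` on every *.pdf file in `folder` and collect the body text.

    `model` is expected to behave like the pipeline built above: it takes
    the PDF content as bytes and returns an object exposing
    `aggregated_texts["body"].text`.
    """
    results = {}
    for path in sorted(folder.glob("*.pdf")):
        doc = model(path.read_bytes())
        results[path.name] = doc.aggregated_texts["body"].text
    return results
```

For example, `extract_bodies(model, Path("pdfs"))` would return a dictionary keyed by file name, assuming a local `pdfs` directory exists.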

See the rule-based recipe for a step-by-step explanation of what is happening.
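As a rough intuition for the mask parameters used above, the sketch below classifies a text box by how much of it falls inside a rectangular mask. It is not EDS-PDF's implementation. It assumes that `x0`, `x1`, `y0` and `y1` are relative page coordinates between 0 and 1, that `threshold` is the minimum fraction of a box's area that must lie inside the mask, and that rejected boxes get a `"pollution"` label. The `Box`, `overlap_ratio` and `classify` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in relative page coordinates (0 to 1)."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        # Degenerate or inverted rectangles count as empty.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_ratio(box: Box, mask: Box) -> float:
    """Fraction of the area of `box` that lies inside `mask`."""
    intersection = Box(
        x0=max(box.x0, mask.x0),
        y0=max(box.y0, mask.y0),
        x1=min(box.x1, mask.x1),
        y1=min(box.y1, mask.y1),
    )
    return intersection.area / box.area if box.area > 0 else 0.0


def classify(box: Box, mask: Box, threshold: float = 0.1) -> str:
    """Label a box "body" when enough of it overlaps the mask (assumed rule)."""
    return "body" if overlap_ratio(box, mask) >= threshold else "pollution"


# Same values as the mask-classifier settings above.
MASK = Box(x0=0.2, y0=0.3, x1=0.9, y1=0.6)

print(classify(Box(0.3, 0.4, 0.5, 0.5), MASK))  # box fully inside the mask
print(classify(Box(0.0, 0.0, 0.1, 0.1), MASK))  # box fully outside the mask
```

Under these assumptions, a low threshold such as 0.1 keeps boxes that only partly overlap the mask, while a threshold of 1 would require full inclusion.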

Citation

If you use EDS-PDF, please cite us as below.

@software{edspdf,
  author  = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},
  doi     = {10.5281/zenodo.6902977},
  license = {BSD-3-Clause},
  title   = {{EDS-PDF: Smart text extraction from PDF documents}},
  url     = {https://github.com/aphp/edspdf}
}

Acknowledgement

We would like to thank Assistance Publique – Hôpitaux de Paris and AP-HP Foundation for funding this project.

