Smart text extraction from PDF documents

EDS-PDF

EDS-PDF provides a modular framework to extract text information from PDF documents.

You can use it out of the box or extend it to fit your specific use case. We provide a pipeline system and various utilities for visualizing and processing PDFs, as well as multiple components to build complex models.

Visit the documentation for more information!

Getting started

Installation

Install the library with pip:

pip install edspdf
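
To confirm that the installation succeeded, one option is to query the installed version from Python. This is a minimal sketch that relies only on the standard library; the helper name `installed_version` is hypothetical and not part of EDS-PDF.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints a version string if edspdf is installed, None otherwise
    print(installed_version("edspdf"))
```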

Extracting text

Let's build a simple PDF extractor that uses a rule-based classifier. There are two ways to do this: with the configuration system or with the pipeline API.

Create a configuration file:

config.cfg
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"

and load it from Python:

import edspdf
from pathlib import Path

model = edspdf.load("config.cfg")
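
To see how the configuration file is organized, the sketch below reads the same text with Python's standard `configparser` and lists each pipeline entry with its factory and parameters. This is only an illustration of the file layout (each name in `pipeline` matching a `[components.<name>]` section); it is not how `edspdf.load` parses the file, and `describe_pipeline` is a hypothetical helper.

```python
import configparser
import json

CONFIG_TEXT = """
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"
"""


def describe_pipeline(config_text):
    """Return a list of (name, factory, params) tuples in pipeline order."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # Values in this file are JSON-compatible (quoted strings, numbers, lists)
    names = json.loads(parser["pipeline"]["pipeline"])
    described = []
    for name in names:
        section = parser[f"components.{name}"]
        params = {key: json.loads(value) for key, value in section.items()}
        factory = params.pop("@factory")
        described.append((name, factory, params))
    return described


if __name__ == "__main__":
    for name, factory, params in describe_pipeline(CONFIG_TEXT):
        print(name, factory, params)
```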

Or create a pipeline directly from Python:

from edspdf import Pipeline

model = Pipeline()
model.add_pipe("pdfminer-extractor")
model.add_pipe(
    "mask-classifier",
    config=dict(
        x0=0.2,
        x1=0.9,
        y0=0.3,
        y1=0.6,
        threshold=0.1,
    ),
)
model.add_pipe("simple-aggregator")
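
The mask parameters used in both variants (`x0`, `x1`, `y0`, `y1`, `threshold`) may be easier to read with a small geometric sketch. The code below is not EDS-PDF's implementation. It assumes that coordinates are fractions of the page width and height, that `threshold` is the minimum share of a text box's area that must fall inside the mask, and that boxes outside the mask receive a hypothetical `"pollution"` label. Check the documentation for the actual semantics.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle in relative page coordinates (assumed 0 to 1)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_ratio(box: Box, mask: Box) -> float:
    """Share of the box's area that lies inside the mask."""
    width = max(0.0, min(box.x1, mask.x1) - max(box.x0, mask.x0))
    height = max(0.0, min(box.y1, mask.y1) - max(box.y0, mask.y0))
    area = (box.x1 - box.x0) * (box.y1 - box.y0)
    return (width * height) / area if area > 0 else 0.0


def classify(box: Box, mask: Box, threshold: float) -> str:
    """Label a box 'body' when enough of it overlaps the mask (assumed rule)."""
    return "body" if overlap_ratio(box, mask) >= threshold else "pollution"


# Same values as the mask-classifier configuration
MASK = Box(x0=0.2, y0=0.3, x1=0.9, y1=0.6)
THRESHOLD = 0.1

if __name__ == "__main__":
    print(classify(Box(0.3, 0.4, 0.5, 0.5), MASK, THRESHOLD))  # box inside the mask
    print(classify(Box(0.0, 0.0, 0.1, 0.1), MASK, THRESHOLD))  # box outside the mask
```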

This pipeline can then be applied (for instance with this PDF):

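from pathlib import Path

# Read the PDF as bytes (replace this path with the location of your own file)
pdf = Path("/Users/perceval/Development/edspdf/tests/resources/letter.pdf").read_bytes()
# Apply the pipeline; it returns the processed document
pdf = model(pdf)

# Text aggregated under the "body" label
body = pdf.aggregated_texts["body"]

text, style = body.text, body.properties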

See the rule-based recipe for a step-by-step explanation of what is happening.
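
To apply the same pipeline to several files, a simple loop over paths is one possibility. The sketch below uses only what the example above shows: the model is called on the PDF bytes, and the body text is read from `aggregated_texts["body"].text`. The helper name `extract_bodies` is hypothetical, and the library may offer more efficient ways to process many documents.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def extract_bodies(model: Callable, pdf_paths: Iterable[Path]) -> Dict[str, str]:
    """Apply `model` to each PDF and return a mapping from path to body text."""
    bodies = {}
    for path in pdf_paths:
        # The pipeline takes the raw bytes of the PDF file
        doc = model(Path(path).read_bytes())
        bodies[str(path)] = doc.aggregated_texts["body"].text
    return bodies
```

For example, `extract_bodies(model, Path("pdfs").glob("*.pdf"))` would process every PDF in a hypothetical `pdfs` directory.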

Citation

If you use EDS-PDF, please cite us as below.

@software{edspdf,
  author  = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},
  doi     = {10.5281/zenodo.6902977},
  license = {BSD-3-Clause},
  title   = {{EDS-PDF: Smart text extraction from PDF documents}},
  url     = {https://github.com/aphp/edspdf}
}

Acknowledgement

We would like to thank Assistance Publique – Hôpitaux de Paris and AP-HP Foundation for funding this project.
