PDF text extractor.

pdftxt

The goal of this project is to provide an API for extracting text from specific regions of a PDF document or page, and a CLI that helps identify where text is located within a document.

Installation

... pip install pdftxt

Basic Command Line Usage

Let's say we have a PDF file (PDF-DOC.pdf) that looks like this:

Source File Image

The pdftxt command:

... pdftxt PDF-DOC.pdf

This writes a visual layout of the PDF document's pages and text elements to an HTML page:

Output File Image
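To render layouts for several files in one go, the command can be driven from a script. The following is a minimal sketch that relies only on the invocation shown above (`pdftxt <file>`). The helper names `layout_command` and `render_layouts` are hypothetical and not part of pdftxt, and the sketch makes no assumption about where the HTML output is written.

```python
import shutil
import subprocess
from pathlib import Path


def layout_command(pdf_path):
    """Build the documented invocation: pdftxt <file>."""
    return ["pdftxt", str(Path(pdf_path))]


def render_layouts(pdf_paths):
    """Run the pdftxt CLI once per file and return the paths processed."""
    if shutil.which("pdftxt") is None:
        raise RuntimeError("pdftxt is not on PATH; install it with pip first")
    processed = []
    for path in pdf_paths:
        # check=True raises CalledProcessError if the CLI exits with an error.
        subprocess.run(layout_command(path), check=True)
        processed.append(Path(path))
    return processed
```

Calling `render_layouts(["PDF-DOC.pdf"])` would run the same command as above for each listed file.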

API Usage

from pdftxt import api

filepath = 'tests/Word_PDF.pdf'

with api.PdfTxtContext(filepath) as pdf:

    for page in pdf:

        # To fetch text objects from a specific region
        # of the page, first define the region:
        region = api.Region(400, 300, 512, 317)

        # Initialize the layout parameters:
        params = api.PdfTxtParams()

        # Then analyze that area of the page for text objects:
        text = page.analyze(region, params)

        # Do whatever we need to do with the results:
        for txt in text:
            print(txt.text)
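
To illustrate what selecting text by region means, here is a small self-contained sketch that does not use pdftxt at all. It assumes the four numbers passed to a region are the left, top, right and bottom edges of a rectangle, and that a text object is selected only when its bounding box lies fully inside that rectangle. Both are assumptions made for illustration: pdftxt may order its coordinates differently or match on overlap instead. The `Box` and `TextItem` classes and the `items_in_region` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; (x0, y0, x1, y1) ordering is an assumption."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Box") -> bool:
        # True only when `other` lies entirely inside this box.
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and other.x1 <= self.x1 and other.y1 <= self.y1)


@dataclass(frozen=True)
class TextItem:
    """A piece of text together with its bounding box (hypothetical type)."""
    text: str
    box: Box


def items_in_region(items, region):
    """Keep the items whose bounding box is fully contained in `region`."""
    return [item for item in items if region.contains(item.box)]


region = Box(400, 300, 512, 317)
items = [
    TextItem("Invoice No. 1234", Box(405, 302, 500, 315)),  # fully inside
    TextItem("Page header", Box(50, 20, 200, 35)),          # elsewhere on the page
    TextItem("Overlaps edge", Box(500, 305, 530, 316)),     # crosses the right edge
]
selected = [item.text for item in items_in_region(items, region)]
print(selected)  # ['Invoice No. 1234']
```

Strict containment keeps the example easy to reason about. A real extractor may prefer an overlap test so that text straddling the region boundary is not silently dropped.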


Download files

Files for pdftxt, version 0.3.2
pdftxt-0.3.2-py3-none-any.whl (44.2 kB), Wheel, Python version py3
pdftxt-0.3.2.tar.gz (13.7 kB), Source
