PDF text extractor.

Project description

pdftxt

The goal of this project is to provide an API for extracting text from specific regions of a PDF document or page, and a CLI to help identify where text is located within a document.

Installation

... pip install pdftxt

Basic Command Line Usage

Let's say we have a PDF file (PDF-DOC.pdf) that looks like this:

Source File Image

Running the pdftxt command:

... pdftxt PDF-DOC.pdf

writes a visual layout of the PDF document's pages and text elements to an HTML page:

Output File Image
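
To generate layouts for several files at once, the command can be driven from Python. The following is a minimal sketch, assuming only the invocation shown above (pdftxt followed by a single PDF path) and no additional flags. The helper names build_command and render_layouts are hypothetical and not part of pdftxt. Where the HTML page is written is not stated here, so the sketch does not try to locate it.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path):
    """Return the argv list for the invocation shown above: pdftxt <file>."""
    return ["pdftxt", str(pdf_path)]


def render_layouts(folder):
    """Run pdftxt on every .pdf file in `folder` and return the paths processed.

    Assumption: pdftxt is on PATH and takes the PDF path as its only argument.
    """
    if shutil.which("pdftxt") is None:
        raise RuntimeError("pdftxt is not installed or not on PATH")
    processed = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(build_command(pdf_path), check=True)
        processed.append(pdf_path)
    return processed
```

Calling render_layouts("tests") would then run the command once per PDF in that folder, in sorted order.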

API Usage

from pdftxt import api

filepath = 'tests/Word_PDF.pdf'

with api.PdfTxtContext(filepath) as pdf:

    for page in pdf:

        # To fetch text objects from specific region
        # of the page, first define the region:
        region = api.Region(400, 300, 512, 317)

        # Initialize layout parameters:
        params = api.PdfTxtParams()

        # Then analyze that area of the page for text objects:
        text = page.analyze(region, params)

        # Do whatever it is we need to do with the results:
        for txt in text:
            print(txt.text)
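
When several regions of the same page are of interest, the loop above can be wrapped in a small helper that maps field names to extracted text. This is a sketch under stated assumptions: it relies only on page.analyze(region, params) and the .text attribute shown above. The helper extract_fields, the field name, and the idea of reusing one Region and one PdfTxtParams object across pages are illustrative assumptions, not documented behavior of pdftxt.

```python
def extract_fields(page, regions, params):
    """Map each field name to the text found in its region of one page.

    `page` is any object with the analyze(region, params) method shown above;
    each result is expected to expose a .text attribute.
    """
    fields = {}
    for name, region in regions.items():
        results = page.analyze(region, params)
        # Join the text objects found in this region into a single string.
        fields[name] = " ".join(item.text.strip() for item in results)
    return fields


def main():
    # Requires pdftxt to be installed. The field name is made up for
    # illustration; the coordinates are the ones used in the example above.
    from pdftxt import api

    regions = {"example_field": api.Region(400, 300, 512, 317)}
    params = api.PdfTxtParams()
    with api.PdfTxtContext("tests/Word_PDF.pdf") as pdf:
        for page in pdf:
            print(extract_fields(page, regions, params))
```

Keeping the regions in a dictionary puts the page coordinates in one place, which makes them easier to adjust after inspecting the HTML layout produced by the CLI.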

Download files

Source Distribution

pdftxt-0.3.0.tar.gz (13.1 kB)

Uploaded Source

Built Distribution

pdftxt-0.3.0-py3-none-any.whl (41.9 kB)

Uploaded Python 3
