PDF text extractor.

Project description

pdftxt

The goal of this project is to provide an API for extracting text from specific regions of a PDF document or page, and a CLI that helps identify where text is located within a document.

Installation

... pip install pdftxt

Basic Command Line Usage

Let's say we have a PDF file (PDF-DOC.pdf) that looks like this:

Source File Image

The pdftxt command:

... pdftxt PDF-DOC.pdf

This writes a visual layout of the PDF document's pages and text elements to an HTML page:

Output File Image
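
To give a feel for what a "visual layout" of text elements on an HTML page can look like, here is a minimal, standalone sketch. It is not pdftxt's actual output or implementation: the `TextBox` class and the `render_page_html` function are hypothetical names made up for illustration, and the sketch assumes coordinates in points with a top-left origin (PDF pages normally use a bottom-left origin, so a real tool would have to flip the y axis).

```python
from dataclasses import dataclass
from html import escape


@dataclass
class TextBox:
    """Hypothetical text element: a bounding box plus the text inside it."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def render_page_html(boxes, page_width, page_height):
    """Render text boxes as absolutely positioned <div> elements.

    Assumption: top-left origin, one CSS pixel per PDF point.
    """
    parts = [
        f'<div style="position:relative;width:{page_width}px;'
        f'height:{page_height}px;border:1px solid #000">'
    ]
    for b in boxes:
        # The tooltip shows the coordinates, which is what makes such a
        # page useful for locating text before defining a region.
        label = f"({b.x0:g}, {b.y0:g}, {b.x1:g}, {b.y1:g})"
        parts.append(
            f'<div title="{escape(label)}" style="position:absolute;'
            f'left:{b.x0:g}px;top:{b.y0:g}px;'
            f'width:{b.x1 - b.x0:g}px;height:{b.y1 - b.y0:g}px;'
            f'outline:1px dashed #c00">{escape(b.text)}</div>'
        )
    parts.append("</div>")
    return "\n".join(parts)


html_page = render_page_html(
    [TextBox(400, 300, 512, 317, "Invoice <No.> 42")], 612, 792
)
print(html_page)
```

Hovering over a box in a page like this reveals its coordinates, which can then be copied into a region definition.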

API Usage

from pdftxt import api

filepath = 'tests/Word_PDF.pdf'

with api.PdfTxtContext(filepath) as pdf:
    for page in pdf:
        # To fetch text objects from a specific region
        # of the page, first define the region:
        region = api.Region(400, 300, 512, 317)

        # Initialize the layout parameters:
        params = api.PdfTxtParams()

        # Then analyze that area of the page for text objects:
        text = page.analyze(region, params)

        # Do whatever we need to do with the results:
        for txt in text:
            print(txt.text)
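
The idea of restricting analysis to a region can be illustrated with plain rectangle arithmetic. The sketch below is standalone and does not use pdftxt: `Rect`, `contains`, and `overlaps` are hypothetical names, and it assumes the four numbers passed to a region are (x0, y0, x1, y1), which this description does not actually state. It also does not claim to know whether pdftxt selects text by full containment or by overlap; both tests are shown.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Hypothetical rectangle, assumed to be (x0, y0, x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float


def contains(region, box):
    """True if box lies entirely inside region (edges may touch)."""
    return (region.x0 <= box.x0 and region.y0 <= box.y0
            and box.x1 <= region.x1 and box.y1 <= region.y1)


def overlaps(region, box):
    """True if box and region share any interior area."""
    return (box.x0 < region.x1 and region.x0 < box.x1
            and box.y0 < region.y1 and region.y0 < box.y1)


region = Rect(400, 300, 512, 317)
boxes = {
    "inside": Rect(410, 302, 500, 315),
    "straddling": Rect(390, 305, 420, 312),
    "outside": Rect(50, 700, 200, 715),
}

fully_inside = [name for name, box in boxes.items() if contains(region, box)]
touching = [name for name, box in boxes.items() if overlaps(region, box)]

print(fully_inside)  # ['inside']
print(touching)      # ['inside', 'straddling']
```

If an expected piece of text does not come back from a region, a box that straddles the region's edge, as in the second case, is one plausible thing to check.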

Download files

Files for pdftxt, version 0.3.2

Filename                        Size     File type  Python version
pdftxt-0.3.2-py3-none-any.whl   44.2 kB  Wheel      py3
pdftxt-0.3.2.tar.gz             13.7 kB  Source     None
