PDF parser for special forms using Tabula.

ktdparser

The ktdparser library is designed to extract data from a PDF file containing information about KTD (complex technical documentation). This README briefly describes the functionality and usage of ktdparser.

Installation

To install the ktdparser library, follow these steps:

  1. Install the required dependencies:
    pip install tabula-py==2.9.0 psycopg2-binary==2.9.9 openpyxl==3.1.2 PyPDF2==3.0.1 tqdm
    
  2. Install ktdparser:
    pip install ktdparser
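
Because the dependencies above are pinned to exact versions, it can help to confirm what is actually installed before parsing. The following is a minimal sketch using only the standard library; the `check_pins` helper and its `get_version` parameter are hypothetical names introduced for illustration and are not part of ktdparser. tqdm is left out because the install command does not pin it.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install command above (tqdm is unpinned there).
PINNED = {
    "tabula-py": "2.9.0",
    "psycopg2-binary": "2.9.9",
    "openpyxl": "3.1.2",
    "PyPDF2": "3.0.1",
}


def check_pins(pins, get_version=version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every pinned package is present at the expected version.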
    

Usage

The two examples below show how to use ktdparser with and without saving to a database:

With database saving

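from ktdparser import KTDParser

# Parse the PDF, store the result in the database, then export it to a file.
parser = KTDParser()
parser.connect_to_db(password="password")  # other connection arguments are omitted here
parser.parse_pdf("ktd.pdf", log="path/to/log.log", progressbar=True)
parser.save_to_db()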
parser.save_to_file()
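
The example above hard-codes the database password. One common alternative is to read the connection settings from environment variables and pass them to connect_to_db as keyword arguments. This is a sketch under assumptions: the `KTD_DB_*` variable names and the `db_settings_from_env` helper are hypothetical, the password is treated as required only because the example passes it alone, and converting the port to an integer is a guess about what the method expects.

```python
import os


def db_settings_from_env(env=os.environ):
    """Collect connect_to_db keyword arguments from environment variables.

    The KTD_DB_* names are hypothetical. Only variables that are set are
    returned, so any unset argument is simply not passed to connect_to_db.
    """
    names = {
        "password": "KTD_DB_PASSWORD",
        "user": "KTD_DB_USER",
        "host": "KTD_DB_HOST",
        "port": "KTD_DB_PORT",
        "database": "KTD_DB_NAME",
    }
    settings = {arg: env[var] for arg, var in names.items() if var in env}
    if "password" not in settings:
        raise KeyError("KTD_DB_PASSWORD is not set")
    if "port" in settings:
        # Assumption: the port is expected as a number rather than a string.
        settings["port"] = int(settings["port"])
    return settings


# Usage (requires ktdparser and a reachable database):
# parser = KTDParser()
# parser.connect_to_db(**db_settings_from_env())
```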

Without database saving

from ktdparser import KTDParser

# Parse the PDF and export the parsed data directly, with no database involved.
parser = KTDParser()
parser.parse_pdf("ktd.pdf", progressbar=True)
parser.save_to_file("/ktd_data", "excel", from_db=False)
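
The database-free flow extends naturally to a folder of PDFs. The sketch below uses only the calls shown in the example above; the `export_all` helper and its `parser_factory` parameter are hypothetical and exist so the loop can be exercised without ktdparser installed. It assumes a fresh parser per file is the safe choice, and it passes an output path without an extension, mirroring the example, since this README does not say whether an extension is required.

```python
from pathlib import Path


def export_all(pdf_paths, out_dir, parser_factory, file_type="excel"):
    """Parse each PDF with a fresh parser and export it without a database.

    parser_factory is a hypothetical hook: pass KTDParser in real use.
    Returns the output paths that were handed to save_to_file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for pdf in map(Path, pdf_paths):
        parser = parser_factory()  # new parser so no state carries over
        parser.parse_pdf(str(pdf), progressbar=True)
        target = str(out_dir / pdf.stem)  # e.g. out_dir/ktd for ktd.pdf
        parser.save_to_file(target, file_type, from_db=False)
        outputs.append(target)
    return outputs


# Usage (requires ktdparser):
# from ktdparser import KTDParser
# export_all(Path("incoming").glob("*.pdf"), "exports", KTDParser)
```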

Methods

  1. parse_pdf: Parse the KTD PDF file and keep the results in memory (in the tables attribute) for later saving.

    Arguments:

    • file_path: Path to the PDF file to parse.
    • progressbar: Show progress indicator.
    • log: Logging control. True logs to the default location, False disables logging, and a string is used as the path of the log file.
    • form_top: Relative distance (%) from the top of the page to the table body, excluding the table headers, given as a pair: the first value applies to the first page of a form and the second to the other pages. Defaults to (25, 15).
    • columns: X-coordinates of the columns (9 values). Defaults to (56.07, 94.2, 130.82, 329.66, 522.58, 626.26, 659.23, 722.54, 780.11).
    • workers: Number of threads for parallel parsing.
  2. connect_to_db: Establish connection to the database.

    Arguments:

    • password: Password for the database user.
    • user: Database user name.
    • host: Database host address.
    • port: Database port number.
    • database: Database name.
  3. save_to_db: Save data to the database.

  4. save_to_file: Save data to an Excel/CSV file.

    Arguments:

    • path: Path of the output file.
    • file_type: Output file type ("csv" or "excel").
    • ktd_id: Identifier of the KTD to export (used when from_db is True). Defaults to the last KTD saved in the database.
    • from_db: If True, export data retrieved from the database; if False, export the data held in the tables attribute.
  5. get_ktd_list: Get a list of all saved KTD identifiers in the database.
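
To make the form_top and columns arguments of parse_pdf more concrete, here is a sketch of how such values could be turned into options for tabula-py's `read_pdf`. It is not ktdparser's actual implementation: the `tabula_options` helper, its parameters, and the choice of `stream` mode with `guess=False` are assumptions made for illustration. It relies on tabula-py accepting `area` as (top, left, bottom, right) percentages when `relative_area=True`, while `columns` stays in absolute X-coordinates.

```python
# Defaults quoted from the parse_pdf argument list above.
DEFAULT_FORM_TOP = (25, 15)
DEFAULT_COLUMNS = (56.07, 94.2, 130.82, 329.66, 522.58,
                   626.26, 659.23, 722.54, 780.11)


def tabula_options(pdf_page, first_page_of_form,
                   form_top=DEFAULT_FORM_TOP, columns=DEFAULT_COLUMNS):
    """Build keyword arguments for tabula.read_pdf for one page.

    pdf_page is the 1-based page number in the PDF. first_page_of_form
    selects which of the two form_top values is used as the top edge.
    """
    top = form_top[0] if first_page_of_form else form_top[1]
    return {
        "pages": pdf_page,
        "area": [top, 0, 100, 100],  # top, left, bottom, right in percent
        "relative_area": True,       # interpret area as percentages
        "columns": list(columns),    # absolute X-coordinates of the columns
        "guess": False,              # use the given area instead of detection
        "stream": True,              # assumption: whitespace-separated tables
    }


# Usage (requires tabula-py and a Java runtime):
# import tabula
# frames = tabula.read_pdf("ktd.pdf", **tabula_options(1, True))
```

Skipping the top 25% on the first page of a form and 15% on the others keeps header rows out of the extracted table, which matches the description of form_top above.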
