PDF parser for special forms using Tabula.

ktdparser

The ktdparser library extracts data from PDF files containing KTD (complex technical documentation) forms. This README briefly describes its functionality and usage.

Installation

To install the ktdparser library, follow these steps:

  1. Install the required dependencies:
    pip install tabula-py==2.9.0 psycopg2-binary==2.9.9 openpyxl==3.1.2 PyPDF2==3.0.1 tqdm
    
  2. Install ktdparser:
    pip install ktdparser
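
As a quick sanity check after installation, the pinned versions from step 1 can be compared with what is actually installed. The sketch below is not part of ktdparser: `PINNED` and `find_mismatches` are illustrative names, and only the standard library is used. tqdm is left out because the install command does not pin it. Since tabula-py relies on a Java runtime, the script also reports whether `java` is on the PATH.

```python
import shutil
from importlib import metadata

# Pins copied from the install command in step 1.
PINNED = {
    "tabula-py": "2.9.0",
    "psycopg2-binary": "2.9.9",
    "openpyxl": "3.1.2",
    "PyPDF2": "3.0.1",
}


def find_mismatches(pins=PINNED, get_version=metadata.version):
    """Return {package: installed version or None} for every pin that is not met."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, installed in find_mismatches().items():
        print(f"{name}: expected {PINNED[name]}, found {installed or 'nothing'}")
    if shutil.which("java") is None:
        print("java not found on PATH; tabula-py needs a Java runtime")
```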
    

Usage

Here are two simple examples of how to use ktdparser:

With database saving

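from ktdparser import KTDParser

parser = KTDParser()
# Open the database connection; only the password is passed here
parser.connect_to_db(password="password")
# Parse the PDF, writing a log file and showing a progress bar
parser.parse_pdf("ktd.pdf", log="path/to/log.log", progressbar=True)
# Store the parsed data in the database
parser.save_to_db()
# Export the data to a file using the default arguments
parser.save_to_file()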
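
The example above hard-codes the password. One way to avoid that is to read the connect_to_db arguments from the environment. This is only a sketch under stated assumptions: `db_kwargs_from_env` and `ENV_TO_ARG` are hypothetical helpers, not part of ktdparser, and the libpq-style variable names are a choice made for illustration. It also assumes that the password is the one argument that must always be supplied, as the example suggests.

```python
import os

# Hypothetical mapping from libpq-style environment variables to the
# connect_to_db arguments listed in this README.
ENV_TO_ARG = {
    "PGPASSWORD": "password",
    "PGUSER": "user",
    "PGHOST": "host",
    "PGPORT": "port",
    "PGDATABASE": "database",
}


def db_kwargs_from_env(env=None):
    """Collect connect_to_db keyword arguments from environment variables.

    Unset or empty variables are skipped, so the library's own defaults
    (not documented here) still apply to them. Values stay strings; convert
    the port with int() if the library turns out to expect a number.
    """
    env = os.environ if env is None else env
    kwargs = {arg: env[var] for var, arg in ENV_TO_ARG.items() if env.get(var)}
    if "password" not in kwargs:
        raise RuntimeError("PGPASSWORD is not set")
    return kwargs


# Intended use (requires ktdparser and a reachable database):
#   parser = KTDParser()
#   parser.connect_to_db(**db_kwargs_from_env())
```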

Without database saving

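from ktdparser import KTDParser

parser = KTDParser()
# Parse the PDF with a progress bar; no database connection is needed
parser.parse_pdf("ktd.pdf", progressbar=True)
# Export the parsed tables to Excel without reading from the database
parser.save_to_file("/ktd_data", "excel", from_db=False)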
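
parse_pdf also accepts the layout arguments form_top and columns (see Methods). When a form deviates from the default layout, it may help to check custom values before a long parsing run. The following is a minimal sketch, not library code: `check_layout` and `parse_with_layout` are hypothetical helpers, and the checks assume that form_top is a pair of percentages, that the nine x-coordinates run left to right, and that workers is an integer.

```python
# Defaults as documented in the Methods section of this README.
DEFAULT_FORM_TOP = (25, 15)
DEFAULT_COLUMNS = (56.07, 94.2, 130.82, 329.66, 522.58,
                   626.26, 659.23, 722.54, 780.11)


def check_layout(form_top=DEFAULT_FORM_TOP, columns=DEFAULT_COLUMNS):
    """Raise ValueError if the layout values look inconsistent."""
    if len(form_top) != 2 or not all(0 <= v <= 100 for v in form_top):
        raise ValueError("form_top should be two percentages between 0 and 100")
    if len(columns) != 9:
        raise ValueError("columns should contain 9 x-coordinates")
    # Assumption: coordinates are listed from left to right.
    if any(b <= a for a, b in zip(columns, columns[1:])):
        raise ValueError("columns should be strictly increasing")


def parse_with_layout(parser, pdf_path, form_top=DEFAULT_FORM_TOP,
                      columns=DEFAULT_COLUMNS, workers=2):
    """Validate the layout, then hand everything to parser.parse_pdf."""
    check_layout(form_top, columns)
    parser.parse_pdf(pdf_path, progressbar=True, form_top=form_top,
                     columns=columns, workers=workers)


# Intended use (requires ktdparser):
#   parse_with_layout(KTDParser(), "ktd.pdf", form_top=(30, 15), workers=4)
```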

Methods

  1. parse_pdf: Parse the KTD file and save the results.

    Arguments:

    • file_path: Path to the PDF file to parse.
    • progressbar: Whether to show a progress indicator.
    • log: Logging control. If True, log to the default location. If False, do not log. If a string, use it as the log file path.
    • form_top: Relative distance (%) from the top of the page to the table, excluding table headers, given as two values: one for the first page of the form and one for the other pages. If not specified, defaults to (25, 15).
    • columns: X-coordinates of the 9 columns. If not specified, defaults to (56.07, 94.2, 130.82, 329.66, 522.58, 626.26, 659.23, 722.54, 780.11).
    • workers: Number of threads for parallel parsing.
  2. connect_to_db: Establish connection to the database.

    Arguments:

    • password: Password for the database user.
    • user: Database user name.
    • host: Database host address.
    • port: Database port number.
    • database: Database name.
  3. save_to_db: Save data to the database.

  4. save_to_file: Save data to an Excel/CSV file.

    Arguments:

    • path: Path to the file to save data.
    • file_type: File type for saving data ("csv" or "excel").
    • ktd_id: Identifier of the KTD to export (used only when from_db is True). Defaults to the last KTD saved in the database.
    • from_db: If True, data is retrieved from the database; if False, from the parser's tables attribute.
  5. get_ktd_list: Get a list of all saved KTD identifiers in the database.
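
The methods above can be combined to export every KTD stored in the database. The sketch below assumes a parser that is already connected, and that get_ktd_list returns an iterable of identifiers accepted by the ktd_id argument. `export_all` and the `ktd_<id>` file naming are hypothetical; the path is passed without an extension, as in the usage example.

```python
from pathlib import Path


def export_all(parser, out_dir, file_type="csv"):
    """Write one file per KTD stored in the database; return the paths passed on."""
    if file_type not in ("csv", "excel"):
        raise ValueError('file_type should be "csv" or "excel"')
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for ktd_id in parser.get_ktd_list():
        # Hypothetical naming scheme: one file per KTD identifier.
        target = str(out_dir / f"ktd_{ktd_id}")
        parser.save_to_file(target, file_type, ktd_id=ktd_id, from_db=True)
        written.append(target)
    return written


# Intended use (requires ktdparser and a populated database):
#   parser = KTDParser()
#   parser.connect_to_db(password="password")
#   export_all(parser, "exports", "excel")
```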
