PDF parser for special forms using Tabula.

ktdparser

The ktdparser library extracts data from PDF files containing KTD (complex technical documentation) forms. This README briefly describes its functionality and usage.

Installation

To install the ktdparser library, follow these steps:

  1. Install the required dependencies:
    pip install tabula-py==2.9.0 psycopg2-binary==2.9.9 openpyxl==3.1.2 PyPDF2==3.0.1 tqdm
    
  2. Install ktdparser:
    pip install ktdparser
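
tabula-py is a wrapper around tabula-java, so a Java runtime has to be available on the machine in addition to the Python packages above. The following is a minimal sketch of a pre-flight check; the helper name java_available is hypothetical and is not part of ktdparser.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is on PATH and runs successfully."""
    java = shutil.which("java")
    if java is None:
        return False
    try:
        # `java -version` prints to stderr and exits with 0 on a working install.
        result = subprocess.run([java, "-version"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Calling java_available() before parsing makes it possible to report a missing runtime with a clear message instead of failing inside the PDF extraction step.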
    

Usage

Here are two simple examples of how to use ktdparser:

With database saving

from ktdparser import KTDParser


parser = KTDParser()
parser.connect_to_db(password="password")
parser.parse_pdf("ktd.pdf", log="path/to/log.log", progressbar=True)
parser.save_to_db()
parser.save_to_file()
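
The example above passes the password as a literal. A minimal sketch of reading the connection settings from environment variables instead is shown here. The variable names (KTD_DB_PASSWORD and so on) and the helper db_kwargs_from_env are assumptions made for illustration; only the keyword names (password, user, host, port, database) come from the connect_to_db arguments listed under Methods.

```python
import os

# Hypothetical environment variable names mapped to connect_to_db keyword names.
ENV_TO_ARG = {
    "KTD_DB_PASSWORD": "password",
    "KTD_DB_USER": "user",
    "KTD_DB_HOST": "host",
    "KTD_DB_PORT": "port",
    "KTD_DB_NAME": "database",
}


def db_kwargs_from_env(environ=None):
    """Collect connection settings that are set (and non-empty) in the environment.

    Unset variables are left out so that the library's own defaults apply.
    Values are returned as strings, including the port.
    """
    env = os.environ if environ is None else environ
    kwargs = {arg: env[var] for var, arg in ENV_TO_ARG.items() if env.get(var)}
    if "password" not in kwargs:
        raise KeyError("KTD_DB_PASSWORD is not set")
    return kwargs


# Intended use (assumes ktdparser is installed and a database is reachable):
#   parser.connect_to_db(**db_kwargs_from_env())
```

Leaving unset variables out of the dictionary avoids guessing at default values that this README does not state.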

Without database saving

from ktdparser import KTDParser

parser = KTDParser()
parser.parse_pdf("ktd.pdf", progressbar=True)
parser.save_to_file("/ktd_data", "excel", from_db=False)

Methods

  1. parse_pdf: Parse the KTD file and save the results.

    Arguments:

    • file_path: Path to the PDF file to parse.
    • progressbar: Show progress indicator.
    • log: Controls logging. If True, log to the default location. If False, do not log. If a string, use it as the log file path.
    • form_top: Relative distance (%) from the top of the page to the table body, excluding table headers, given as a pair: the value for the first page of a form and the value for the other pages. Defaults to (25, 15).
    • columns: X-coordinates of the columns (9 values). Defaults to (56.07, 94.2, 130.82, 329.66, 522.58, 626.26, 659.23, 722.54, 780.11).
    • workers: Number of threads for parallel parsing.
  2. connect_to_db: Establish connection to the database.

    Arguments:

    • password: Password for the database user.
    • user: Database user name.
    • host: Database host address.
    • port: Database port number.
    • database: Database name.
  3. save_to_db: Save data to the database.

  4. save_to_file: Save data to an Excel/CSV file.

    Arguments:

    • path: Path of the output file.
    • file_type: Output file type ("csv" or "excel").
    • ktd_id: Identifier of the KTD to export (used when from_db is True). Defaults to the last KTD saved in the database.
    • from_db: Whether to read the data from the database (True) or from the parser's tables attribute (False).
  5. get_ktd_list: Get a list of all saved KTD identifiers in the database.
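
To make the form_top and columns arguments of parse_pdf more concrete, here is a rough sketch of how such values could be turned into options for tabula-py's read_pdf. It assumes that form_top is a (first page, other pages) pair and that the values are forwarded to tabula as a relative area plus column coordinates; how ktdparser actually does this internally is not documented here. The helper name tabula_options is hypothetical.

```python
# Defaults as listed for parse_pdf above.
DEFAULT_FORM_TOP = (25, 15)
DEFAULT_COLUMNS = (56.07, 94.2, 130.82, 329.66, 522.58, 626.26, 659.23, 722.54, 780.11)


def tabula_options(page, is_first_form_page,
                   form_top=DEFAULT_FORM_TOP, columns=DEFAULT_COLUMNS):
    """Build keyword arguments for tabula.read_pdf for a single page.

    Assumption: form_top[0] applies to the first page of a form and
    form_top[1] to every other page.
    """
    top = form_top[0] if is_first_form_page else form_top[1]
    return {
        "pages": page,
        # tabula area order is (top, left, bottom, right); with
        # relative_area=True the numbers are percentages of the page size.
        "area": [top, 0, 100, 100],
        "relative_area": True,
        "columns": list(columns),
    }


# Intended use (requires tabula-py and Java):
#   import tabula
#   tables = tabula.read_pdf("ktd.pdf", **tabula_options(1, True))
```

With the defaults, the first page of a form skips the top 25% of the page and the remaining pages skip the top 15%, which matches the idea that the first page carries a taller header.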
