Tool for extracting text and tables from PDF files and saving this data in docx format

Project description

pdfwordify

pdfwordify is a tool for extracting text and tables from PDF files and saving them in docx (Word) format. It is designed to automate moving information out of PDFs into a format that is easier to edit and process.

Features

  • Extract text from PDF.
  • Extract text from scanned PDF pages using OCR.
  • Extract tables from PDF.
  • Save the extracted information to a Word file.

How to use

  • Install Python 3.10 or newer.

  • Install Google Tesseract OCR.

  • Install the library using pip:

    pip install pdfwordify
    
  • Use the command-line interface to convert from PDF to docx.

    pdfwordify example.pdf
    
  • Or use it with Python.

    from pdfwordify.converter import convert_to_docx
    
    convert_to_docx("example.pdf")
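
Before the first conversion it can help to confirm the two prerequisites listed above. The following is a minimal sketch, not part of pdfwordify: the function name `missing_prerequisites` is hypothetical, and it assumes the Tesseract executable is called `tesseract` and is reachable through PATH.

```python
import shutil
import sys
from typing import Callable, Optional


def missing_prerequisites(
    version: tuple = sys.version_info[:2],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # The README asks for Python 3.10 or newer.
    if tuple(version) < (3, 10):
        problems.append(
            f"Python 3.10+ required, found {version[0]}.{version[1]}"
        )
    # Assumption: the Tesseract executable is named "tesseract" and is on PATH.
    if which("tesseract") is None:
        problems.append("Tesseract OCR executable 'tesseract' not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

The `version` and `which` parameters exist only so the check can be exercised without changing the real environment; calling `missing_prerequisites()` with no arguments inspects the running interpreter and the current PATH.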
    

Arguments

The arguments below configure the converter. They can be used on the command line as well as in Python.

  • pdf_path:

    • Description: The path to the input PDF file to be converted.
    • Required: Yes
    • Example:
      • In terminal: pdfwordify dir/example.pdf
      • In code: convert_to_docx("dir/example.pdf")
  • output_dir:

    • Description: The path for the output docx file. It can be a folder path, a path with a file name, or a full path that includes the .docx extension.
    • Required: No
    • Default: The directory containing the input PDF.
    • Example:
      • In terminal: pdfwordify dir/example.pdf /output/path/
      • In code: convert_to_docx("dir/example.pdf", "/output/path/")
  • method:

    • Description: Method for extracting tables from a file.
    • Required: No
    • Default: lattice
    • Types:
      • lattice for tables that have distinct boundaries.

        [Image: Table with clear boundaries]
      • stream for tables that have no borders.

        [Image: Table with no borders]
      • None if there are no tables in the document.

    • Example:
      • In terminal: pdfwordify --method stream dir/example.pdf
      • In code: convert_to_docx("example.pdf", method=None)
  • lang:

    • Description: Language for extracting text from images within a document using Google Tesseract OCR.
    • Required: No
    • Default: eng
    • Note: Languages can be combined with +, for example rus+eng.
    • Example:
      • In terminal: pdfwordify --lang rus+eng dir/example.pdf
      • In code: convert_to_docx("example.pdf", lang="rus+eng")
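
The arguments above can be combined in one call. The sketch below converts every PDF in a folder; `convert_folder` is a hypothetical helper, not part of pdfwordify, and it takes the converter as a parameter so it can be tried without the library installed. Passing `output_dir` positionally together with the `method` and `lang` keywords is an assumption based on the individual examples above.

```python
from pathlib import Path
from typing import Callable, Optional


def convert_folder(
    convert: Callable[..., object],
    pdf_dir: str,
    output_dir: str,
    method: Optional[str] = "lattice",
    lang: str = "eng",
) -> list:
    """Call `convert` once per PDF in `pdf_dir`; return the PDFs processed."""
    # Only the three documented values are accepted for `method`.
    if method not in ("lattice", "stream", None):
        raise ValueError(f"unknown table extraction method: {method!r}")
    processed = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # Assumed call shape, pieced together from the documented examples:
        # convert_to_docx(pdf_path, output_dir, method=..., lang=...)
        convert(str(pdf), output_dir, method=method, lang=lang)
        processed.append(pdf)
    return processed
```

With pdfwordify installed, the real converter would be passed in, for example `convert_folder(convert_to_docx, "dir", "/output/path/", method="stream", lang="rus+eng")` after `from pdfwordify.converter import convert_to_docx`.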

Settings

To further customize the settings, edit the config.py file.

Download files

Source Distribution

pdfwordify-0.0.1.tar.gz (175.3 kB)

Uploaded Source

Built Distribution

pdfwordify-0.0.1-py3-none-any.whl (40.2 kB)

Uploaded Python 3

File details

Details for the file pdfwordify-0.0.1.tar.gz.

File metadata

  • Download URL: pdfwordify-0.0.1.tar.gz
  • Upload date:
  • Size: 175.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.2

File hashes

Hashes for pdfwordify-0.0.1.tar.gz
Algorithm Hash digest
SHA256 610b27aa1f385606580dfd0b30442e67e513d5f128795972bc77c53a4200bb83
MD5 308eef003e3f6f3215294be3df7e48c4
BLAKE2b-256 254535a1bd525a7ca601bfc045c71761fde6e48ffaca68ff2304525ae1b31391

File details

Details for the file pdfwordify-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: pdfwordify-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 40.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.2

File hashes

Hashes for pdfwordify-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 3829fbc15677b873d3a49fe2ad78e65b488b7a9265bab6e720e88f71729608c1
MD5 64c38263e6f5cfc69c322670a41de65f
BLAKE2b-256 ec6c2eef6fa81eb08e31c366ad7284eab8f853b432279eba5475a49be88d13be
