Tool for extracting text and tables from PDF files and saving this data in docx format

Project description

pdfwordify

pdfwordify is a tool for extracting text and tables from PDF files and saving them in docx (Word) format. The project automates the transfer of information from PDF into a format that is easier to edit and process.

Features

Text extraction from PDF.
Extract text from scanned pages in a PDF (OCR).
Extract tables from PDF.
Save extracted information to a Word file.

How to use

Install Python 3.10 or newer.
Install Google Tesseract OCR.
Install the library using pip:
```
pip install pdfwordify
```
Use the command-line interface to convert from PDF to docx.
```
pdfwordify example.pdf
```

Or use it from Python:
```
from pdfwordify.converter import convert_to_docx

convert_to_docx("example.pdf")
```
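To convert several files, the Python call can be wrapped in a loop. The following is a minimal sketch and not part of pdfwordify: `find_pdfs` and `convert_all` are hypothetical helper names, and the only library call assumed is `convert_to_docx(pdf_path)` as documented here. The import is deferred so the helpers can be exercised with a stand-in converter.

```python
from pathlib import Path


def find_pdfs(directory):
    """Return the PDF files directly inside `directory`, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def convert_all(directory, convert=None):
    """Convert every PDF in `directory`; return (converted, failed) lists.

    `convert` defaults to pdfwordify's convert_to_docx. Passing another
    callable is a testing convenience, not a pdfwordify feature.
    """
    if convert is None:
        # Imported lazily so this code loads without pdfwordify installed.
        from pdfwordify.converter import convert_to_docx as convert

    converted, failed = [], []
    for pdf in find_pdfs(directory):
        try:
            convert(str(pdf))
            converted.append(pdf.name)
        except Exception as exc:  # keep going if one file fails
            failed.append((pdf.name, str(exc)))
    return converted, failed
```

Calling `convert_all("dir")` would then attempt every PDF in `dir` and report which files failed instead of stopping at the first error.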

Arguments

This section describes the converter's arguments. They apply both on the command line and in Python.

pdf_path:
- Description: The path to the input PDF file to be converted.
- Required: Yes
- Example:
  - In terminal: `pdfwordify dir/example.pdf`
  - In code: `convert_to_docx("dir/example.pdf")`
output_dir:
- Description: The path for the docx file. Can be a folder path, a path with a file name, or a full path that includes the .docx extension.
- Required: No
- Default: the directory of the input PDF file.
- Example:
  - In terminal: `pdfwordify dir/example.pdf /output/path/`
  - In code: `convert_to_docx("dir/example.pdf", "/output/path/")`
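The accepted forms of output_dir can be illustrated with a small path-resolution sketch. This is not pdfwordify's actual code: `resolve_output_path` is a hypothetical helper, and the rule used to tell a folder from a named path (a trailing separator or an existing directory means folder) is an assumption made for illustration.

```python
from pathlib import Path


def resolve_output_path(pdf_path, output=None):
    """Guess the .docx path for the documented forms of output_dir."""
    pdf = Path(pdf_path)
    default_name = pdf.stem + ".docx"

    if output is None:
        # Documented default: the directory of the input PDF.
        return pdf.parent / default_name

    text = str(output)
    out = Path(text)
    if out.suffix.lower() == ".docx":
        # Full path that already names the .docx file.
        return out
    if text.endswith(("/", "\\")) or out.is_dir():
        # Folder path: keep the PDF's base name.
        return out / default_name
    # Named path without extension (assumption): append .docx.
    return out.with_name(out.name + ".docx")
```

Under these assumptions, `resolve_output_path("dir/example.pdf", "/output/path/")` yields `/output/path/example.docx`.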
method:
- Description: Method for extracting tables from a file.
- Required: No
- Default: lattice
- Types:
  - lattice for tables whose cells are separated by ruling lines (distinct borders).
  - stream for tables without ruling lines, where whitespace separates the cells.
  - None if there are no tables in the document.
- Example:
  - In terminal: `pdfwordify --method stream dir/example.pdf`
  - In code: `convert_to_docx("example.pdf", method=None)`
lang:
- Description: Language for extracting text from images within a document using Google Tesseract OCR.
- Required: No
- Default: eng
- Note: Languages can be combined with `+`, for example `rus+eng`.
- Example:
  - In terminal: `pdfwordify --lang rus+eng dir/example.pdf`
  - In code: `convert_to_docx("example.pdf", lang="rus+eng")`
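To call the command-line interface from a script, the documented flags can be assembled into an argument list. This is a sketch under assumptions: `join_langs`, `build_cli_args`, and `run_conversion` are hypothetical helpers, only the `--method` and `--lang` flags and the positional arguments described in this section are used, and passing both flags in one call is assumed to work. How to express "no tables" on the command line is not documented here, so the sketch rejects that case rather than guessing.

```python
import subprocess


def join_langs(langs):
    """Join Tesseract language codes in the documented form, e.g. 'rus+eng'."""
    codes = [code.strip() for code in langs if code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def build_cli_args(pdf_path, output_dir=None, method="lattice", langs=("eng",)):
    """Build the argument list for the pdfwordify command without running it."""
    if method not in ("lattice", "stream"):
        # The CLI spelling for "no tables" is not documented, so it is
        # deliberately not guessed at here.
        raise ValueError(f"unsupported method for this sketch: {method!r}")
    args = ["pdfwordify", "--method", method, "--lang", join_langs(langs), pdf_path]
    if output_dir is not None:
        args.append(output_dir)
    return args


def run_conversion(pdf_path, **options):
    """Run the CLI; requires pdfwordify and Tesseract to be installed."""
    return subprocess.run(build_cli_args(pdf_path, **options), check=True)
```

Keeping argument construction separate from execution makes the command easy to inspect or log before anything is run.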

Settings

To further customize the settings, edit the config.py file.

Project details

Version 0.0.1, released Apr 28, 2024.

Download files

Source Distribution

pdfwordify-0.0.1.tar.gz (175.3 kB)

Uploaded Apr 28, 2024 Source

Built Distribution

pdfwordify-0.0.1-py3-none-any.whl (40.2 kB)

Uploaded Apr 28, 2024 Python 3

File details

Details for the file pdfwordify-0.0.1.tar.gz.

File metadata

Download URL: pdfwordify-0.0.1.tar.gz
Upload date: Apr 28, 2024
Size: 175.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.12.2

File hashes

Hashes for pdfwordify-0.0.1.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `610b27aa1f385606580dfd0b30442e67e513d5f128795972bc77c53a4200bb83` |
| MD5 | `308eef003e3f6f3215294be3df7e48c4` |
| BLAKE2b-256 | `254535a1bd525a7ca601bfc045c71761fde6e48ffaca68ff2304525ae1b31391` |
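A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using only the standard library; `sha256_of` and `matches_expected` are hypothetical helper names, and the local file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 listed above for pdfwordify-0.0.1.tar.gz.
EXPECTED_SDIST_SHA256 = "610b27aa1f385606580dfd0b30442e67e513d5f128795972bc77c53a4200bb83"


def matches_expected(path, expected=EXPECTED_SDIST_SHA256):
    """Return True if the file's SHA-256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_expected("pdfwordify-0.0.1.tar.gz")` should return `True` only for an unmodified copy of the source distribution.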

File details

Details for the file pdfwordify-0.0.1-py3-none-any.whl.

File metadata

Download URL: pdfwordify-0.0.1-py3-none-any.whl
Upload date: Apr 28, 2024
Size: 40.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.12.2

File hashes

Hashes for pdfwordify-0.0.1-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `3829fbc15677b873d3a49fe2ad78e65b488b7a9265bab6e720e88f71729608c1` |
| MD5 | `64c38263e6f5cfc69c322670a41de65f` |
| BLAKE2b-256 | `ec6c2eef6fa81eb08e31c366ad7284eab8f853b432279eba5475a49be88d13be` |
