Extract clean text from PDFs.

These details have not been verified by PyPI

Project links

Homepage

Project description

txt-from-pdf: Extract clean text from PDFs

Extracts text from PDFs using pdfminer.six and pypdf. Adapted from PDFextract.

Installation

pip install txt-from-pdf

Usage

from txtfrompdf import extract_txt_from_pdf

pdf_path = "file.pdf"
text = extract_txt_from_pdf(pdf_path)
print(text)
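
For a folder of PDFs, the same call can be wrapped in a small loop. The following is a minimal sketch and not part of the package: `extract_directory` is a hypothetical helper, and the extractor is passed in as an argument, so the helper only assumes a function that takes a path string and returns the text as a string (as `extract_txt_from_pdf` appears to do in the example above).

```python
from pathlib import Path
from typing import Callable, List


def extract_directory(
    input_dir: str,
    output_dir: str,
    extract: Callable[[str], str],
) -> List[Path]:
    """Run `extract` on every *.pdf in input_dir and write one .txt per PDF.

    Hypothetical helper for illustration; not part of txt-from-pdf.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    # sorted() keeps the processing order deterministic across platforms
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        text = extract(str(pdf))
        target = out / (pdf.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

With the package installed, this could be called as `extract_directory("dir-with-pdfs", "extracted-text", extract_txt_from_pdf)` after the import shown above.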

CLI Usage

Single file:

txt-from-pdf --input file.pdf --output extracted-text

Multiple files in a directory:

txt-from-pdf --input dir-with-pdfs --output extracted-text

Detailed help:

usage: txt-from-pdf [-h] --input INPUT [--output OUTPUT] [--no_filter] [--size SIZE]

txt-from-pdf CLI - Extracts cleaned text from PDF files

options:
  -h, --help       show this help message and exit
  --input INPUT    Path to a folder containing PDFs or to a single PDF file. (Required)
  --output OUTPUT  Output location for the extracted text files. (Optional, default: 'extracted_text')
  --no_filter      Turn off cleaning the resulting text files. (Optional)
  --size SIZE      Maximum file size per page in bytes for processing (mostly images). (Optional, default: 300000)
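
To make the defaults above concrete, here is a minimal `argparse` sketch that declares the same four options. It is an illustration of the interface described in the help text, not the package's actual source; in particular, parsing `--size` as an integer is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the options listed in the help text (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="txt-from-pdf",
        description="txt-from-pdf CLI - Extracts cleaned text from PDF files",
    )
    parser.add_argument(
        "--input",
        required=True,
        help="Path to a folder containing PDFs or to a single PDF file.",
    )
    parser.add_argument(
        "--output",
        default="extracted_text",
        help="Output location for the extracted text files.",
    )
    parser.add_argument(
        "--no_filter",
        action="store_true",
        help="Turn off cleaning the resulting text files.",
    )
    parser.add_argument(
        "--size",
        type=int,  # assumption: the value is treated as an integer byte count
        default=300000,
        help="Maximum file size per page in bytes for processing.",
    )
    return parser


# Only --input is required; omitted options fall back to their defaults.
args = build_parser().parse_args(["--input", "file.pdf", "--size", "500000"])
print(args.input, args.output, args.no_filter, args.size)
# prints: file.pdf extracted_text False 500000
```

Note that the default output location is `extracted_text` (with an underscore), whereas the examples above pass `extracted-text` explicitly.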

Release history

1.3.1 (Jul 27, 2024)
1.3.0 (Jul 26, 2024)
1.2.3 (Jul 25, 2024)
1.2.2 (Jul 25, 2024)
1.2.1 (Jul 24, 2024)
1.2.0 (Jul 24, 2024), this version
1.1.1 (May 10, 2024)
1.1.0 (May 3, 2024)
1.0.1 (Apr 18, 2024)
1.0.0 (Apr 18, 2024)

Download files

Source Distribution

txt-from-pdf-1.2.0.tar.gz (15.5 kB)

Uploaded Jul 24, 2024 Source

Built Distribution

txt_from_pdf-1.2.0-py3-none-any.whl (16.7 kB)

Uploaded Jul 24, 2024 Python 3

Hashes for txt-from-pdf-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`f9f0333947ec576e4dcea50f16c85a3f5520936b7cd8f2394fd99fad39a185c4`
MD5	`8f2d8610746d4239e9b86ec200389794`
BLAKE2b-256	`4d7a1ac88cfc323787f881d04228c894b4b76b4bf0810e8544fd38d9b34161f9`

Hashes for txt_from_pdf-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`209182b24bbfd0a2b0d329ab0dd32a342978f2cf98f89c351ee3f18a24771839`
MD5	`0d2263693475307383eb2b31b23ffdbc`
BLAKE2b-256	`0a6defa9f36a2765f45054dc087fde6ee3d0c898bfb90a825746a305d5656f89`
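
A downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `sha256_of` and `matches_published_digest` are hypothetical and chosen only for this illustration.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF)
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("txt-from-pdf-1.2.0.tar.gz", "f9f0333947ec576e4dcea50f16c85a3f5520936b7cd8f2394fd99fad39a185c4")` should return `True` for an unmodified copy of the source distribution.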