txt-from-pdf: Extract clean text from PDFs


Extracts text from PDFs using PyMuPDF, with a focus on cleaning and formatting the extracted text.

Installation

pip install txt-from-pdf

Usage

from txtfrompdf import extract_txt_from_pdf

pdf_path = "file.pdf"
text = extract_txt_from_pdf(pdf_path)
print(text)
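
The same call can be applied to a whole folder from Python. The following is a minimal sketch, not part of the package: `extract_directory` is a hypothetical helper, and it assumes only what the usage above shows, namely that the extractor takes a path string and returns the text as a string. The extractor is passed in as an argument so the helper can also be tried with a stand-in function.

```python
from pathlib import Path
from typing import Callable, List


def extract_directory(
    input_dir: str,
    output_dir: str,
    extract: Callable[[str], str],
) -> List[Path]:
    """Run `extract` on every *.pdf in input_dir and write one .txt file per PDF.

    Hypothetical helper for illustration; `extract` is expected to behave like
    extract_txt_from_pdf in the usage above (path in, text out).
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    written: List[Path] = []
    # sorted() gives a stable processing order; only the top level is scanned.
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        text = extract(str(pdf))
        target = out / (pdf.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


# Example call, assuming the package is installed:
#   from txtfrompdf import extract_txt_from_pdf
#   extract_directory("dir-with-pdfs", "extracted-text", extract_txt_from_pdf)
```

The output file naming (`<pdf name>.txt`) is a choice made for this sketch and may differ from what the CLI writes.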

CLI Usage

Single file:

txt-from-pdf --input file.pdf --output extracted-text 

Multiple files in a directory:

txt-from-pdf --input dir-with-pdfs --output extracted-text 

Detailed help:

usage: txt-from-pdf [-h] --input INPUT [--output OUTPUT] [--no_filter] [--size SIZE]

txt-from-pdf CLI - Extracts cleaned text from PDF files

options:
  -h, --help       show this help message and exit
  --input INPUT    Path to a folder containing PDFs or to a single PDF file. (Required)
  --output OUTPUT  Output location for the extracted text files. (Optional, default: 'extracted_text')
  --no_filter      Turn off cleaning the resulting text files. (Optional)
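
When the CLI needs to be driven from a script, the documented flags can be assembled into an argument list and passed to `subprocess`. This is a sketch under assumptions: `build_command` and `run_cli` are hypothetical names, only the flags described in the options list above (`--input`, `--output`, `--no_filter`) are used, and the `txt-from-pdf` executable is assumed to be on `PATH`.

```python
import subprocess
from typing import List, Optional


def build_command(
    input_path: str,
    output: Optional[str] = None,
    no_filter: bool = False,
) -> List[str]:
    """Build the argument list for the txt-from-pdf CLI (hypothetical helper)."""
    cmd = ["txt-from-pdf", "--input", input_path]
    if output is not None:
        # Omitting --output leaves the CLI on its documented default location.
        cmd += ["--output", output]
    if no_filter:
        # Documented flag that turns off cleaning of the resulting text files.
        cmd.append("--no_filter")
    return cmd


def run_cli(input_path: str, output: Optional[str] = None, no_filter: bool = False) -> int:
    """Run the CLI and return its exit code; assumes txt-from-pdf is installed."""
    return subprocess.run(build_command(input_path, output, no_filter)).returncode
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.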
