A small package for extracting text from PDF files

Project description

PDF2Text Converter

Chinese documentation

This Python utility, pdf2text.py, converts PDF documents into plain, human-readable text by repairing the line breaks that split words and sentences. It can read other file types too, but it is designed primarily for PDFs.

Features

  • Extracts text content from PDF and other document formats supported by Apache Tika.
  • Corrects words broken by end-of-line hyphenation, with the help of wordninja (e.g., "low- power" -> "low-power", "im- plement" -> "implement").
  • Optionally rejoins sentences that were split by newline characters.
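
The two corrections listed above can be illustrated with a minimal sketch. This is not the package's actual implementation: the function names, the regular expressions, and the tiny KNOWN_WORDS set (a stand-in for a real word model such as the one wordninja provides) are all assumptions made for illustration.

```python
import re

# Hypothetical toy vocabulary; a real tool would consult a word model instead.
KNOWN_WORDS = {"implement", "power", "low"}

# A word fragment, a hyphen, whitespace (including a newline), then another fragment.
_BROKEN_WORD = re.compile(r"(\w+)-\s+(\w+)")


def fix_word_breaks(text, vocab=KNOWN_WORDS):
    """Rejoin words split by a hyphen followed by whitespace."""
    def repl(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        # If the fragments form a known word, the hyphen was only a line-break artifact.
        if joined.lower() in vocab:
            return joined
        # Otherwise treat it as a genuine compound and keep the hyphen.
        return f"{left}-{right}"
    return _BROKEN_WORD.sub(repl, text)


def fix_sentence_breaks(text):
    """Replace single newlines with a space; blank lines (paragraph breaks) are kept."""
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)
```

With this sketch, fix_word_breaks("im- plement") yields "implement", while "low- power" becomes "low-power" because "lowpower" is not in the vocabulary.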

Installation

  1. Install the package and requirements by running:

    pip install git+https://github.com/OnlyAR/pdf2text.git
    

    or, over SSH:

    pip install git+ssh://git@github.com/OnlyAR/pdf2text.git
    
  2. Make sure Java is installed and that the java command is available on your PATH.
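
One way to verify the Java requirement before converting anything is to look for the java executable on PATH. The helper below is a small sketch using only the Python standard library; the function name java_status is hypothetical and not part of this package.

```python
import shutil


def java_status():
    """Return a short message saying whether a java executable is on PATH."""
    java_path = shutil.which("java")  # None when no executable is found
    if java_path is None:
        return "java not found on PATH; install a Java runtime first"
    return f"java found at {java_path}"


print(java_status())
```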

Usage

To convert a file to text, open it in binary mode and pass the file object to the pdf2text function.

from pdf2text import pdf2text

file_path = 'path_to_your_pdf_file.pdf'
with open(file_path, 'rb') as file:
    text_content = pdf2text(file, word_line_break=True, sentence_line_break=False)
    print(text_content)
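
If you want the result on disk rather than printed, a thin wrapper can handle the file opening and writing. The sketch below assumes that pdf2text returns a string, as the print call above suggests; save_as_text is a hypothetical helper, not part of the package, and it takes the converter as an argument so it can be tried without Java installed.

```python
from pathlib import Path


def save_as_text(src_path, convert, **options):
    """Run `convert` on src_path and write the result beside it as a .txt file."""
    src = Path(src_path)
    # The converter receives a binary file object, as in the usage example.
    with src.open("rb") as fh:
        text = convert(fh, **options)
    out_path = src.with_suffix(".txt")
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

A call might look like save_as_text('path_to_your_pdf_file.pdf', pdf2text, word_line_break=True, sentence_line_break=True).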

For further details and options, please refer to the Chinese documentation.
