A small package to extract text from PDF files
Project description
PDF2Text Converter
This Python utility, pdf2text.py, converts PDF documents into plain, human-readable text by repairing line breaks inside words and sentences. It can read other file types too, but it is designed primarily for PDFs.
Features
- Extracts text content from PDF and other document formats supported by Apache Tika.
- Corrects word breaks caused by hyphenation, with the help of wordninja (e.g., "low- power" -> "low-power", "im- plement" -> "implement").
- Optionally corrects sentence breaks that occur due to newline characters.
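The two correction steps listed above can be illustrated with a minimal sketch. It is not the package's actual implementation: the package relies on wordninja, whereas this example substitutes a tiny hypothetical vocabulary (`KNOWN_WORDS`), and the function names `fix_word_breaks` and `fix_sentence_breaks` are assumptions made for illustration only.

```python
import re

# Hypothetical stand-in for a real word list; the package itself uses
# wordninja, which this sketch does not call.
KNOWN_WORDS = {"implement", "power", "low"}


def fix_word_breaks(text, vocabulary=KNOWN_WORDS):
    """Repair 'im- plement' style breaks left by end-of-line hyphenation."""
    def repair(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        # If the two halves form a known word, the hyphen was only a
        # line-break artifact, so drop it.
        if joined.lower() in vocabulary:
            return joined
        # Otherwise assume a genuine compound and keep the hyphen.
        return left + "-" + right

    return re.sub(r"(\w+)-\s+(\w+)", repair, text)


def fix_sentence_breaks(text):
    """Replace single newlines that interrupt a sentence with a space.

    Newlines that follow sentence-ending punctuation, and blank-line
    paragraph breaks, are left untouched.
    """
    return re.sub(r"(?<![.!?:\n])\n(?!\n)", " ", text)


print(fix_word_breaks("im- plement a low- power design"))
print(fix_sentence_breaks("This is a\nbroken sentence.\nNext one."))
```

A real word list matters here: with a vocabulary this small, any unknown split word keeps its hyphen, which is the safer failure mode for compounds such as "low-power".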
Installation
- Install the package and its requirements by running:
  pip install git+https://github.com/OnlyAR/pdf2text.git
  or use SSH:
  pip install git+ssh://git@github.com/OnlyAR/pdf2text.git
- Make sure a Java environment is installed and the path is configured correctly so that the java command can be executed.
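Whether the java command is reachable can be checked from Python before converting anything. The sketch below uses only the standard library; the helper name `java_available` is hypothetical and is not part of the package.

```python
import shutil


def java_available(command="java"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(command) is not None


if not java_available():
    print("Java was not found on PATH; install a Java runtime first.")
```

This only confirms that an executable named java is on PATH. It does not check the Java version.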
Usage
To convert a file to text, use the pdf2text function.
from pdf2text import pdf2text

file_path = 'path_to_your_pdf_file.pdf'

# Open the PDF in binary mode and pass the file object to pdf2text.
with open(file_path, 'rb') as file:
    text_content = pdf2text(file, word_line_break=True, sentence_line_break=False)

print(text_content)
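For more than one document, the same call can be wrapped in a small batch helper. The following is a sketch under stated assumptions: `convert_folder` is a hypothetical helper, not part of the package, and it assumes the converter returns a string, as the print call above suggests. The converter is passed in as a callable so the helper does not depend on any particular signature.

```python
from pathlib import Path


def convert_folder(folder, convert, pattern="*.pdf"):
    """Run `convert` on every matching file and write a .txt file next to it.

    `convert` is any callable that takes a binary file object and returns
    the extracted text as a string.
    """
    written = []
    for pdf_path in sorted(Path(folder).glob(pattern)):
        with open(pdf_path, "rb") as file:
            text = convert(file)
        txt_path = pdf_path.with_suffix(".txt")
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written
```

With the package installed, a call might look like convert_folder('papers', lambda f: pdf2text(f, word_line_break=True, sentence_line_break=False)), where 'papers' is a placeholder directory name.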
For further details and options, please refer to the Chinese documentation.
Download files
File details
Details for the file simple-pdf2text-0.0.1.tar.gz.
File metadata
- Download URL: simple-pdf2text-0.0.1.tar.gz
- Upload date:
- Size: 3.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3ba3587a3d9b9da28909656238672ac539ad951dc66d9f602f2d6890675fc861
MD5 | cfa6841c0689942d7eba5b7b4ae34dec
BLAKE2b-256 | 4bb6542f1f2b6fa3d15d177b40f01cfd378cd8482f95cce7b2779d871490b14b
File details
Details for the file simple_pdf2text-0.0.1-py3-none-any.whl.
File metadata
- Download URL: simple_pdf2text-0.0.1-py3-none-any.whl
- Upload date:
- Size: 4.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3d464d193d9aac45b2d4a763862f130494cb81b7105d335a232cdb8efe8d141f
MD5 | e9c0d252bf0b59cf421dfcb7275d3924
BLAKE2b-256 | 7500c0cb593bdbd531444f76e8a72414f5179c61f3c1b8d2892098503f9ca8b9