PDF parser and analyzer
pdfminer.rtl
This is a fork of pdfminer.six that attempts to add right-to-left (RTL) text support using python-bidi. This version is experimental and probably buggy. Please don't rely on it for critical projects.
Check out the full original documentation on Read the Docs.
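Since RTL handling is the point of this fork, it can help to check whether a piece of extracted text contains RTL characters at all. The following is a minimal sketch using only the Python standard library. The helper names `has_rtl` and `rtl_ratio` are hypothetical and are not part of pdfminer.rtl or python-bidi, and this is not how the fork itself processes text.

```python
import unicodedata

# Unicode bidirectional classes for strongly right-to-left characters:
# "R" covers scripts such as Hebrew, "AL" covers Arabic letters.
RTL_BIDI_CLASSES = {"R", "AL"}


def has_rtl(text: str) -> bool:
    """Return True if any character in `text` is strongly right-to-left."""
    return any(unicodedata.bidirectional(ch) in RTL_BIDI_CLASSES for ch in text)


def rtl_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are strongly RTL."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    rtl_count = sum(
        1 for ch in letters if unicodedata.bidirectional(ch) in RTL_BIDI_CLASSES
    )
    return rtl_count / len(letters)
```

A check like this could be used to decide whether a document is worth running through the RTL-aware fork rather than plain pdfminer.six.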
Features
- RTL support (added in this fork).
- Written entirely in Python.
- Parse, analyze, and convert PDF documents.
- Extract content as text, images, HTML or hOCR.
- PDF-1.7 specification support (well, almost).
- CJK languages and vertical writing scripts support.
- Various font types (Type1, TrueType, Type3, and CID) support.
- Support for extracting images (JPG, JBIG2, Bitmaps).
- Support for various compressions (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode).
- Support for RC4 and AES encryption.
- Support for AcroForm interactive form extraction.
- Table of contents extraction.
- Tagged contents extraction.
- Automatic layout analysis.
How to use
- Install Python 3.8 or newer.
- Install pdfminer.rtl.
    pip install pdfminer.rtl
- (Optionally) install extra dependencies for extracting images.
    pip install 'pdfminer.rtl[image]'
- Use the command-line interface to extract text from a PDF.
    pdf2txt.py example.pdf
- Or use it with Python.
    from pdfminer.high_level import extract_text
    text = extract_text("example.pdf")
    print(text)
Acknowledgement
This repository includes code from pyHanko; the original license has been included. Thanks also to all the other contributors of the original project.
Hashes for pdfminer.rtl-0.0.2.dev3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 77c244218e7f896b4b966cc5035199afe0e9caf35742f72a20021963930d9526
MD5 | 2fe580018796757d48260a4fcc8290d1
BLAKE2b-256 | b45a4b274f84d424f10656e353e8f57aef7cff93bfb0698c38116a0f4649858e