PDF parser and analyzer
pdfminer.rtl
This is a fork of pdfminer.six that attempts to add RTL support with python-bidi. This version is experimental and probably buggy. Please don't rely on it for critical projects.
Check out the full original documentation on Read the Docs.
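Since RTL handling is the experimental part of this fork, it can help to check whether a piece of extracted text contains right-to-left characters at all before judging the output. The following is a minimal sketch that uses only Python's standard `unicodedata` module. The helper names `rtl_ratio` and `is_mostly_rtl` are hypothetical and are not part of pdfminer.rtl or python-bidi.

```python
import unicodedata

# Bidirectional classes of strongly right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def rtl_ratio(text):
    """Return the fraction of strongly directional characters that are RTL."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    # Keep only strong characters; digits, spaces and punctuation are neutral or weak.
    strong = [cls for cls in classes if cls == "L" or cls in RTL_CLASSES]
    if not strong:
        return 0.0
    return sum(cls in RTL_CLASSES for cls in strong) / len(strong)


def is_mostly_rtl(text, threshold=0.5):
    """Return True when more than `threshold` of the strong characters are RTL."""
    return rtl_ratio(text) > threshold
```

A check like this only tells you which script dominates a string. It does not tell you whether the characters came out in the correct reading order.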
Features
- (Added RTL support)
- Written entirely in Python.
- Parse, analyze, and convert PDF documents.
- Extract content as text, images, HTML or hOCR.
- PDF-1.7 specification support (well, almost).
- CJK languages and vertical writing scripts support.
- Various font types (Type1, TrueType, Type3, and CID) support.
- Support for extracting images (JPG, JBIG2, Bitmaps).
- Support for various compressions (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode).
- Support for RC4 and AES encryption.
- Support for AcroForm interactive form extraction.
- Table of contents extraction.
- Tagged contents extraction.
- Automatic layout analysis.
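As an illustration of the layout analysis feature, the sketch below walks the layout tree and collects each text block with its bounding box. It assumes this fork keeps the `extract_pages` function and the `LTTextContainer` class of the original pdfminer.six under the same `pdfminer` namespace. The helper names `iter_text_boxes` and `top_to_bottom` are hypothetical.

```python
def iter_text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for each text container in the PDF."""
    # Imported lazily so this module loads even when pdfminer is not installed.
    # Assumption: the fork exposes the same API as pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in PDF points.
                yield page_number, element.bbox, element.get_text()


def top_to_bottom(boxes):
    """Sort (page_number, bbox, text) tuples by page, then from the top down."""
    # PDF coordinates start at the bottom-left corner, so a larger y1
    # means the box sits higher on the page.
    return sorted(boxes, key=lambda box: (box[0], -box[1][3], box[1][0]))
```

For example, `top_to_bottom(iter_text_boxes("example.pdf"))` would return the text blocks of `example.pdf` in a simple top-down order. Ties are broken left to right, which may not be the order you want for RTL pages.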
How to use
- Install Python 3.8 or newer.
- Install pdfminer.rtl.
    pip install pdfminer.rtl
- (Optionally) install extra dependencies for extracting images.
    pip install 'pdfminer.rtl[image]'
- Use the command-line interface to extract text from a PDF.
    pdf2txt.py example.pdf
- Or use it with Python.
    from pdfminer.high_level import extract_text
    text = extract_text("example.pdf")
    print(text)
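Building on the Python usage above, here is a small sketch that extracts selected pages and writes the result to a UTF-8 text file. It assumes `extract_text` accepts the `page_numbers` argument (zero-based page indices) as it does in the original pdfminer.six. The helper names `default_output_path` and `extract_to_file` are hypothetical.

```python
from pathlib import Path


def default_output_path(pdf_path):
    """Return the PDF path with its suffix replaced by .txt."""
    return Path(pdf_path).with_suffix(".txt")


def extract_to_file(pdf_path, out_path=None, page_numbers=None):
    """Extract text from pdf_path, write it to out_path, and return that path."""
    # Imported lazily so this module loads even when pdfminer is not installed.
    from pdfminer.high_level import extract_text

    # page_numbers is a collection of zero-based page indices, or None for all pages.
    text = extract_text(pdf_path, page_numbers=page_numbers)
    target = Path(out_path) if out_path is not None else default_output_path(pdf_path)
    # Write explicitly as UTF-8 so RTL and CJK text is stored consistently.
    target.write_text(text, encoding="utf-8")
    return target
```

Calling `extract_to_file("example.pdf", page_numbers=[0])` would write the text of the first page to `example.txt`.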
Acknowledgement
This repository includes code from pyHanko; its original license has been included. Credit also goes to all the other contributors of the original project.
Hashes for pdfminer.rtl-1.0.1.dev17-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 8818a7ab398170096b6fc1b5c1c65b2b7fd3267c1fc9e6591b453886d94a721b
MD5 | 1b8a124bd07c576be3d1f6cac0e84e41
BLAKE2b-256 | d6868e3ddbd61db7d9d51259ec47a6c4000ee8b6113d2c2bd14eecd0a0afe311