Powerful and Pythonic PDF processing library based on xpdf-4.02
0.1.1 (2020-05-10)
- FIX: bug where the default Config.text_encoding value, i.e. UTF-8, does not persist after Config.reset() and changes to Latin1
- pdftotext: remove all parameters that change global Config properties
pyxpdf
Fast Python PDF parser module based on xpdf-reader sources.
Quickstart
from pyxpdf import Document, Page, Config
from pyxpdf.xpdf import TextControl
doc = Document("samples/nonfree/mandarin.pdf")
# or
# load pdf from file like object
with open("samples/nonfree/mandarin.pdf", 'rb') as fp:
    doc = Document(fp)
# get pdf metadata dict
print(doc.info())
# >>> doc.info()
# {'CreationDate': "D:20080721141207-04'00'",
# 'Subject': 'Chinese Version of Universal PCXR8 ...',
# 'Author': 'SKC Inc.',
# 'Creator': 'PScript5.dll
# .....
# get all text
all_text = doc.text()
# iter first 10 pages
for page in doc[:10]:
    # get page label if any
    print(page.label)
# get page by page label
label_page = doc['1']
# get text in table layout without discarding clipped
# text.
text_control = TextControl("table", clip_text=True)
text = label_page.text(control=text_control)
# find case sensitive text within [x_min, y_min, x_max, y_max]
res_box = label_page.find_text('操作说明', search_box=[0, 0, 400, 400],
                               case_sensitive=True)
# >>> print(res_box)
# (281.88, 269.718, 354.05819999999994, 287.7)
# load xpdfrc
Config.load_file('my_xpdfrc')
# suppress stderr output for the xpdf error log
Config.error_quiet = True
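The Quickstart iterates pages and reads page.label and page.text(). As a minimal sketch of how those two members could be combined, the helper below maps each page label to that page's text. The names text_by_label and StubPage are hypothetical and not part of pyxpdf; the stub only stands in for a real page so the snippet runs without a PDF. It assumes that page.text() accepts control=None and that a missing label is falsy, neither of which is confirmed above.

```python
class StubPage:
    """Hypothetical stand-in exposing only .label and .text(), as used in the Quickstart."""

    def __init__(self, label, content):
        self.label = label
        self._content = content

    def text(self, control=None):
        # A real page would extract text here; the stub returns fixed content.
        return self._content


def text_by_label(pages, control=None):
    """Map each page label to the page text; unlabeled pages get a '#<position>' key."""
    result = {}
    for position, page in enumerate(pages, start=1):
        # Assumption: a page without a label reports a falsy value (None or '').
        key = page.label if page.label else "#{}".format(position)
        result[key] = page.text(control=control)
    return result


pages = [StubPage("i", "Preface"), StubPage(None, "Untitled"), StubPage("1", "Chapter one")]
texts = text_by_label(pages)
print(texts)
```

With pyxpdf installed, the same helper could presumably be called as text_by_label(doc[:10]) or with a TextControl passed as control.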
pdftotext
If you are familiar with the pdftotext binary, this is its Python port, running at almost the speed of the native binary.
from pyxpdf import pdftotext
file = "sample.pdf"
# Get text from first two pages of pdf
pdf_text = pdftotext(file, start=1, end=2, layout="table",
                     userpass="1234", ownerpass="1234",
                     cfg_file="~/.xpdfrc")
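If text is needed page by page rather than as one string, the start and end parameters shown above can be set to the same page number. The sketch below is one way to do that; extract_each_page and fake_pdftotext are hypothetical names, and it assumes that pdftotext can be called with only the file, start and end (the other keyword arguments being optional) and that page numbers are 1-based and inclusive, as the example above suggests. The fake extractor lets the snippet run without pyxpdf or a PDF file.

```python
def extract_each_page(path, first, last, extract=None, **options):
    """Return {page_number: text} by calling the extractor once per page."""
    if extract is None:
        # Only imported when no extractor is injected; requires pyxpdf to be installed.
        from pyxpdf import pdftotext as extract
    return {
        number: extract(path, start=number, end=number, **options)
        for number in range(first, last + 1)
    }


def fake_pdftotext(path, start, end, **options):
    """Hypothetical stand-in that reports which page range was requested."""
    return "{}:{}-{}".format(path, start, end)


pages = extract_each_page("sample.pdf", 1, 2, extract=fake_pdftotext, layout="table")
print(pages)
```

Calling the extractor once per page is simpler to reason about than splitting a combined string, at the cost of reopening the file for every page.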
Note:-
- pdftotext returns a Unicode string, so if your PDF contains characters outside of UTF-8 they will be ignored [decode('utf-8', errors='ignore')].
- If you are working with a different encoding, you can use pdftotext_raw, which has the same function signature but returns a bytes object. You can then decode it yourself, but make sure to set Config.text_encoding to your encoding so that xpdf can extract the text properly. Currently only the 'UTF-8', 'Latin1', 'ASCII7', 'Symbol', 'ZapfDingbats' and 'UCS-2' encodings are predefined. To add additional encodings, you can provide Unicode CMaps for your encoding through xpdfrc.
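The effect of decode('utf-8', errors='ignore') can be seen with plain Python bytes, without pyxpdf. In this small illustration, the Latin-1 bytes are only a stand-in for what pdftotext_raw might return when Config.text_encoding is set to 'Latin1'; decode_extracted is a hypothetical helper name.

```python
def decode_extracted(raw, encoding="utf-8"):
    """Decode extracted bytes, silently dropping anything the codec cannot decode."""
    return raw.decode(encoding, errors="ignore")


# Stand-in for raw output in Latin-1: the 'é' is the single byte 0xE9.
latin1_bytes = "café".encode("latin-1")

lossy = decode_extracted(latin1_bytes)             # wrong codec: the 0xE9 byte is dropped
exact = decode_extracted(latin1_bytes, "latin-1")  # matching codec: nothing is lost

print(repr(lossy), repr(exact))
```

Decoding with a codec that matches the configured text encoding avoids the silent loss that errors='ignore' would otherwise hide.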
Install
pip install pyxpdf
Note (Windows):-
To build this on Windows you will need the Visual C++ compiler, which you can get by installing Visual Studio Build Tools.
Build Instructions
Requirements:-
- (CPython) Python 3.4+
- A recent enough C/C++ build environment
First clone the pyxpdf git repository:
$ git clone https://github.com/ashutoshvarma/pyxpdf.git
$ cd pyxpdf
Optionally create a virtualenv (recommended):
$ python -m venv <directory>
$ source <directory>/bin/activate
Then install the dependencies:
$ pip install -r test_requirements.txt
Build wheel
$ pip install wheel
$ python setup.py bdist_wheel --with-cython
Install wheel package
$ pip install dist/*.whl
Now you can run the tests
$ python runtests.py -v
License
pyxpdf is licensed under the GNU General Public License (GPL), version 3. See the LICENSE file.
It uses the following third-party sources:
- Xpdf Reader [https://www.xpdfreader.com/] by Derek Noonburg