pypdf
pypdf-lib
A (maybe) better PDF parser for Python, focused on text extraction. Work in progress.
This library is a Python wrapper around PdfAct, which is written in Java.
Pre-requisites
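Java (a Java runtime must be installed, since PdfAct runs on it)
# Linux (Debian/Ubuntu)
# Alternatives to default-jre: openjdk-8-jre-headless, openjdk-11-jdk, openjdk-11-jre-headless
apt-get update && apt-get install -y default-jre
# Mac
brew install java
# Windows
# Not documented yet.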
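Because the library depends on a Java runtime, it can be useful to confirm that `java` is reachable before running an extraction. The following is a minimal sketch using only the Python standard library. The helper names `find_java` and `java_version_banner` are hypothetical and are not part of pypdf-lib.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable found on PATH, or None."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the first line printed by `java -version`, or None on failure.

    The JVM writes its version banner to stderr, so stderr is checked first.
    A nonzero exit code or a missing executable is treated as "no working runtime".
    """
    try:
        result = subprocess.run(
            [java_path, "-version"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None
    if result.returncode != 0:
        return None
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    java_path = find_java()
    banner = java_version_banner(java_path) if java_path else None
    if banner is None:
        print("No working Java runtime found; install a JRE first.")
    else:
        print(f"Found Java at {java_path}: {banner}")
```

Running this once before the first extraction gives a clearer error than a failure inside the wrapper.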
Installation
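# The leading "!" runs a shell command from a notebook cell; drop it when using a terminal.
# Latest version from GitHub:
!pip install --upgrade git+https://github.com/trisongz/pypdf-lib.git
# Or the released version from PyPI:
!pip install --upgrade pypdf-lib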
Usage
from pypdf import PyPDF
from fileio import File

base_dir = '/content/output'
File.mkdirs(base_dir)

# A remap function must return a path whose extension matches the selected output format,
# i.e. if 'txt' is selected, a path ending in .txt should be returned.
def remap_fnames(fname):
    # Tidy the base filename and swap the .pdf extension for .json.
    fname = File.base(fname).replace('- ', '').replace(' ', '_').strip().replace('.pdf', '.json')
    return File.join(base_dir, fname)

converter = PyPDF(input_dir='/content/inputs', output_dir='/content/output', units=['paragraphs', 'blocks'], visualize=True)

# remap_funct is optional.
for res in converter.extract(remap_funct=remap_fnames):
    print(res)
# > /content/output/your_json_file_1.json
converter.extracted
'''
{'/content/inputs/input_1.pdf': '/content/output/your_json_file_1.json',
 '/content/inputs/input_2.pdf': '/content/output/your_json_file_2.json',
 '/content/inputs/input_3.pdf': '/content/output/your_json_file_3.json',
 'params': {'exclude_roles': None,
            'format': 'json',
            'include_roles': ['title',
                              'body',
                              'appendix',
                              'keywords',
                              'heading',
                              'general_terms',
                              'toc',
                              'caption',
                              'table',
                              'other',
                              'categories',
                              'keywords',
                              'page_header'],
            'units': ['paragraphs', 'blocks'],
            'visualize': True,
            'with_control_characters': False}}
'''
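The `extracted` mapping shown above mixes input-to-output file paths with a single `'params'` entry, so code that iterates over it usually needs to separate the two. Here is a small sketch of one way to do that, assuming the mapping has exactly the shape printed in the example. `split_extracted` and `load_json_outputs` are hypothetical helpers and not part of pypdf-lib. No assumption is made about the contents of the JSON files; they are simply parsed.

```python
import json
from pathlib import Path


def split_extracted(extracted):
    """Separate the PDF -> output-path entries from the 'params' settings entry."""
    files = {src: dst for src, dst in extracted.items() if src != "params"}
    params = extracted.get("params", {})
    return files, params


def load_json_outputs(files):
    """Parse every .json output that exists on disk; other paths are skipped."""
    loaded = {}
    for src, dst in files.items():
        path = Path(dst)
        if path.suffix == ".json" and path.is_file():
            with path.open(encoding="utf-8") as fh:
                loaded[src] = json.load(fh)
    return loaded


if __name__ == "__main__":
    # Stand-in for converter.extracted, shaped like the example output.
    example = {
        "/content/inputs/input_1.pdf": "/content/output/your_json_file_1.json",
        "params": {"format": "json", "units": ["paragraphs", "blocks"]},
    }
    files, params = split_extracted(example)
    print(f"{len(files)} file(s), format={params.get('format')}")
    print(f"{len(load_json_outputs(files))} output(s) found on disk")
```

In a real run, `converter.extracted` would be passed in place of `example`.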