pypdf
pypdf-lib
A (maybe) better PDF parser for Python, focused on text extraction. Work in progress.
This library is a Python wrapper around PdfAct, which is written in Java, so a Java runtime must be installed.
Pre-requisites
Java
# Linux
apt-get update && apt-get install -y default-jre # alternatives: openjdk-8-jre-headless / openjdk-11-jdk / openjdk-11-jre-headless
# Mac
brew install java
# Windows
# Not documented yet.
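Since the wrapper depends on Java, it can be useful to confirm that a `java` executable is reachable before installing. The snippet below is a minimal sketch using only the Python standard library; `find_java` and `java_version` are hypothetical helper names and are not part of pypdf-lib. How pypdf-lib itself locates Java is not documented here, so this only checks the `PATH`.

```python
import shutil
import subprocess
from typing import Optional


def find_java() -> Optional[str]:
    """Return the full path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


def java_version(java_path: str) -> str:
    """Return the version report printed by `java -version`."""
    # `java -version` writes its report to stderr rather than stdout.
    proc = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True, check=False
    )
    return (proc.stderr or proc.stdout).strip()


java = find_java()
if java is None:
    print("No `java` executable found on PATH; install a Java runtime first.")
else:
    print(java_version(java))
```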
Installation
# The leading "!" runs the command from a notebook cell; omit it in a regular shell.
!pip install --upgrade git+https://github.com/trisongz/pypdf-lib.git
Usage
from pypdf import PyPDF
from fileio import File
base_dir = '/content/output'
File.mkdirs(base_dir)
# A remap function must return a path whose extension matches the selected output format,
# e.g. if 'txt' is selected, the returned path should end in '.txt'.
def remap_fnames(fname):
    fname = File.base(fname).replace('- ', '').replace(' ', '_').strip().replace('.pdf', '.json')
    return File.join(base_dir, fname)
converter = PyPDF(input_dir='/content/inputs', output_dir='/content/output', units=['paragraphs', 'blocks'], visualize=True)
# remap_funct is optional.
for res in converter.extract(remap_funct=remap_fnames):
    print(res)
# > /content/output/your_json_file_1.json
converter.extracted
'''
{'/content/inputs/input_1.pdf': '/content/output/your_json_file_1.json',
 '/content/inputs/input_2.pdf': '/content/output/your_json_file_2.json',
 '/content/inputs/input_3.pdf': '/content/output/your_json_file_3.json',
 'params': {'exclude_roles': None,
            'format': 'json',
            'include_roles': ['title',
                              'body',
                              'appendix',
                              'keywords',
                              'heading',
                              'general_terms',
                              'toc',
                              'caption',
                              'table',
                              'other',
                              'categories',
                              'keywords',
                              'page_header'],
            'units': ['paragraphs', 'blocks'],
            'visualize': True,
            'with_control_characters': False}}
'''
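The `extracted` mapping shown above mixes per-file results with a `'params'` entry, so it may help to separate the two before post-processing. The following is only a sketch: `split_extracted` and `remap_fname` are hypothetical helpers, not part of pypdf-lib, and the code runs on a hand-written dictionary shaped like the example output rather than on a real `converter.extracted`. `remap_fname` mirrors the replacements in `remap_fnames` without needing `fileio`, assuming POSIX-style paths.

```python
from pathlib import PurePosixPath
from typing import Any, Dict, Tuple


def remap_fname(fname: str, base_dir: str = "/content/output", ext: str = ".json") -> str:
    """Map an input PDF path to a cleaned-up output path ending in `ext`."""
    stem = PurePosixPath(fname).stem  # file name without directory or '.pdf'
    stem = stem.replace("- ", "").replace(" ", "_")
    return str(PurePosixPath(base_dir) / f"{stem}{ext}")


def split_extracted(extracted: Dict[str, Any]) -> Tuple[Dict[str, str], Dict[str, Any]]:
    """Separate the input -> output path pairs from the 'params' entry."""
    files = {src: dst for src, dst in extracted.items() if src != "params"}
    params = extracted.get("params", {})
    return files, params


# Hand-written sample in the same shape as the example output above.
sample = {
    "/content/inputs/input_1.pdf": "/content/output/your_json_file_1.json",
    "/content/inputs/input_2.pdf": "/content/output/your_json_file_2.json",
    "params": {"format": "json", "units": ["paragraphs", "blocks"]},
}

files, params = split_extracted(sample)
for src, dst in files.items():
    print(f"{src} -> {dst}")
print(params["units"])
print(remap_fname("/content/inputs/My - Paper 1.pdf"))
# prints /content/output/My_Paper_1.json
```

Keeping the extension as a parameter makes it easier to stay consistent with the selected output format, as the comment on the remap function requires.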