pypdf
pypdf-lib
A (maybe) better PDF parser for Python, focused on text extraction. Work in progress.
This library is a Python wrapper around PdfAct, which is written in Java.
Pre-requisites
Java
# Linux
apt-get update && apt-get install -y default-jre # openjdk-8-jre-headless / openjdk-11-jdk / openjdk-11-jre-headless
# Mac
brew install java
# Windows
# No instructions yet.
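Since the wrapper relies on a Java runtime, it can help to confirm that `java` is reachable before running an extraction. The following is a minimal sketch using only the Python standard library; the helper name `java_version` is hypothetical and not part of pypdf-lib.

```python
import shutil
import subprocess


def java_version():
    """Return the first line of `java -version` output, or None if Java is not on PATH.

    Hypothetical helper for illustration; not part of pypdf-lib.
    """
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` usually writes to stderr, so merge stderr into stdout.
    result = subprocess.run(
        [java, "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = java_version()
    print(version or "Java not found on PATH; install a JRE first.")
```

If this prints the "not found" message, install a JRE using one of the commands above and try again.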
Installation
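# In a notebook (e.g. Colab), keep the leading "!"; in a regular shell, drop it.
# Latest version from GitHub:
!pip install --upgrade git+https://github.com/trisongz/pypdf-lib.git
# Or the released version from PyPI:
!pip install --upgrade pypdf-lib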
Usage
from pypdf import PyPDF
from fileio import File
base_dir = '/content/output'
File.mkdirs(base_dir)
# A remap function must return a path whose extension matches the selected format,
# e.g. if 'txt' is selected, the returned path should end in .txt.
def remap_fnames(fname):
    fname = File.base(fname).replace('- ', '').replace(' ', '_').strip().replace('.pdf', '.json')
    return File.join(base_dir, fname)
converter = PyPDF(input_dir='/content/inputs', output_dir='/content/output', units=['paragraphs', 'blocks'], visualize=True)
# remap_funct is optional.
for res in converter.extract(remap_funct=remap_fnames):
    print(res)
# > /content/output/your_json_file_1.json
converter.extracted
'''
{'/content/inputs/input_1.pdf': '/content/output/your_json_file_1.json',
 '/content/inputs/input_2.pdf': '/content/output/your_json_file_2.json',
 '/content/inputs/input_3.pdf': '/content/output/your_json_file_3.json',
 'params': {'exclude_roles': None,
            'format': 'json',
            'include_roles': ['title',
                              'body',
                              'appendix',
                              'keywords',
                              'heading',
                              'general_terms',
                              'toc',
                              'caption',
                              'table',
                              'other',
                              'categories',
                              'keywords',
                              'page_header'],
            'units': ['paragraphs', 'blocks'],
            'visualize': True,
            'with_control_characters': False}}
'''
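The usage above relies on two details: a remap function must return a path with the right extension, and `converter.extracted` mixes the input-to-output mapping with a `'params'` entry. Below is a rough sketch of how both could be handled with the standard library alone, without `fileio`. The names `EXTENSIONS`, `make_remap`, and `split_extracted` are hypothetical helpers, not part of pypdf-lib, and the extension table only covers the two formats mentioned above.

```python
from pathlib import Path

# Hypothetical format-to-extension table; only 'json' and 'txt' are
# mentioned in this README, so other formats are not covered here.
EXTENSIONS = {"json": ".json", "txt": ".txt"}


def make_remap(output_dir, fmt="json"):
    """Build a remap function whose output extension matches fmt."""
    ext = EXTENSIONS[fmt]

    def remap(fname):
        # Drop the directory and original extension, then tidy the name
        # in the same way as remap_fnames above.
        stem = Path(fname).stem.replace("- ", "").replace(" ", "_").strip()
        return str(Path(output_dir) / (stem + ext))

    return remap


def split_extracted(extracted):
    """Separate the input -> output path mapping from the 'params' entry."""
    files = {k: v for k, v in extracted.items() if k != "params"}
    return files, extracted.get("params", {})
```

Assuming `extract` accepts any callable that takes a filename, as in the example above, the result of `make_remap('/content/output', 'json')` could be passed as `remap_funct`, and `split_extracted(converter.extracted)` would return the file mapping and the parameters separately.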