pypdf-lib

A (maybe) better PDF parser for Python, focused on text extraction. Work in progress.

This library is a Python wrapper around PdfAct, which is written in Java.

Pre-requisites

  • Java
# Linux (Debian/Ubuntu)
# Alternative packages: openjdk-8-jre-headless, openjdk-11-jdk, openjdk-11-jre-headless
apt-get update && apt-get install -y default-jre

# macOS
brew install java

# Windows
# Not yet documented.
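
Because the wrapper depends on a Java runtime, it can help to confirm that Java is reachable before running an extraction. The following is a minimal sketch using only the Python standard library. It assumes the runtime is located through the `java` executable on the PATH, which this README does not state; `java_available` and `java_version` are hypothetical helper names and are not part of pypdf-lib.

```python
import shutil
import subprocess


def java_available(executable: str = "java") -> bool:
    """Return True if `executable` can be found on the PATH."""
    return shutil.which(executable) is not None


def java_version(executable: str = "java") -> str:
    """Return the first line of `java -version`, or '' if Java is not found."""
    path = shutil.which(executable)
    if path is None:
        return ""
    # `java -version` writes its banner to stderr rather than stdout.
    proc = subprocess.run([path, "-version"], capture_output=True, text=True)
    banner = (proc.stderr or proc.stdout).strip()
    return banner.splitlines()[0] if banner else ""


if java_available():
    print(java_version())
else:
    # Assumption: pypdf-lib needs `java` on the PATH to launch PdfAct.
    print("Java not found on PATH; install a JRE first.")
```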

Installation

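# Notebook syntax; drop the leading "!" in a regular shell.
# Latest development version from GitHub:
!pip install --upgrade git+https://github.com/trisongz/pypdf-lib.git

# Or the released version from PyPI:
!pip install --upgrade pypdf-lib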

Usage

from pypdf import PyPDF
from fileio import File

base_dir = '/content/output'
File.mkdirs(base_dir)

# A remap function must return a path whose extension matches the selected
# output format: if 'txt' is selected, it should return a '.txt' path.

def remap_fnames(fname):
    fname = File.base(fname).replace('- ', '').replace(' ', '_').strip().replace('.pdf', '.json')
    return File.join(base_dir, fname)

converter = PyPDF(input_dir='/content/inputs', output_dir='/content/output', units=['paragraphs', 'blocks'], visualize=True)

# remap_funct is optional. Each iteration yields the path of a written output file.
for res in converter.extract(remap_funct=remap_fnames):
    print(res)
    # > /content/output/your_json_file_1.json

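# After extraction, converter.extracted maps each input path to its output path
# and stores the parameters used for the run under the 'params' key.
converter.extracted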
'''
{'/content/inputs/input_1.pdf': '/content/output/your_json_file_1.json',
 '/content/inputs/input_2.pdf': '/content/output/your_json_file_2.json',
 '/content/inputs/input_3.pdf': '/content/output/your_json_file_3.json',
 'params': {'exclude_roles': None,
  'format': 'json',
  'include_roles': ['title',
   'body',
   'appendix',
   'keywords',
   'heading',
   'general_terms',
   'toc',
   'caption',
   'table',
   'other',
   'categories',
   'keywords',
   'page_header'],
  'units': ['paragraphs', 'blocks'],
  'visualize': True,
  'with_control_characters': False}}
'''
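
Since the `extracted` mapping shown above mixes input-to-output path entries with a `'params'` entry, it may be convenient to separate the two before iterating over the files. This is a minimal sketch on a plain dictionary shaped like that output; `split_extracted` is a hypothetical helper, not part of pypdf-lib, and the sample data is abbreviated from the example above.

```python
from typing import Dict, Tuple


def split_extracted(extracted: dict) -> Tuple[Dict[str, str], dict]:
    """Split the mapping into (input path -> output path, run parameters)."""
    params = extracted.get("params", {})
    # Every key other than 'params' is assumed to be an input file path.
    files = {src: dst for src, dst in extracted.items() if src != "params"}
    return files, params


# Abbreviated stand-in for converter.extracted.
sample = {
    "/content/inputs/input_1.pdf": "/content/output/your_json_file_1.json",
    "/content/inputs/input_2.pdf": "/content/output/your_json_file_2.json",
    "params": {"format": "json", "units": ["paragraphs", "blocks"]},
}

files, params = split_extracted(sample)
for src, dst in files.items():
    print(src, "->", dst)
print("format:", params["format"])
```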

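The renaming done by `remap_fnames` in the usage example can also be sketched with the standard library alone, which makes it easy to check the result without the `fileio` dependency. This version assumes the same `/content/output` directory as the example; `remap_fname` and its `ext` parameter are hypothetical names introduced here so the extension can follow the chosen output format.

```python
from pathlib import PurePosixPath

# Assumption: same output directory as in the usage example.
BASE_DIR = PurePosixPath("/content/output")


def remap_fname(fname: str, ext: str = ".json") -> str:
    """Map an input PDF path to a cleaned output path under BASE_DIR."""
    # .stem drops both the directory and the final '.pdf' suffix.
    stem = PurePosixPath(fname).stem
    # Same clean-up as remap_fnames: drop '- ', trim, spaces to underscores.
    stem = stem.replace("- ", "").strip().replace(" ", "_")
    return str(BASE_DIR / (stem + ext))


print(remap_fname("/content/inputs/Annual Report - 2020.pdf"))
# /content/output/Annual_Report_2020.json
print(remap_fname("/content/inputs/Annual Report - 2020.pdf", ext=".txt"))
# /content/output/Annual_Report_2020.txt
```

Passing the extension as a parameter keeps the returned path in step with the output format, which is the requirement stated for remap functions in the usage example.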