Spark Pdf

Spark-Pdf is a library for processing documents using Apache Spark.

It includes the following features:

  • Load PDF documents/Images
  • Extract text from PDF documents/Images
  • Extract images from PDF documents
  • OCR Images/PDF documents
  • Run NER on text extracted from PDF documents/Images
  • Visualize NER results
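
This description does not show the library's own API, so the following is only a minimal sketch of the first feature (loading PDF files into a DataFrame). It uses Spark's built-in `binaryFile` data source rather than Spark-Pdf itself. The helper names `looks_like_pdf` and `load_pdfs` are hypothetical and are not part of Spark-Pdf.

```python
from typing import Any

# Every PDF file begins with this header marker.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(content: bytes) -> bool:
    """Return True if the raw bytes start with the PDF header marker."""
    return content.startswith(PDF_MAGIC)


def load_pdfs(spark: Any, path: str) -> Any:
    """Read every *.pdf file under `path` with Spark's binaryFile source.

    `spark` is an existing SparkSession. The resulting DataFrame has the
    columns path, modificationTime, length and content (the raw bytes).
    This is plain Spark, not the Spark-Pdf API.
    """
    return (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", "*.pdf")      # keep only PDF files
        .option("recursiveFileLookup", "true")  # descend into subdirectories
        .load(path)
    )
```

With a running session, `load_pdfs(spark, "/data/docs").select("path", "length").show()` would list the files found. The `content` column can then be passed to whatever text extraction or OCR step follows.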

Installation

Requirements

  • Python 3.10
  • Apache Spark 3.5 or higher
  • Java 8
  • Tesseract 4.0 or higher

Install the package from PyPI:

  pip install pyspark-pdf
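
Before installing, it can help to confirm that the external requirements listed above are present. The sketch below is a hypothetical helper, not part of Spark-Pdf. It only checks that the `java` and `tesseract` executables are on `PATH` and that the interpreter is Python 3.10 or newer. Treating newer Python versions as acceptable is an assumption, and the Java, Tesseract and Spark versions are not checked.

```python
import shutil
import sys


def check_requirements() -> dict[str, bool]:
    """Report which of the listed requirements appear to be available."""
    return {
        # Assumption: Python versions newer than 3.10 are also acceptable.
        "python>=3.10": sys.version_info >= (3, 10),
        # Presence on PATH only; the installed versions are not inspected.
        "java": shutil.which("java") is not None,
        "tesseract": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```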

Development

Setup

  git clone
  cd spark-pdf

Install dependencies

  poetry install

Run tests

  poetry run pytest --cov=sparkpdf --cov-report=html:coverage_report tests/ 

Build package

  poetry build

Build documentation

  poetry run sphinx-build -M html source build

Docker

Build image:

  docker build -t spark-pdf .

Run container:

  docker run --rm -it --entrypoint bash spark-pdf:latest

Release

  poetry version patch
  poetry publish --build
