Spark Pdf

Spark-Pdf is a library for processing documents using Apache Spark.

It includes the following features:

  • Load PDF documents and images
  • Extract text from PDF documents and images
  • Extract images from PDF documents
  • Run OCR on images and PDF documents
  • Run named-entity recognition (NER) on text extracted from PDF documents and images
  • Visualize NER results

Installation

Requirements

  • Python 3.11
  • Apache Spark 3.5 or higher
  • Java 8
  • Tesseract 5.0 or higher

Install package

  pip install spark-pdf
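
The requirements listed above can be checked before installing with a short script. The following is a minimal sketch, not part of Spark-Pdf: the function name `check_requirements` is hypothetical, it assumes the `java` and `tesseract` executables are expected on `PATH`, it treats Python 3.11 as a minimum, and it only checks that the tools and the `pyspark` package are present, not their versions.

```python
import importlib.util
import shutil
import sys


def check_requirements(which=shutil.which, version_info=sys.version_info):
    """Return a dict mapping each requirement to True (found) or False.

    Hypothetical helper, not part of Spark-Pdf. Java, Tesseract and
    PySpark are only checked for presence; versions are not inspected.
    """
    return {
        # Assumption: Python 3.11 is treated as a minimum version.
        "python>=3.11": tuple(version_info[:2]) >= (3, 11),
        # Assumption: both executables are expected on PATH.
        "java": which("java") is not None,
        "tesseract": which("tesseract") is not None,
        # find_spec returns None when the package cannot be imported.
        "pyspark": importlib.util.find_spec("pyspark") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

The `which` and `version_info` parameters exist only so the check can be exercised without changing the machine; calling `check_requirements()` with no arguments inspects the current environment.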

Development

Setup

  git clone
  cd spark-pdf

Install dependencies

  poetry install

Run tests

  poetry run pytest --cov=sparkpdf --cov-report=html:coverage_report tests/

Build package

  poetry build

Build documentation

  poetry run sphinx-build -M html source build

Docker

Build image:

  docker build -t spark-pdf .

Run container:

  docker run --rm -it --entrypoint bash spark-pdf:latest

Release

  poetry version patch
  poetry publish --build
