Spark Pdf

Spark-Pdf is a library for processing documents using Apache Spark.

It includes the following features:

  • Load PDF documents/Images
  • Extract text from PDF documents/Images
  • Extract images from PDF documents
  • OCR Images/PDF documents
  • Run NER on text extracted from PDF documents/Images
  • Visualize NER results
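
As a rough illustration of the first feature, the sketch below loads PDF files into a Spark DataFrame using only Spark's built-in binaryFile data source. It does not use Spark-Pdf's own API, which this page does not show; the names looks_like_pdf and load_pdfs are hypothetical helpers introduced for this example.

```python
PDF_MAGIC = b"%PDF-"  # every PDF file starts with this header


def looks_like_pdf(content):
    """Return True when the raw bytes begin with the PDF header."""
    return content is not None and bytes(content[:5]) == PDF_MAGIC


def load_pdfs(spark, path):
    """Read *.pdf files under `path` with Spark's binaryFile source.

    Hypothetical helper, not part of Spark-Pdf. The resulting DataFrame
    has the columns path, modificationTime, length and content.
    """
    # Imported here so looks_like_pdf stays usable without PySpark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    is_pdf = F.udf(looks_like_pdf, BooleanType())
    files = (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", "*.pdf")       # keep only *.pdf names
        .option("recursiveFileLookup", "true")   # descend into subfolders
        .load(path)
    )
    # Drop files that carry a .pdf extension but are not PDFs.
    return files.where(is_pdf(F.col("content")))
```

With an existing SparkSession named spark, load_pdfs(spark, "/data/docs") would return one row per PDF file; the content column holds the raw bytes that a text extraction or OCR stage would consume.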

Installation

Requirements

  • Python 3.10
  • Apache Spark 3.5 or higher
  • Java 8
  • Tesseract 4.0 or higher

Install from PyPI:

  pip install pyspark-pdf
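
The requirements listed above can be checked with a short script. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes that "Python 3.10" and "Java 8" mean "that version or newer", which the list does not state. Spark is detected through the installed pyspark distribution, so it may be reported as missing until the package and its dependencies are installed.

```python
import re
import shutil
import subprocess
import sys
from importlib import metadata


def parse_version(text):
    """Return the first dotted number in `text` as a tuple of ints, or None."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(0).split("."))


def java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Legacy releases report "1.8.0_292" for Java 8; newer ones report "17.0.9".
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def tool_output(command):
    """Run `command`; return stdout plus stderr, or None if it is not on PATH."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout + result.stderr


def check_environment():
    """Map each requirement to True or False (assumes 'or newer' semantics)."""
    try:
        spark = parse_version(metadata.version("pyspark"))
    except metadata.PackageNotFoundError:
        spark = None
    java = tool_output(["java", "-version"])  # java prints its version to stderr
    major = java_major(java) if java else None
    tess = tool_output(["tesseract", "--version"])
    tess_version = parse_version(tess) if tess else None
    return {
        "python": sys.version_info[:2] >= (3, 10),
        "spark": spark is not None and spark >= (3, 5),
        "java": major is not None and major >= 8,
        "tesseract": tess_version is not None and tess_version >= (4, 0),
    }
```

Calling check_environment() returns a dictionary such as {"python": True, "spark": False, ...}, where a False entry points at a requirement that is missing or too old.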

Development

Setup

  git clone
  cd spark-pdf

Install dependencies

  poetry install

Run tests

  poetry run pytest --cov=sparkpdf --cov-report=html:coverage_report tests/ 

Build package

  poetry build

Build documentation

  poetry run sphinx-build -M html source build

Docker

Build image:

  docker build -t spark-pdf .

Run container:

  docker run --rm -it --entrypoint bash spark-pdf:latest

Release

  poetry version patch
  poetry publish --build
