Spark Pdf
Spark-Pdf is a library for processing documents using Apache Spark.
It includes the following features:
- Load PDF documents/Images
- Extract text from PDF documents/Images
- Extract images from PDF documents
- OCR Images/PDF documents
- Run NER on text extracted from PDF documents/Images
- Visualize NER results
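As a point of reference for the first feature, the sketch below reads PDF files into a DataFrame using Apache Spark's built-in `binaryFile` data source. It does not use Spark-Pdf's own API, which this description does not show. The helper names `binary_file_options` and `load_pdfs` and the `docs/` path are hypothetical.

```python
def binary_file_options(pattern="*.pdf", recursive=True):
    """Build reader options for Spark's built-in binaryFile source."""
    return {
        # Only pick up files whose names match the glob pattern.
        "pathGlobFilter": pattern,
        # Descend into subdirectories when requested.
        "recursiveFileLookup": "true" if recursive else "false",
    }


def load_pdfs(spark, path, pattern="*.pdf", recursive=True):
    """Return a DataFrame with one row per matching file.

    The binaryFile source yields the columns path, modificationTime,
    length, and content (the raw file bytes).
    """
    options = binary_file_options(pattern, recursive)
    return spark.read.format("binaryFile").options(**options).load(path)


# Usage (assumes pyspark is installed and "docs/" contains PDF files):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("pdf-demo").getOrCreate()
#   load_pdfs(spark, "docs/").select("path", "length").show(truncate=False)
```

The `content` column holds the raw bytes of each file, which is the usual starting point for downstream text extraction or OCR steps.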
Installation
Requirements
- Python 3.10
- Apache Spark 3.5 or higher
- Java 8
- Tesseract 4.0 or higher
pip install pyspark-pdf
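The requirements listed above can be checked with a small preflight script. This is a minimal sketch using only the Python standard library; it assumes that "Python 3.10" means 3.10 or newer and that Java and Tesseract are exposed on `PATH` as `java` and `tesseract`. It does not verify the Java, Tesseract, or Spark versions, and the function names are illustrative.

```python
import shutil
import sys


def python_version_ok(version_info=sys.version_info, required=(3, 10)):
    """Return True if the interpreter is at least the required version."""
    return tuple(version_info[:2]) >= tuple(required)


def find_missing_tools(tools=("java", "tesseract"), which=shutil.which):
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    if not python_version_ok():
        print("Python 3.10 or newer is expected (assumption, see above).")
    for tool in find_missing_tools():
        print(f"Required executable not found on PATH: {tool}")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.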
Development
Setup
git clone
cd spark-pdf
Install dependencies
poetry install
Run tests
poetry run pytest --cov=sparkpdf --cov-report=html:coverage_report tests/
Build package
poetry build
Build documentation
poetry run sphinx-build -M html source build
Docker
Build image:
docker build -t spark-pdf .
Run container:
docker run --rm -it --entrypoint bash spark-pdf:latest
Release
poetry version patch
poetry publish --build