Scale2Pdf

A library made at LAION to scale PDF parsing on CPUs. We tested the pipeline on a 44-page PDF on a cheap 2-thread CPU with 12 GB of RAM: it took 3 minutes 22 seconds to parse the PDF and save its content (both structured and bulk) along with its images. The framework provides the following results:

  1. Table extraction
  2. Equation extraction
  3. Image Captions
  4. Page extraction
  5. Keyword extraction
  6. Section extraction
  7. Authors extraction
  8. Bibliography extraction
  9. Paragraph extraction
  10. Image extraction
  11. Abstract extraction

Features

  1. Added support for Ray for scalability

Installation

pip install scale2pdf ray

Then install the Poppler utilities (Debian/Ubuntu):

sudo apt install poppler-utils
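
Since the install step pulls in poppler-utils, it can help to confirm that its command-line tools are on PATH before parsing. The following is a minimal sketch using only the standard library. The helper name `check_poppler` is hypothetical and not part of scale2pdf, and which Poppler tools scale2pdf actually calls is an assumption.

```python
import shutil

# Command-line tools shipped by the poppler-utils package.
# Which of these scale2pdf actually invokes is an assumption.
POPPLER_TOOLS = ("pdftoppm", "pdfimages", "pdftotext", "pdfinfo")


def check_poppler(tools=POPPLER_TOOLS):
    """Return {tool_name: bool}, True when the tool is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


if __name__ == "__main__":
    status = check_poppler()
    missing = [name for name, found in status.items() if not found]
    if missing:
        print("Missing Poppler tools:", ", ".join(missing))
        print("On Debian/Ubuntu: sudo apt install poppler-utils")
    else:
        print("All Poppler tools found.")
```

Running this once after installation gives a quick signal that the system dependency is in place, independent of the Python package.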

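Usage

from scale2pdf import scalablepdf
from scale2pdf import extractimages
# scalable_ray is used below; importing it from the package top level is an assumption
from scale2pdf import scalable_ray

pdf_path = "/content/2408.06257v3.pdf"
# Parse a single PDF; an output folder is created automatically and the results are saved there
scalablepdf(pdf_path, extract_images=True)
# To process a folder of PDFs with Ray
scalable_ray("example_folder", extract_images=True, num_cpus=4)
# Extract the images from a PDF into the given output folder
extractimages("2408.06257v3.pdf", "/path/to/output/folder")
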
Ray caution:

If you don't specify num_cpus, 4 CPU cores are used at a time. You can raise it up to the number of CPU cores available.
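
One way to pick a value for num_cpus is to clamp the request to what the machine reports. This is a minimal sketch, not part of scale2pdf: `pick_num_cpus` is a hypothetical helper, its default of 4 simply mirrors the default described above, and whether scalable_ray clamps the value itself is unknown.

```python
import os


def pick_num_cpus(requested=None, default=4):
    """Clamp a requested worker count to the logical CPUs this machine reports."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    wanted = default if requested is None else requested
    return max(1, min(wanted, available))


if __name__ == "__main__":
    n = pick_num_cpus()
    print(f"Using {n} of {os.cpu_count()} logical CPUs")
    # With scale2pdf installed (call shape taken from the usage example):
    # scalable_ray("example_folder", extract_images=True, num_cpus=n)
```

Note that os.cpu_count() counts logical CPUs, so on a 2-thread machine the sketch never asks for more than 2 workers.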

Speedup depends entirely on the CPU and resources available. I ran it on a cheap CPU and performance was poor, since I had only two threads, akin to 2 cores. (Here "threads" means cores, not hardware threads in the strict computer-hardware sense.)

Low-end CPU (no GPU): 3 min 22 seconds to finish parsing and saving the result to JSON.
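
The reported timing can be turned into a rough per-page figure: 3 min 22 s is 202 s, and 202 s over 44 pages is about 4.59 s per page. The sketch below does that arithmetic and extrapolates to larger jobs. Both function names are hypothetical, and the extrapolation assumes perfectly linear scaling across workers, which real runs will not achieve (I/O, Ray overhead, and uneven PDFs all get in the way).

```python
def seconds_per_page(minutes, seconds, pages):
    """Average wall-clock seconds spent per page."""
    return (minutes * 60 + seconds) / pages


def estimate_seconds(pages, per_page, workers=1):
    """Idealized estimate: assumes perfectly linear scaling across workers."""
    return pages * per_page / workers


# Figures from the benchmark above: 44 pages in 3 min 22 s on a 2-thread CPU.
per_page = seconds_per_page(3, 22, 44)
print(f"{per_page:.2f} s per page")  # prints "4.59 s per page"

# Hypothetical job: 1000 pages split across 4 equally fast workers.
print(f"{estimate_seconds(1000, per_page, workers=4):.0f} s (optimistic lower bound)")
```

Treat the result as an order-of-magnitude guide only; the measurement comes from a single PDF on one machine.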

A Sleeping AI framework made for friends at LAION AI.
