Given a werkzeug.FileStorage, a fastapi.UploadFile or a str file path as input, it converts document and image files (.pdf, .jpg, .png, .tiff) into a list of PIL images or numpy arrays

doc_loader

What is it

doc_loader is a utility package for loading several types of documents as images. It can load them as Pillow images or numpy arrays, from in-memory buffers as well as from file paths.

Main Features

  • General-purpose document loader that accepts .png, .jpg, .jpeg, .pdf, .tiff and .tif files and outputs a list of either PIL.Image objects or numpy arrays
  • Handles password-protected PDFs
  • Applies Exif orientation to .jpg and .png images if present
  • Input: fastapi.UploadFile, werkzeug.FileStorage object or str (file path)
  • Output: list of images as PIL objects or numpy arrays
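
Before handing a path to the loader, it can be convenient to reject unsupported file types early. The following is a minimal sketch that assumes the extension list above is exhaustive; SUPPORTED_EXTENSIONS and is_supported are hypothetical names used for illustration and are not part of doc_loader.

```python
from pathlib import Path

# Extensions copied from the feature list above. This constant is a
# hypothetical helper, not something doc_loader exports.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".pdf", ".tiff", ".tif"}


def is_supported(path: str) -> bool:
    """Return True if the file extension is one the loader is documented to accept."""
    # Compare case-insensitively so "scan.PDF" and "scan.pdf" are treated alike.
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

A check like this only looks at the file name, so it cannot tell whether the file content actually matches its extension.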

Where to get it

The source code is currently hosted on GitHub at: https://github.com/CapgeminiInventIDE/doc_loader

Binary installers for the latest released version are available on the Python Package Index (PyPI):

pip install doc_loader

Dependencies

License

Usage

  • Install the package: pip install doc_loader
  • Wherever your code needs to load documents, use the script below as a reference:
from doc_loader import DocumentLoader, OutputType
from werkzeug.datastructures import FileStorage
from fastapi import UploadFile

path = "/opt/working/src/tests/data/tmp.png"

# Open file using path
page_count, document = DocumentLoader.load(path, max_num_pages=2, output_type=OutputType.NUMPY)
print(page_count, document)

# Open file using UploadFile
# Keyword arguments are used because the positional order of UploadFile's
# parameters differs between FastAPI/Starlette versions.
with open(path, "rb") as fp:
    upload_file = UploadFile(filename=path, file=fp)
    page_count, document = DocumentLoader.load(upload_file, max_num_pages=2, output_type=OutputType.NUMPY)

print(page_count, document)

# Open file using FileStorage
with open(path, "rb") as fp:
    file_storage = FileStorage(fp, filename=path)
    page_count, document = DocumentLoader.load(file_storage, max_num_pages=2, output_type=OutputType.NUMPY)

print(page_count, document)

Optional features

  • extract_text_pdf: extracts the text from a searchable PDF where possible; otherwise it raises an error that can be handled. To use it, install the extra: pip install doc_loader[pdf_text_extract]
from doc_loader import extract_text_pdf
from werkzeug.datastructures import FileStorage
from fastapi import UploadFile

path = "/opt/working/src/tests/data/is-doc-has-cgtext.pdf"

# Open file using path
page_count, document = extract_text_pdf(path, max_num_pages=2)
print(page_count, document)

# Open file using UploadFile
# Keyword arguments are used because the positional order of UploadFile's
# parameters differs between FastAPI/Starlette versions.
with open(path, "rb") as fp:
    upload_file = UploadFile(filename=path, file=fp)
    page_count, document = extract_text_pdf(upload_file, max_num_pages=2)

print(page_count, document)

# Open file using FileStorage
with open(path, "rb") as fp:
    file_storage = FileStorage(fp, filename=path)
    page_count, document = extract_text_pdf(file_storage, max_num_pages=2)

print(page_count, document)
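
Because extract_text_pdf raises an error when text cannot be extracted, callers may want a fallback path. The sketch below relies only on what is stated above: the concrete exception type is not documented here, so it catches Exception broadly, and extract_text_or_none is a hypothetical helper, not part of doc_loader. The extractor is passed in as an argument so the helper can be exercised without the optional extra installed.

```python
from typing import Any, Callable, Optional, Tuple


def extract_text_or_none(
    extractor: Callable[..., Tuple[int, Any]],
    source: Any,
    max_num_pages: int = 2,
) -> Optional[Tuple[int, Any]]:
    """Call extractor(source, max_num_pages=...) and return None if it raises."""
    try:
        return extractor(source, max_num_pages=max_num_pages)
    except Exception as exc:  # assumption: the exact exception type is not documented
        print(f"Text extraction failed, falling back: {exc}")
        return None
```

With the extra installed, this would be called as extract_text_or_none(extract_text_pdf, path). A None result could then trigger a fallback, such as loading the pages as images with DocumentLoader.load.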

Contributing to doc_loader

To contribute to doc_loader, follow these steps:

  1. Fork the repository.
  2. Create a branch in your own fork: git checkout -b <branch_name>
  3. Make your changes and commit them: git commit -m '<commit_message>'
  4. Push to the original branch: git push origin <project_name>/<location>
  5. Create a pull request back to our repository.

About Us

Capgemini Invent combines strategy, technology, data science and creative design to solve the most complex business and technology challenges.

Disruption is not new, but the pace of change is. The fourth industrial revolution is forcing businesses to rethink everything they know.

Leading organizations behave as living entities, constantly adapting to change. With invention at their core, they continuously redesign their business to generate new sources of value. Winning is about fostering inventive thinking to create what comes next.

Invent. Build. Transform.

This is why we have created Capgemini Invent, Capgemini’s new digital innovation, consulting and transformation global business line. Our multi-disciplinary team helps business leaders find new sources of value. We accelerate the process of turning ideas into prototypes and scalable real-world solutions; leveraging the full business and technology expertise of the Capgemini Group to implement at speed and scale.

The result is a coordinated approach to transformation, enabling businesses to create the products, services, customer experiences, and business models of the future.

We're Hiring!

Do you want to be part of the team that builds doc_loader and other great products at Capgemini Invent? If so, you're in luck! Capgemini Invent is currently hiring Data Scientists who love using data to drive their decisions. Take a look at our open positions and see if you're a fit.

Download files

Source Distribution

doc_loader-0.1.3.tar.gz (12.1 kB)

Uploaded Source

Built Distribution

doc_loader-0.1.3-py3-none-any.whl (17.3 kB)

Uploaded Python 3

File details

Details for the file doc_loader-0.1.3.tar.gz.

File metadata

  • Download URL: doc_loader-0.1.3.tar.gz
  • Size: 12.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for doc_loader-0.1.3.tar.gz
Algorithm Hash digest
SHA256 09d5bd2e65ef703d0efb5755c28c63d6a00b48bc89b6d25d94284b7b4f037bc7
MD5 16576770465484b999c8f72f0b1feda7
BLAKE2b-256 5760e6bfc0ac02208a748d02aa212223b49375c69d854674a350866cd296c278

File details

Details for the file doc_loader-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: doc_loader-0.1.3-py3-none-any.whl
  • Size: 17.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for doc_loader-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 070519dc300e9058be0e899d7d9a34213eca570979b6b38b8fbab5afeda75db9
MD5 20a937dbecb0f035c91494ae4e2e1ad0
BLAKE2b-256 2c63215b67449449238579cfcf41dd9ae0ba178c5fd90b58cf5abfe1a40356d5
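
The digests listed above can be checked against a downloaded file before installing it. This is a minimal sketch using only the Python standard library; sha256_of and verify_download are hypothetical helper names, and the local file location in the usage note is an assumption.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_sha256: str) -> bool:
    """Compare a file's SHA256 digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

For the wheel above, assuming it was downloaded into the current directory, the call would be verify_download("doc_loader-0.1.3-py3-none-any.whl", "070519dc300e9058be0e899d7d9a34213eca570979b6b38b8fbab5afeda75db9").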
