A basic document parsing and loading utility - Loaders

Overview

The docp-* project suite is a (doc)ument (p)arsing library. Built in CPython, it consolidates the capabilities of several lower-level libraries into a single interface for parsing binary document structures.

The suite is extended by several sister projects, each providing unique functionality:

Project        Description
docp-core      Centralized core objects, functionality and settings.
docp-parsers   Parse binary documents (e.g. PDF, PPTX) into Python objects.
docp-loaders   Load a parsed document's embeddings into a Chroma vector database, for RAG-enabled LLM use.
docp-docling   Convert a PDF into Markdown format via wrappers to the docling libraries.
docp-dbi       Interfaces to document databases such as ChromaDB and Neo4j (coming soon).

The Toolset (Loaders)

As of this release, loaders for the following binary document types are supported:

  • PDF
  • MS PowerPoint (PPTX)
  • (more coming soon)

Quickstart

Installation

To install docp-loaders, first activate your target virtual environment, then use pip:

pip install docp-loaders

For older releases, visit PyPI or the GitHub Releases page.
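
To confirm that the install landed in the active environment, the standard library can report the installed version. This is a minimal sketch; the helper name `installed_version` is hypothetical and not part of docp-loaders.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# 'docp-loaders' is the distribution name passed to pip above.
print(installed_version('docp-loaders'))
```

A result of `None` suggests that pip installed the package into a different environment than the one running the interpreter.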

Example Usage

For convenience, here are a few examples of how to parse and load the supported document types into a ChromaDB vector database.

Parse and load a single PDF file into a Chroma database collection:

>>> from docp_loaders import ChromaPDFLoader

>>> loader = ChromaPDFLoader(path='/path/to/chroma',
...                          collection='spam')
>>> loader.load(path='/path/to/directory/myfile.pdf')

Parse and load a directory of PDF files into a Chroma database collection:

>>> from docp_loaders import ChromaPDFLoader

>>> loader = ChromaPDFLoader(path='/path/to/chroma',
...                          collection='spam')
>>> loader.load(path='/path/to/directory', ext='pdf')
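
Before loading a whole directory, it can help to preview which files would be picked up. The sketch below is independent of docp-loaders: the helper `preview_files` is hypothetical, and it assumes a non-recursive, case-insensitive extension match. This page does not state how `load` itself selects files, so treat the result as a rough preview only.

```python
from pathlib import Path


def preview_files(directory, ext):
    """List files directly inside `directory` whose suffix matches `ext`."""
    suffix = '.' + ext.lower().lstrip('.')
    return sorted(p for p in Path(directory).iterdir()
                  if p.is_file() and p.suffix.lower() == suffix)
```

For example, `preview_files('/path/to/directory', 'pdf')` returns the matching paths in sorted order, without touching the Chroma database.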

Parse and load a single PDF file into a Chroma database collection offline, using a local embedding model:

>>> from docp_loaders import ChromaPDFLoader

>>> loader = ChromaPDFLoader(path='/path/to/chroma',
...                          collection='spam',
...                          offline=True,
...                          embedding_model_path='/path/to/embedding-model-repo')
>>> loader.load(path='/path/to/directory/myfile.pdf')
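
In offline mode the embedding model has to be present on disk already, so a quick check of the path before constructing the loader can surface a typo early. This is a sketch only: `require_model_dir` is a hypothetical helper, and how the loader itself reacts to a missing path is not documented on this page.

```python
from pathlib import Path


def require_model_dir(model_path):
    """Return the resolved model directory, or raise if it does not exist."""
    path = Path(model_path).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f'Embedding model directory not found: {path}')
    return path.resolve()
```

The returned path can then be converted with `str()` and passed as `embedding_model_path`, as in the example above.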

Parse and load a single PPTX file into a Chroma database collection:

>>> from docp_loaders import ChromaPPTXLoader

>>> loader = ChromaPPTXLoader(path='/path/to/chroma',
...                           collection='spam',
...                           split_text=False)
>>> loader.load(path='/path/to/directory/myfile.pptx')

Parse and load a directory of PPTX files into a Chroma database collection:

>>> from docp_loaders import ChromaPPTXLoader

>>> loader = ChromaPPTXLoader(path='/path/to/chroma',
...                           collection='spam',
...                           split_text=False)
>>> loader.load(path='/path/to/directory', ext='pptx')
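
When a script handles both document types, the loader can be chosen from the file extension. The sketch below is one possible approach, not part of the library: `load_document` and the `loaders` mapping are hypothetical, and the only loader behaviour assumed is the `load(path=...)` call shown in the examples above.

```python
from pathlib import Path


def load_document(path, loaders):
    """Load one file using the loader factory registered for its extension.

    `loaders` maps a lowercase extension (without the dot) to a
    zero-argument callable that returns an object with a `load` method.
    """
    ext = Path(path).suffix.lower().lstrip('.')
    if ext not in loaders:
        raise ValueError(f'No loader registered for {ext!r} files: {path}')
    loader = loaders[ext]()  # Build the loader only when it is needed.
    loader.load(path=path)
    return ext
```

With docp-loaders installed, the mapping might be set up as follows, reusing the constructor arguments from the examples above:

>>> from docp_loaders import ChromaPDFLoader, ChromaPPTXLoader
>>> loaders = {
...     'pdf': lambda: ChromaPDFLoader(path='/path/to/chroma', collection='spam'),
...     'pptx': lambda: ChromaPPTXLoader(path='/path/to/chroma', collection='spam',
...                                      split_text=False),
... }
>>> load_document('/path/to/directory/myfile.pdf', loaders)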

Using the Library

The documentation suite provides detailed explanations and usage examples for each importable module. For in-depth documentation, code examples, and source links, refer to the Library API page.

A search field is available in the left navigation bar to help you quickly locate specific modules or methods.

Troubleshooting

No troubleshooting guidance is available at this time.

For questions not covered here, or to report bugs, issues, or suggestions, please open an issue on GitHub.
