A basic document parsing and loading utility.

In its simplest form, the docp project is a (doc)ument (p)arsing library.

Written for CPython, the project wraps various lower-level libraries, consolidating binary document parsing functionality into a single library. It also includes document loaders, which load a parsed document's embeddings into a Chroma vector database for use with RAG-enabled LLMs.

Installation

The easiest way to install docp is using pip after activating your virtual environment:

pip install docp

Additional (older) releases can be found either at PyPI or in GitHub Releases.

A note on the installation of dependencies:

To keep the installation dependencies to a minimum, only core libraries are required for installation. This means the parser-specific and loader libraries are not installed automatically as part of the pip install command.

If a parser is imported and a required library is not installed, you'll be notified with an easy-to-read message listing the missing dependency or dependencies.

The rationale behind this design decision is that not all users will need the document loading capability, so torch, langchain, etc. should not be installed automatically. For example, if your project requires a simple PDF parser, you don't need to (and likely don't want to) 'clutter' your environment with something as heavy as torch, nor make your project dependent on it.
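The optional-dependency behaviour described above can be illustrated with a minimal sketch. This is not docp's actual implementation; the function names `missing_dependencies` and `require` are hypothetical, and the sketch only assumes the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def missing_dependencies(required):
    """Return the top-level module names in `required` that cannot be imported."""
    return [name for name in required if importlib.util.find_spec(name) is None]


def require(feature, required):
    """Raise ImportError with a readable message if any dependency is missing.

    `feature` is a free-text label (hypothetical), used only in the message.
    """
    missing = missing_dependencies(required)
    if missing:
        noun = "dependency" if len(missing) == 1 else "dependencies"
        raise ImportError(
            f"{feature} requires the following {noun}, "
            f"which could not be found: {', '.join(missing)}"
        )
```

A parser module could call something like `require('The PDF parser', [...])` at import time, so that users who never import that parser never need its libraries installed.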

The Toolset

Parsers

As of this release, parsers for the following binary document types are supported:

  • PDF
  • MS PowerPoint (PPTX)
  • (more coming soon)

Loaders

In addition to document parsing, document loading functionality is built in as well. Specifically, documents can be loaded into a Chroma vector database for RAG-enabled LLM ingestion.

For example, you may wish to load a series of PDF files into a vector database which serves as the backend for a RAG-enabled LLM chatbot. The ChromaLoader class is specifically designed for this. A single call to the class' loader method results in file retrieval, parsing, splitting, embedding and storage.

For further detail and usage examples, please refer to the project's documentation.
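To give a feel for the splitting stage of the pipeline mentioned above, here is a minimal, library-independent sketch. It is an illustration only and makes no claim about how ChromaLoader splits text internally; the function `split_text` and its `chunk_size` and `overlap` parameters are assumptions introduced for this example.

```python
def split_text(text, chunk_size=200, overlap=20):
    """Split `text` into fixed-size character chunks that overlap slightly.

    Overlap keeps some shared context between neighbouring chunks, which
    can help when each chunk is embedded and retrieved on its own.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

In a real pipeline, each chunk would then be embedded and stored alongside metadata such as the source file and page number.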

Using the Library

The documentation suite contains detailed explanations and example usage for each of the library's importable modules. For detailed documentation, usage examples and links to the source code itself, please refer to the Library API page in the documentation.

Quickstart

For convenience, here are a couple of examples of how to parse the supported document types.

Extract text from a PDF file:

>>> from docp import PDFParser

>>> pdf = PDFParser(path='/path/to/myfile.pdf')
>>> pdf.extract_text()

# Access the content of page 1.
>>> pg1 = pdf.doc.pages[1].content

Extract text from a PowerPoint presentation:

>>> from docp import PPTXParser

>>> pptx = PPTXParser(path='/path/to/myfile.pptx')
>>> pptx.extract_text()

# Access the text on slide 1.
>>> pg1 = pptx.doc.slides[1].content
