A basic document parsing and loading utility.

In its simplest form, the docp project is a (doc)ument (p)arsing library.

Written in CPython, the project wraps various lower-level libraries, consolidating binary document parsing functionality into a single library. It also includes document loaders, which load a parsed document's embeddings into a Chroma vector database for use with RAG-enabled LLMs.

Installation

The easiest way to install docp is using pip after activating your virtual environment:

pip install docp

Additional (older) releases can be found either at PyPI or in GitHub Releases.

A note on the installation of dependencies:

To keep the installation dependencies to a minimum, only the core libraries are required for installation. This means the parser-specific and loader libraries are not installed automatically as part of the pip install command.

If a parser is imported and one of its required libraries is not installed, you'll be notified with an easy-to-read message listing the missing dependencies.

The rationale behind this design decision is that not all users will need the document loading capability, so torch, langchain, etc. should not be installed automatically. For example, if your project requires a simple PDF parser, you don't need to (and likely don't want to) 'clutter' your environment with something as heavy as torch, nor make your project dependent on it.
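The optional-dependency behaviour described above can be sketched as follows. This is a minimal illustration of the general pattern, not docp's actual implementation; the helper names missing_modules and require are hypothetical and are not part of the docp API.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def require(names, feature):
    """Raise a readable ImportError if a module needed by `feature` is absent.

    Hypothetical helper: docp's own check and message may differ.
    """
    missing = missing_modules(names)
    if missing:
        noun = 'dependency' if len(missing) == 1 else 'dependencies'
        raise ImportError(
            f"{feature} requires the following {noun}, "
            f"which could not be found: {', '.join(missing)}"
        )


# Standard-library modules are always importable, so this passes silently.
require(['json', 'pathlib'], feature='demo feature')
```

Checking with find_spec, rather than importing eagerly, means a heavy library is only loaded when the feature that needs it is actually used.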

The Toolset

Parsers

As of this release, parsers for the following binary document types are supported:

PDF
MS PowerPoint (PPTX)
(more coming soon)

Loaders

In addition to document parsing, document loading functionality is built in as well. Specifically, documents can be loaded into a Chroma vector database for ingestion by a RAG-enabled LLM.

For example, you may wish to load a series of PDF files into a vector database which serves as the backend for a RAG-enabled LLM chatbot. The ChromaLoader class is specifically designed for this. A single call to the class's loader method results in file retrieval, parsing, splitting, embedding and storage.
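To make the splitting, embedding and storage stages concrete, here is a toy sketch of such a pipeline in plain Python. It is not the ChromaLoader implementation and does not use Chroma: split_text, toy_embed, InMemoryStore and load_documents are hypothetical stand-ins, and the text is assumed to have been retrieved and parsed already.

```python
from dataclasses import dataclass, field


def split_text(text, chunk_size=200, overlap=20):
    """Split text into fixed-size character chunks that overlap slightly."""
    if not 0 <= overlap < chunk_size:
        raise ValueError('overlap must be non-negative and smaller than chunk_size')
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def toy_embed(chunk, dims=8):
    """Stand-in embedding: a normalised histogram of character codes.

    A real loader would call an embedding model here instead.
    """
    vec = [0.0] * dims
    for ch in chunk:
        vec[ord(ch) % dims] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]


@dataclass
class InMemoryStore:
    """Stand-in for a vector database collection."""
    records: list = field(default_factory=list)

    def add(self, source, chunk, embedding):
        self.records.append({'source': source, 'text': chunk, 'embedding': embedding})


def load_documents(parsed_docs, store, chunk_size=200, overlap=20):
    """Split, embed and store already-parsed text, keyed by source path."""
    for source, text in parsed_docs.items():
        for chunk in split_text(text, chunk_size, overlap):
            store.add(source, chunk, toy_embed(chunk))
    return store


store = load_documents({'/path/to/myfile.pdf': 'lorem ipsum ' * 50}, InMemoryStore())
```

Overlapping chunks are a common choice for retrieval, since a sentence cut at a chunk boundary still appears whole in one of the neighbouring chunks.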

For further detail and usage examples, please refer to the project's documentation.

Using the Library

The documentation suite contains a detailed explanation and example usage for each of the library's importable modules. For usage examples and links to the source code itself, please refer to the Library API page in the documentation.

Quickstart

For convenience, here are a couple of examples of how to parse the supported document types.

Extract text from a PDF file:

>>> from docp import PDFParser

>>> pdf = PDFParser(path='/path/to/myfile.pdf')
>>> pdf.extract_text()

>>> # Access the content of page 1.
>>> pg1 = pdf.doc.pages[1].content

Extract text from a PowerPoint presentation:

>>> from docp import PPTXParser

>>> pptx = PPTXParser(path='/path/to/myfile.pptx')
>>> pptx.extract_text()

>>> # Access the text on slide 1.
>>> pg1 = pptx.doc.slides[1].content
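When a project handles both document types, the two examples above can be combined by choosing the parser from the file extension. The following is a minimal sketch under that assumption: PARSERS, parser_name_for and parse are hypothetical helpers and are not part of docp. Only the PDFParser and PPTXParser names, the path argument and the extract_text() call are taken from the examples above.

```python
from pathlib import Path

# File extension mapped to the docp parser class name used in the examples above.
PARSERS = {'.pdf': 'PDFParser', '.pptx': 'PPTXParser'}


def parser_name_for(path):
    """Return the name of the docp parser class matching the file's extension."""
    suffix = Path(path).suffix.lower()
    try:
        return PARSERS[suffix]
    except KeyError:
        raise ValueError(f'No parser available for {suffix!r} files: {path}') from None


def parse(path):
    """Instantiate the matching parser, extract the text and return the parser."""
    import docp  # Imported lazily so the mapping above works without docp installed.

    parser = getattr(docp, parser_name_for(path))(path=str(path))
    parser.extract_text()
    return parser
```

Calling parse('/path/to/myfile.pdf') would then return a parser whose doc attribute can be read as shown in the examples above.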

Release history

0.2.0 (this version): Feb 12, 2025
0.1.0b1 (pre-release): Jan 16, 2025
0.0.0.dev1 (pre-release): Jan 6, 2025

Download files

Source Distribution

docp-0.2.0.tar.gz (8.4 MB), uploaded Feb 12, 2025

Built Distribution

docp-0.2.0-py3-none-any.whl (68.1 kB), uploaded Feb 12, 2025, Python 3

File details

Details for the file docp-0.2.0.tar.gz.

File metadata

Download URL: docp-0.2.0.tar.gz
Upload date: Feb 12, 2025
Size: 8.4 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.7

File hashes

Hashes for docp-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`6e3255de5a8b45de9e0a5e1ff0cc57c7198fc17acbd692b9c738c1a4aa5ae120`
MD5	`787fb4c302aded6346de35d4cbc40391`
BLAKE2b-256	`5ce1ac074f5dc568c5fcbd57b86fbabc5dc4919e989bac85d31fabf0d9cf11ec`

File details

Details for the file docp-0.2.0-py3-none-any.whl.

File metadata

Download URL: docp-0.2.0-py3-none-any.whl
Upload date: Feb 12, 2025
Size: 68.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.7

File hashes

Hashes for docp-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`833dd9bb4ae5167bce5dae9b90467e9a4558467c80522758efbf9e6762ac7652`
MD5	`e23278c6a34656fab49b7f0b1812d50a`
BLAKE2b-256	`ccb0e41a74b290a5cc17fdcb906e5f47c597a862596a81907595e9c71814ff95`
