A basic document parsing and loading utility.
In its simplest form, the docp project is a (doc)ument (p)arsing library.
Written in Python, the project wraps various lower-level libraries, consolidating binary document structure parsing into a single library. It also includes document loaders, which load a parsed document's embeddings into a Chroma vector database for RAG-enabled LLM use.
Installation
The easiest way to install docp is with pip, after activating your virtual environment:
pip install docp
Additional (older) releases can be found either at PyPI or in GitHub Releases.
A note on the installation of dependencies:
To keep installation dependencies to a minimum, only the core libraries are required at install time. The parser-specific and loader libraries are therefore not installed automatically by the pip install command.
If you import a parser whose required library is not installed, you'll be notified with an easy-to-read message listing the missing dependency or dependencies.
The rationale behind this design decision is that not all users will need the document loading capability, so torch, langchain, etc. should not be installed automatically. For example, if your project requires a simple PDF parser, you don't need to (and likely don't want to) 'clutter' your environment with something as heavy as torch, nor make your project dependent on it.
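The check-on-import behaviour described above can be illustrated with a minimal sketch. This is not docp's actual implementation; the helper names `find_missing` and `require` are hypothetical and exist only for this example, which uses the standard library alone.

```python
import importlib.util


def find_missing(required):
    """Return the names in `required` that cannot be found for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in required
            if importlib.util.find_spec(name) is None]


def require(required, feature):
    """Raise a readable ImportError if any library in `required` is absent.

    Hypothetical helper: `feature` is just a label used in the message.
    """
    missing = find_missing(required)
    if missing:
        noun = 'dependency' if len(missing) == 1 else 'dependencies'
        raise ImportError(
            f"{feature} requires the following {noun}, "
            f"which can be installed with pip: {', '.join(missing)}"
        )
```

A parser module written this way would call something like `require([...], 'PDFParser')` before importing its heavy libraries, so users who never touch the loaders never need torch or langchain installed.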
The Toolset
Parsers
As of this release, parsers for the following binary document types are supported:
- PDF
- MS PowerPoint (PPTX)
- (more coming soon)
Loaders
In addition to document parsing, document loading functionality is built in: specifically, loading documents into a Chroma vector database for RAG-enabled LLM ingestion.
For example, you may wish to load a series of PDF files into a vector database which serves as the backend for a RAG-enabled LLM chatbot. The ChromaLoader class is specifically designed for this. A single call to the class' loader method results in file retrieval, parsing, splitting, embedding and storage.
For further detail and usage examples, please refer to the project's documentation.
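Of the stages named above (retrieval, parsing, splitting, embedding, storage), splitting is the simplest to illustrate without third-party libraries. The following is a rough sketch only, not the splitter that docp or its dependencies use; the function `split_text` and its `chunk_size` and `overlap` parameters are assumptions made for this example.

```python
def split_text(text, chunk_size=200, overlap=20):
    """Split `text` into chunks of up to `chunk_size` characters.

    Consecutive chunks share `overlap` characters, so that a sentence
    cut at a chunk boundary still appears whole in at least one chunk
    (provided it is shorter than the overlap).
    """
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    if not 0 <= overlap < chunk_size:
        raise ValueError('overlap must be in the range [0, chunk_size)')
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Ten characters, chunks of four, one character shared between neighbours.
print(split_text('abcdefghij', chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

In a full pipeline, each chunk would then be passed to an embedding model and the resulting vectors written to the vector database; those steps are omitted here because they depend on the optional libraries.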
Using the Library
The documentation suite contains detailed explanations and example usage for each of the library's importable modules. For detailed documentation, usage examples and links to the source code itself, please refer to the Library API page in the documentation.
Quickstart
For convenience, here are a couple of examples of how to parse the supported document types.
Extract text from a PDF file:
>>> from docp import PDFParser
>>> pdf = PDFParser(path='/path/to/myfile.pdf')
>>> pdf.extract_text()
# Access the content of page 1.
>>> pg1 = pdf.doc.pages[1].content
Extract text from a PowerPoint presentation:
>>> from docp import PPTXParser
>>> pptx = PPTXParser(path='/path/to/myfile.pptx')
>>> pptx.extract_text()
# Access the text on slide 1.
>>> pg1 = pptx.doc.slides[1].content