docitup is a Python package designed to simplify document processing for LangChain. It provides various loaders to extract content from different file types and convert them into LangChain-compatible document classes, ready for storage in LangChain-supported vector stores.
Docitup
This package provides document loaders that rely on different libraries to process and chunk documents. It loads documents in various formats into a structured form suitable for use with LangChain vector stores.
Overview
The package includes the following loaders:
- PyMUPdf4LLMLoader: Loads and splits documents from files using the pymupdf4llm library.
- MarkitdownLoader: Loads documents using the MarkItDown library.
- LlamaparseLoader: Loads documents using the LlamaParse library and processes different file types.
- DoclingPDFLoader: Converts documents to text and splits them accordingly.
Installation
To install this package, simply run:
pip install docitup
Usage
PyMUPdf4LLMLoader
from docitup import PyMUPdf4LLMLoader
loader = PyMUPdf4LLMLoader(file_path='path/to/your/file.pdf')
documents = loader.load()
MarkitdownLoader
from docitup import MarkitdownLoader
loader = MarkitdownLoader(file_path='path/to/your/file.md')
documents = loader.load()
LlamaparseLoader
from docitup import LlamaparseLoader
from llama_parse.utils import ResultType
loader = LlamaparseLoader(file_path='path/to/your/directory', result_type=ResultType.MD, api_key='your_api_key')
documents = loader.load()
DoclingPDFLoader
from docitup import DoclingLoader
loader = DoclingLoader(file_path='path/to/your/file.pdf')
documents = loader.load()
FitzPyMUPDFLoader
from docitup import FitzPyMUPDFLoader
loader = FitzPyMUPDFLoader(file_path='path/to/your/file.pdf')
documents = loader.load()
PyPdfLoader
from docitup import PyPdfLoader
loader = PyPdfLoader(file_path='path/to/your/file.pdf')
documents = loader.load()
PyPdfLoader2
from docitup import PyPdfLoader2
loader = PyPdfLoader2(file_path='path/to/your/file.pdf')
documents = loader.load()
Configuration Options
Each loader can be configured with the following optional parameters:
- splitter_type: The type of text splitter to use ("recursive" or other).
- chunk_size: The size of each chunk (default is 1000).
- chunk_overlap: The number of overlapping characters between chunks (default is 100).
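To illustrate what chunk_size and chunk_overlap control, here is a minimal sketch of a fixed-window character splitter. It is not docitup's implementation: the function name split_text and its windowing logic are assumptions made for illustration, and the "recursive" splitter presumably behaves differently by preferring natural boundaries such as paragraphs and sentences.

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into character windows of chunk_size that share chunk_overlap characters.

    Hypothetical helper for illustration only; not part of docitup.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

With chunk_size=10 and chunk_overlap=3, each chunk begins with the last three characters of the previous one. A larger overlap preserves more context across chunk boundaries at the cost of storing more duplicated text.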
Example Usage with all parameters
from docitup import LlamaparseLoader
# Initialize the loader
loader = LlamaparseLoader(
    file_path="example.pdf",
    api_key="your_api_key",
    splitter_type="recursive",
    chunk_size=500,
    chunk_overlap=50,
    extra_metadata={"category": "example"},
)
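# Load the documents and iterate over the resulting chunks
for document in loader.load():
    # LangChain Document objects expose their text as page_content
    print("Text Chunk:", document.page_content)
    print("Metadata:", document.metadata)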
Contributing
Contributions are welcome! Please feel free to submit issues or pull requests for improvements or bug fixes.
License
This project is licensed under the MIT License. See the LICENSE file for more information.
Acknowledgements
This package is made possible by the following libraries: