Docitup

This package provides document loaders that process and chunk documents using different parsing backends. It loads documents in various formats into a structured form suitable for use with LangChain vector databases.

Overview

The package includes the following loaders:

  • PyMUPdf4LLMLoader: Loads and splits documents from files using the pymupdf4llm library.
  • MarkitdownLoader: Loads documents using the MarkItDown library.
  • LlamaparseLoader: Loads documents using the LlamaParse library and processes different file types.
  • DoclingPDFLoader: Converts documents to text and splits them accordingly.

Installation

To install this package, simply run:

pip install docitup 

Usage

PyMUPdf4LLMLoader

from docitup.pymupdf4llm_loaders import PyMUPdf4LLMLoader 
  
loader = PyMUPdf4LLMLoader(file_path='path/to/your/file.pdf')  
documents = loader.load()   

MarkitdownLoader

from docitup.markitdown_loaders import MarkitdownLoader

loader = MarkitdownLoader(file_path='path/to/your/file.md')
documents = loader.load()

LlamaparseLoader

from docitup.llamaparse_loaders import LlamaparseLoader
from llama_parse.utils import ResultType
  
loader = LlamaparseLoader(file_path='path/to/your/directory', result_type=ResultType.MD, api_key='your_api_key')  
documents = loader.load()  
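
The example above passes the API key as a string literal. A common alternative is to read it from an environment variable so the key stays out of source control. The sketch below is one possible way to do that; the helper name `read_api_key` is hypothetical, and `LLAMA_CLOUD_API_KEY` is assumed here as the variable name because it is the one LlamaParse conventionally uses.

```python
import os


def read_api_key(env_var: str = "LLAMA_CLOUD_API_KEY") -> str:
    """Return the API key stored in env_var, or raise if it is missing.

    read_api_key is a hypothetical helper, not part of docitup.
    """
    # Treat an unset variable and a whitespace-only value the same way.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before loading documents."
        )
    return key
```

The returned value can then be passed as `api_key=read_api_key()` when constructing the loader.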

DoclingPDFLoader

from docitup.docling_loaders import DoclingLoader
  
loader = DoclingLoader(file_path='path/to/your/file.pdf')  
documents = loader.load()

Configuration Options

Each loader can be configured with the following optional parameters:

  • splitter_type: The type of text splitter to use ("recursive" or other).
  • chunk_size: The size of each chunk (default is 1000).
  • chunk_overlap: The number of overlapping characters between chunks (default is 100).
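
To make the meaning of chunk_size and chunk_overlap concrete, here is a minimal, self-contained sketch of fixed-size chunking with overlap. It is a simplified illustration, not the package's implementation: the function name `chunk_text` is hypothetical, and a "recursive" splitter would typically also try to break at paragraph or sentence boundaries, which this sketch does not do.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk after the first starts chunk_overlap characters before the
    end of the previous one, so neighbouring chunks share that much text.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Small values make the overlap easy to see.
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

With the defaults listed above (1000 and 100), the window advances 900 characters at a time, so a 2500-character text yields three chunks in this sketch.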

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests for improvements or bug fixes.

License

This project is licensed under the MIT License. See the LICENSE file for more information.

Acknowledgements

This package is made possible by the libraries its loaders build on, including pymupdf4llm, MarkItDown, LlamaParse, and Docling.
