A package for intelligently chunking structured documents into a hierarchical, contextual tree.

DocForest

DocForest is a Python library for chunking structured documents such as Markdown and AsciiDoc. It organizes document content into a recursive tree in which each chunk retains the full path of its parent headings. This makes it well suited to RAG (Retrieval-Augmented Generation) systems, semantic search, and other NLP tasks.


Features

  • Hierarchical Chunking: Splits documents based on heading levels, preserving the logical structure.
  • Context Preservation: Each section's content is linked to all its parent headings, providing rich context.
  • Flexible Output: Generates a structured "forest" or "tree" that is easy to traverse and process.
  • Support for Multiple Formats: Handles structured document formats such as Markdown and AsciiDoc.
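
To make the idea of hierarchical chunking with context preservation concrete, here is a minimal standalone sketch. It is not DocForest's implementation and does not use its API: the `Node` class and `build_forest` function are hypothetical names introduced only for illustration. It assumes ATX-style Markdown headings (`#` to `######`) and ignores complications such as fenced code blocks and text that appears before the first heading.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One heading and the text directly beneath it (hypothetical structure)."""
    title: str
    level: int
    path: list[str]  # titles of all parent headings plus this one
    text: str = ""
    children: list["Node"] = field(default_factory=list)


HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def build_forest(markdown: str) -> list[Node]:
    """Parse ATX-style Markdown headings into a list of root nodes."""
    roots: list[Node] = []
    stack: list[Node] = []  # chain of currently open headings, outermost first
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2)
            # Close any open heading at the same or a deeper level.
            while stack and stack[-1].level >= level:
                stack.pop()
            parent_path = stack[-1].path if stack else []
            node = Node(title, level, parent_path + [title])
            (stack[-1].children if stack else roots).append(node)
            stack.append(node)
        elif stack and line.strip():
            # Body text belongs to the innermost open heading.
            # Text before the first heading is dropped in this sketch.
            sep = " " if stack[-1].text else ""
            stack[-1].text += sep + line.strip()
    return roots


sample = """# Guide
Intro.
## Install
Run pip.
### Linux
Use apt.
## Usage
Call chunk.
"""

forest = build_forest(sample)
linux = forest[0].children[0].children[0]
print(" > ".join(linux.path), "|", linux.text)
# prints: Guide > Install > Linux | Use apt.
```

Storing the full heading path on every node means a chunk can be embedded or indexed on its own without losing the context that its parent headings provide.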

Installation

Install docforest from PyPI:

pip install docforest

Usage

from docforest import DocForest, DocStyle

# Create a DocForest instance with the desired document style
forest = DocForest(style=DocStyle.MARKDOWN)

# Chunk a document by passing its text content
markdown_text = "# Title\n\nIntro text.\n\n## Section\n\nSection text."
forest.chunk(content=markdown_text)

License

This project is licensed under the MIT License. See the LICENSE file for details.
