
Node Chunker

A Python package for hierarchical document chunking based on Table of Contents or headers. Creates structured TextNode chunks for use with LlamaIndex.

Overview

The node_chunker package provides tools to intelligently split documents into semantically meaningful chunks by using their table of contents or header structure. The resulting hierarchy preserves the document's structure and creates parent-child relationships between chunks.
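The general idea can be illustrated with a minimal, self-contained sketch (this is not the package's internal code; `sketch_chunks` is a hypothetical helper): split a markdown string on headers and track header levels with a stack, so each chunk records its parent section.

```python
# Illustrative sketch: derive a (level, title, parent) hierarchy
# from markdown headers using a stack of open sections.
import re

def sketch_chunks(markdown: str):
    chunks = []  # list of (level, title, parent_title)
    stack = []   # titles of currently open sections, one per level
    for line in markdown.splitlines():
        m = re.match(r"^(#+)\s+(.*)", line)
        if not m:
            continue  # ignore body text; only headers define structure
        level, title = len(m.group(1)), m.group(2)
        # close any sections at the same or deeper level
        while len(stack) >= level:
            stack.pop()
        parent = stack[-1] if stack else None
        chunks.append((level, title, parent))
        stack.append(title)
    return chunks

doc = "# Guide\n## Install\n## Usage\n### Basics\n## FAQ"
print(sketch_chunks(doc))
# → [(1, 'Guide', None), (2, 'Install', 'Guide'), (2, 'Usage', 'Guide'),
#    (3, 'Basics', 'Usage'), (2, 'FAQ', 'Guide')]
```

The package applies the same principle across formats (PDF TOC entries, HTML heading tags, Word heading styles), emitting LlamaIndex `TextNode` objects instead of tuples.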

Key Features

  • Multi-Format Support: Process PDF, Markdown, HTML, Word documents, Jupyter Notebooks, and reStructuredText
  • PDF Chunking: Leverages the table of contents to create hierarchical chunks
  • Markdown Chunking: Uses headers to create structured document chunks
  • HTML Chunking: Extracts structure from HTML heading tags (h1-h6)
  • Word Document Chunking: Uses heading styles to structure content
  • Jupyter Notebook Chunking: Builds structure from markdown cell headers
  • RST Chunking: Creates chunks based on reStructuredText section structure
  • Hierarchical Structure: Maintains parent-child relationships between document sections
  • LlamaIndex Integration: Creates TextNodes with appropriate metadata and relationships
  • URL Support: Download and process documents directly from URLs
  • Metadata Preservation: Retains page numbers, section titles, and hierarchical context paths
  • Modular Installation: Install only the dependencies you need for specific document formats

Installation

The package name is node-chunker and will soon be available on PyPI.

GitHub Installation

pip install git+https://github.com/KameniAlexNea/llama-index-toc-parser.git@main

Coming Soon: PyPI Installation

Once available on PyPI, you'll be able to install with:

pip install node-chunker

Install with Specific Format Support

# Install only PDF and Markdown support
pip install "node-chunker[pdf,md]"

# Install HTML and Word document support
pip install "node-chunker[html,docx]"

# Install all format support
pip install "node-chunker[all]"

Usage

Basic Usage

from node_chunker.chunks import chunk_document_by_toc_to_text_nodes
from node_chunker.chunks import DocumentFormat

# Process a PDF document (auto-detected by file extension)
pdf_nodes = chunk_document_by_toc_to_text_nodes("path/to/document.pdf")

# Process a Markdown document
markdown_nodes = chunk_document_by_toc_to_text_nodes(
    "path/to/document.md", 
    format_type=DocumentFormat.MARKDOWN
)

# Process an HTML document
html_nodes = chunk_document_by_toc_to_text_nodes(
    "path/to/document.html", 
    format_type=DocumentFormat.HTML
)

# Process a Word document
docx_nodes = chunk_document_by_toc_to_text_nodes(
    "path/to/document.docx", 
    format_type=DocumentFormat.DOCX
)

# Process a Jupyter notebook
jupyter_nodes = chunk_document_by_toc_to_text_nodes(
    "path/to/notebook.ipynb", 
    format_type=DocumentFormat.JUPYTER
)

# Process a reStructuredText document
rst_nodes = chunk_document_by_toc_to_text_nodes(
    "path/to/document.rst", 
    format_type=DocumentFormat.RST
)

# Process a PDF from a URL
url_nodes = chunk_document_by_toc_to_text_nodes(
    "https://example.com/document.pdf", 
    is_url=True
)

# Process raw markdown text
markdown_text = "# Title\nContent\n## Section\nMore content"
text_nodes = chunk_document_by_toc_to_text_nodes(
    markdown_text, 
    format_type=DocumentFormat.MARKDOWN
)

Format Selection

You can explicitly specify which document format to use:

from node_chunker.chunks import chunk_document_by_toc_to_text_nodes, DocumentFormat

# Use the DocumentFormat enum to specify format
nodes = chunk_document_by_toc_to_text_nodes(
    "content.txt",  # Content that's actually markdown
    format_type=DocumentFormat.MARKDOWN
)

# You can also use strings to specify the format
nodes = chunk_document_by_toc_to_text_nodes(
    "content.txt",
    format_type="md"
)

# Check which formats are available with your current dependencies
from node_chunker.chunks import get_supported_formats
available_formats = get_supported_formats()
print(f"Available formats: {available_formats}")

Working with TextNodes

The resulting TextNode objects contain:

  • The text content from each section
  • Metadata including titles, page numbers, and context paths
  • Parent-child relationships between sections
  • Source document references

# Examine the nodes
for node in pdf_nodes:
    print(f"Title: {node.metadata['title']}")
    print(f"Level: {node.metadata['level']}")
    if 'context' in node.metadata:
        print(f"Context path: {node.metadata['context']}")
    if 'page_label' in node.metadata:
        print(f"Pages: {node.metadata['page_label']}")
    print(f"Content: {node.text[:100]}...")
    print("---")
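A context path like the one above can also be reconstructed by walking parent links upward. A hypothetical sketch of that traversal (plain dicts stand in for `TextNode` objects and their relationships; the names `chunks` and `context_path` are illustrative):

```python
# Hypothetical flat store of chunks keyed by node id; "parent" mimics
# the parent relationship the package records on each TextNode.
chunks = {
    "n1": {"title": "Guide", "parent": None},
    "n2": {"title": "Usage", "parent": "n1"},
    "n3": {"title": "Basics", "parent": "n2"},
}

def context_path(node_id):
    """Walk parent links to the root and join titles root-first."""
    parts = []
    while node_id is not None:
        node = chunks[node_id]
        parts.append(node["title"])
        node_id = node["parent"]
    return " > ".join(reversed(parts))

print(context_path("n3"))  # → Guide > Usage > Basics
```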

Command Line Interface

The package includes a simple CLI example in example/main.py:

python -m example.main --source document.pdf --verbose
python -m example.main --source document.md --markdown
python -m example.main --source https://example.com/doc.pdf --url

Document Chunking Classes

PDFTOCChunker

Chunks PDF documents based on their table of contents structure:

from node_chunker.pdf_chunking import PDFTOCChunker

chunker = PDFTOCChunker(pdf_path="document.pdf", source_display_name="document.pdf")
text_nodes = chunker.get_text_nodes()

MarkdownTOCChunker

Chunks Markdown documents based on header structure:

from node_chunker.md_chunking import MarkdownTOCChunker

with open("document.md", "r") as f:
    markdown_text = f.read()

chunker = MarkdownTOCChunker(markdown_text, source_display_name="document.md")
text_nodes = chunker.get_text_nodes()

HTMLTOCChunker

Chunks HTML documents based on heading tags:

from node_chunker.html_chunking import HTMLTOCChunker

with open("document.html", "r") as f:
    html_content = f.read()

chunker = HTMLTOCChunker(html_content, source_display_name="document.html")
text_nodes = chunker.get_text_nodes()

DOCXTOCChunker

Chunks Word documents based on heading styles:

from node_chunker.docx_chunking import DOCXTOCChunker

chunker = DOCXTOCChunker(docx_path="document.docx", source_display_name="document.docx")
text_nodes = chunker.get_text_nodes()

JupyterNotebookTOCChunker

Chunks Jupyter notebooks based on markdown cell headers:

from node_chunker.jupyter_chunking import JupyterNotebookTOCChunker

chunker = JupyterNotebookTOCChunker(notebook_path="notebook.ipynb", source_display_name="notebook.ipynb")
text_nodes = chunker.get_text_nodes()

RSTTOCChunker

Chunks reStructuredText documents based on section structure:

from node_chunker.rst_chunking import RSTTOCChunker

with open("document.rst", "r") as f:
    rst_content = f.read()

chunker = RSTTOCChunker(rst_content, source_display_name="document.rst")
text_nodes = chunker.get_text_nodes()

Why Use Node Chunker?

Traditional document chunking approaches often split documents based on fixed token counts or arbitrary boundaries, which can break the semantic integrity of the content. node_chunker preserves the logical structure of documents by:

  1. Respecting the author's own content organization (TOC/headers)
  2. Maintaining hierarchical relationships between sections
  3. Preserving metadata about document structure
  4. Creating chunks that align with human understanding of the document

This structure is particularly valuable for:

  • Question answering systems
  • Document summarization
  • Information retrieval applications
  • Knowledge graph construction

Requirements

  • Python 3.10+
  • llama-index-core
  • requests

Format-specific dependencies:

  • PDF: PyMuPDF (fitz)
  • HTML: BeautifulSoup4
  • Word: python-docx
  • Jupyter: nbformat
  • RST: docutils
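Because these dependencies are optional, availability can be probed at runtime without triggering an `ImportError`. A small sketch using only the standard library (the package itself exposes `get_supported_formats()` for this; `has_module` is an illustrative helper):

```python
# Check whether an optional dependency is importable without importing it.
import importlib.util

def has_module(name: str) -> bool:
    """True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(name) is not None

# PDF support requires PyMuPDF, which is importable as "fitz"
print(f"PDF support available: {has_module('fitz')}")
```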

Development

To set up the development environment:

git clone https://github.com/KameniAlexNea/llama-index-toc-parser.git
cd llama-index-toc-parser
pip install -e ".[dev,all]"

Run tests with:

tox

License

This project is licensed under the MIT License - see the LICENSE file for details.
