
Node Chunker

A Python package for hierarchical document chunking based on Table of Contents or headers. Creates structured TextNode chunks for use with LlamaIndex.

Overview

The node_chunker package provides tools to intelligently split documents into semantically meaningful chunks by using their table of contents or header structure. The resulting hierarchy preserves the document's structure and creates parent-child relationships between chunks.
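The general idea can be illustrated with a minimal, self-contained sketch (this is not the package's internal code; `sketch_chunks` is a hypothetical helper): split a markdown string on headers and track header levels with a stack, so each chunk records its parent section.

```python
# Illustrative sketch: derive a (level, title, parent) hierarchy
# from markdown headers using a stack of open sections.
import re

def sketch_chunks(markdown: str):
    chunks = []  # list of (level, title, parent_title)
    stack = []   # titles of currently open sections, one per level
    for line in markdown.splitlines():
        m = re.match(r"^(#+)\s+(.*)", line)
        if not m:
            continue  # ignore body text; only headers define structure
        level, title = len(m.group(1)), m.group(2)
        # close any sections at the same or deeper level
        while len(stack) >= level:
            stack.pop()
        parent = stack[-1] if stack else None
        chunks.append((level, title, parent))
        stack.append(title)
    return chunks

doc = "# Guide\n## Install\n## Usage\n### Basics\n## FAQ"
print(sketch_chunks(doc))
# → [(1, 'Guide', None), (2, 'Install', 'Guide'), (2, 'Usage', 'Guide'),
#    (3, 'Basics', 'Usage'), (2, 'FAQ', 'Guide')]
```

The package applies the same principle across formats (PDF TOC entries, HTML heading tags, Word heading styles), emitting LlamaIndex `TextNode` objects instead of tuples.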

Key Features

  • Multi-Format Support: Process PDF, Markdown, HTML, Word documents, Jupyter Notebooks, and reStructuredText
  • PDF Chunking: Leverages the table of contents to create hierarchical chunks
  • Markdown Chunking: Uses headers to create structured document chunks
  • HTML Chunking: Extracts structure from HTML heading tags (h1-h6)
  • Word Document Chunking: Uses heading styles to structure content
  • Jupyter Notebook Chunking: Builds structure from markdown cell headers
  • RST Chunking: Creates chunks based on reStructuredText section structure
  • Hierarchical Structure: Maintains parent-child relationships between document sections
  • LlamaIndex Integration: Creates TextNodes with appropriate metadata and relationships
  • URL Support: Download and process documents directly from URLs
  • Metadata Preservation: Retains page numbers, section titles, and hierarchical context paths
  • Modular Installation: Install only the dependencies you need for specific document formats

Installation

The package name is node-chunker and will soon be available on PyPI.

GitHub Installation

pip install git+https://github.com/KameniAlexNea/llama-index-toc-parser.git@main

Coming Soon: PyPI Installation

Once available on PyPI, you'll be able to install with:

pip install node-chunker

Install with Specific Format Support

# Install only PDF and Markdown support
pip install "node-chunker[pdf,md]"

# Install HTML and Word document support
pip install "node-chunker[html,docx]"

# Install all format support
pip install "node-chunker[all]"

Usage

Basic Usage

from node_chunker.chunks import chunk_document_by_toc_to_text_nodes
from node_chunker.chunks import DocumentFormat

# Process a PDF document (auto-detected by file extension)
pdf_nodes = chunk_document_by_toc_to_text_nodes("path/to/document.pdf")

# Process a Markdown document
markdown_nodes = chunk_document_by_toc_to_text_nodes(
    "path/to/document.md", 
    format_type=DocumentFormat.MARKDOWN
)

# Process an HTML document
html_nodes = chunk_document_by_toc_to_text_nodes(
    "path/to/document.html", 
    format_type=DocumentFormat.HTML
)

# Process a Word document
docx_nodes = chunk_document_by_toc_to_text_nodes(
    "path/to/document.docx", 
    format_type=DocumentFormat.DOCX
)

# Process a Jupyter notebook
jupyter_nodes = chunk_document_by_toc_to_text_nodes(
    "path/to/notebook.ipynb", 
    format_type=DocumentFormat.JUPYTER
)

# Process a reStructuredText document
rst_nodes = chunk_document_by_toc_to_text_nodes(
    "path/to/document.rst", 
    format_type=DocumentFormat.RST
)

# Process a PDF from a URL
url_nodes = chunk_document_by_toc_to_text_nodes(
    "https://example.com/document.pdf", 
    is_url=True
)

# Process raw markdown text
markdown_text = "# Title\nContent\n## Section\nMore content"
text_nodes = chunk_document_by_toc_to_text_nodes(
    markdown_text, 
    format_type=DocumentFormat.MARKDOWN
)

Format Selection

You can explicitly specify which document format to use:

from node_chunker.chunks import chunk_document_by_toc_to_text_nodes, DocumentFormat

# Use the DocumentFormat enum to specify format
nodes = chunk_document_by_toc_to_text_nodes(
    "content.txt",  # Content that's actually markdown
    format_type=DocumentFormat.MARKDOWN
)

# You can also use strings to specify the format
nodes = chunk_document_by_toc_to_text_nodes(
    "content.txt",
    format_type="md"
)

# Check which formats are available with your current dependencies
from node_chunker.chunks import get_supported_formats
available_formats = get_supported_formats()
print(f"Available formats: {available_formats}")

Working with TextNodes

The resulting TextNode objects contain:

  • The text content from each section
  • Metadata including titles, page numbers, and context paths
  • Parent-child relationships between sections
  • Source document references

# Examine the nodes
for node in pdf_nodes:
    print(f"Title: {node.metadata['title']}")
    print(f"Level: {node.metadata['level']}")
    if 'context' in node.metadata:
        print(f"Context path: {node.metadata['context']}")
    if 'page_label' in node.metadata:
        print(f"Pages: {node.metadata['page_label']}")
    print(f"Content: {node.text[:100]}...")
    print("---")
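A context path like the one above can also be reconstructed by walking parent links upward. A hypothetical sketch of that traversal (plain dicts stand in for `TextNode` objects and their relationships; the names `chunks` and `context_path` are illustrative):

```python
# Hypothetical flat store of chunks keyed by node id; "parent" mimics
# the parent relationship the package records on each TextNode.
chunks = {
    "n1": {"title": "Guide", "parent": None},
    "n2": {"title": "Usage", "parent": "n1"},
    "n3": {"title": "Basics", "parent": "n2"},
}

def context_path(node_id):
    """Walk parent links to the root and join titles root-first."""
    parts = []
    while node_id is not None:
        node = chunks[node_id]
        parts.append(node["title"])
        node_id = node["parent"]
    return " > ".join(reversed(parts))

print(context_path("n3"))  # → Guide > Usage > Basics
```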

Command Line Interface

The package includes a simple CLI example in example/main.py:

python -m example.main --source document.pdf --verbose
python -m example.main --source document.md --markdown
python -m example.main --source https://example.com/doc.pdf --url

Document Chunking Classes

PDFTOCChunker

Chunks PDF documents based on their table of contents structure:

from node_chunker.pdf_chunking import PDFTOCChunker

chunker = PDFTOCChunker(pdf_path="document.pdf", source_display_name="document.pdf")
text_nodes = chunker.get_text_nodes()

MarkdownTOCChunker

Chunks Markdown documents based on header structure:

from node_chunker.md_chunking import MarkdownTOCChunker

with open("document.md", "r") as f:
    markdown_text = f.read()

chunker = MarkdownTOCChunker(markdown_text, source_display_name="document.md")
text_nodes = chunker.get_text_nodes()

HTMLTOCChunker

Chunks HTML documents based on heading tags:

from node_chunker.html_chunking import HTMLTOCChunker

with open("document.html", "r") as f:
    html_content = f.read()

chunker = HTMLTOCChunker(html_content, source_display_name="document.html")
text_nodes = chunker.get_text_nodes()

DOCXTOCChunker

Chunks Word documents based on heading styles:

from node_chunker.docx_chunking import DOCXTOCChunker

chunker = DOCXTOCChunker(docx_path="document.docx", source_display_name="document.docx")
text_nodes = chunker.get_text_nodes()

JupyterNotebookTOCChunker

Chunks Jupyter notebooks based on markdown cell headers:

from node_chunker.jupyter_chunking import JupyterNotebookTOCChunker

chunker = JupyterNotebookTOCChunker(notebook_path="notebook.ipynb", source_display_name="notebook.ipynb")
text_nodes = chunker.get_text_nodes()

RSTTOCChunker

Chunks reStructuredText documents based on section structure:

from node_chunker.rst_chunking import RSTTOCChunker

with open("document.rst", "r") as f:
    rst_content = f.read()

chunker = RSTTOCChunker(rst_content, source_display_name="document.rst")
text_nodes = chunker.get_text_nodes()

Why Use Node Chunker?

Traditional document chunking approaches often split documents based on fixed token counts or arbitrary boundaries, which can break the semantic integrity of the content. node_chunker preserves the logical structure of documents by:

  1. Respecting the author's own content organization (TOC/headers)
  2. Maintaining hierarchical relationships between sections
  3. Preserving metadata about document structure
  4. Creating chunks that align with human understanding of the document

This structure is particularly valuable for:

  • Question answering systems
  • Document summarization
  • Information retrieval applications
  • Knowledge graph construction

Requirements

  • Python 3.10+
  • llama-index-core
  • requests

Format-specific dependencies:

  • PDF: PyMuPDF (fitz)
  • HTML: BeautifulSoup4
  • Word: python-docx
  • Jupyter: nbformat
  • RST: docutils
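Because these dependencies are optional, availability can be probed at runtime without triggering an `ImportError`. A small sketch using only the standard library (the package itself exposes `get_supported_formats()` for this; `has_module` is an illustrative helper):

```python
# Check whether an optional dependency is importable without importing it.
import importlib.util

def has_module(name: str) -> bool:
    """True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(name) is not None

# PDF support requires PyMuPDF, which is importable as "fitz"
print(f"PDF support available: {has_module('fitz')}")
```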

Development

To set up the development environment:

git clone https://github.com/KameniAlexNea/llama-index-toc-parser.git
cd llama-index-toc-parser
pip install -e ".[dev,all]"

Run tests with:

tox

License

This project is licensed under the MIT License - see the LICENSE file for details.
