ArborParser is a Python library that parses structured text with hierarchical headings into tree representations, enabling customizable pattern recognition and multi-format exports for outlines, reports, and technical documents.

ArborParser

ArborParser is a Python library that parses structured text documents into a tree representation based on their hierarchical headings. It handles a variety of numbering schemes and tolerates common document inconsistencies, which makes it suitable for outlines, reports, technical documentation, legal texts, and similar material.

Features

Chain Parsing: Converts text into a linear sequence (ChainNode list) representing the document's hierarchical structure.
Multi-Candidate Parsing: parse_to_multi_chain keeps every heading candidate per line, and the rest of the toolkit (tree builder, exporter) works directly on the resulting List[List[ChainNode]].
Flexible Pattern Definition: Define custom parsing patterns using regular expressions and specific number converters (Arabic, Roman, Chinese, Letters, Circled).
Built-in Patterns: Provides ready-to-use patterns for common heading styles (1.2.3, Chapter 1, 第一章, etc.).
Robust Tree Building: Transforms the linear chain into a true hierarchical TreeNode structure.
Automatic Error Correction: Includes an AutoPruneStrategy that handles skipped heading numbers and lines mistakenly identified as headings.
Node Manipulation: Allows merging content between nodes (concat_node, merge_all_children) for post-processing.
Reversible Transformation: Preserves original text, enabling full document reconstruction from the tree (tree.get_full_content()).
Export Capabilities: Outputs the parsed structure in various formats (e.g., human-readable tree view).

Example Transformation:

Original Text

Chapter 1 Animals
1.1 Mammals
1.1.1 Primates
1.2 Reptiles
Chapter 2 Plants
2.1 Angiosperms

Chain Structure (Intermediate)

LEVEL-[]: ROOT
LEVEL-[1]: Animals
LEVEL-[1, 1]: Mammals
LEVEL-[1, 1, 1]: Primates
LEVEL-[1, 2]: Reptiles
LEVEL-[2]: Plants
LEVEL-[2, 1]: Angiosperms

Tree Structure (Final)

ROOT
├─ Chapter 1 Animals
│   ├─ 1.1 Mammals
│   │   └─ 1.1.1 Primates
│   └─ 1.2 Reptiles
└─ Chapter 2 Plants
    └─ 2.1 Angiosperms

Installation

pip install arborparser

Basic Usage

from arborparser.chain import ChainParser
from arborparser.tree import TreeBuilder, TreeExporter, AutoPruneStrategy
from arborparser.pattern import ENGLISH_CHAPTER_PATTERN_BUILDER, NUMERIC_DOT_PATTERN_BUILDER

test_text = """
Chapter 1 Animals
1.1 Mammals
1.1.1 Primates
1.2 Reptiles
Chapter 2 Plants
2.1 Angiosperms
"""

# 1. Define parsing patterns
patterns = [
    ENGLISH_CHAPTER_PATTERN_BUILDER.build(),
    NUMERIC_DOT_PATTERN_BUILDER.build(),
]

# 2. Parse text to chain
parser = ChainParser(patterns)
chain = parser.parse_to_chain(test_text)

# 3. Build tree (using AutoPrune for robustness)
builder = TreeBuilder(strategy=AutoPruneStrategy())
tree = builder.build_tree(chain)

# 4. Print the structured tree
print(TreeExporter.export_tree(tree))
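
To make step 2 less of a black box, the following sketch shows one way a line-by-line heading scan could produce the chain listed earlier. It does not use ArborParser: the two regular expressions and parse_to_chain_sketch are assumptions made for illustration, and the library's own patterns may match differently.

```python
import re
from typing import List, Tuple

# Hypothetical regexes for this sketch; the library's built-in pattern
# builders may match differently.
CHAPTER_RE = re.compile(r"^Chapter\s+(\d+)\s+(.*)$")
NUMERIC_DOT_RE = re.compile(r"^(\d+(?:\.\d+)+)\s+(.*)$")

Chain = List[Tuple[List[int], str]]


def parse_to_chain_sketch(text: str) -> Chain:
    """Scan line by line and record (level sequence, title) for each heading."""
    chain: Chain = [([], "ROOT")]
    for line in text.splitlines():
        stripped = line.strip()
        chapter = CHAPTER_RE.match(stripped)
        if chapter:
            chain.append(([int(chapter.group(1))], chapter.group(2)))
            continue
        numbered = NUMERIC_DOT_RE.match(stripped)
        if numbered:
            levels = [int(part) for part in numbered.group(1).split(".")]
            chain.append((levels, numbered.group(2)))
    return chain


sample = """
Chapter 1 Animals
1.1 Mammals
1.1.1 Primates
1.2 Reptiles
Chapter 2 Plants
2.1 Angiosperms
"""

chain_lines = [
    f"LEVEL-{levels}: {title}" for levels, title in parse_to_chain_sketch(sample)
]
print("\n".join(chain_lines))
```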

Multi-Chain Parsing

Sometimes a line can match multiple heading patterns (or a converter can emit more than one hierarchy). Call ChainParser.parse_to_multi_chain to preserve every candidate per line and let downstream consumers decide which one to keep.

ambiguous_text = """
Chapter 2 Building Blocks
    Content for the second chapter.

2.1 A Component
    Details about the first component.

2.1.1 A details
    Details 1

2.1 .2 A details 2 [the title is corrupted due to OCR or other reasons]
    Details 2

2.2 2-Sided Materials B Component
    Details about the second component.
"""

non_strict = NUMERIC_DOT_PATTERN_BUILDER.modify(
    prefix_regex=r"[\#\s]*",
    suffix_regex=r"[\.\s]*",
    separator=r"[\.\s]+",
    is_sep_regex=True,
    min_level=2,
).build()

patterns = [
    ENGLISH_CHAPTER_PATTERN_BUILDER.build(),
    NUMERIC_DOT_PATTERN_BUILDER.build(),
    non_strict,
]

parser = ChainParser(patterns)
multi_chain = parser.parse_to_multi_chain(ambiguous_text)

print(TreeExporter.export_chain(multi_chain))

builder = TreeBuilder()
tree_from_multi = builder.build_tree(multi_chain)
print(TreeExporter.export_tree(tree_from_multi))

Sample output (abridged):

[LEVEL-[]: ROOT]
[LEVEL-[2]: Building Blocks]
[LEVEL-[2, 1]: A Component, LEVEL-[2, 1]: A Component]
[LEVEL-[2, 1, 1]: A details, LEVEL-[2, 1, 1]: A details]
[LEVEL-[2, 1]: 2 A details 2 [...], LEVEL-[2, 1, 2]: A details 2 [...]]
[LEVEL-[2, 2]: 2-Sided Materials B Component, LEVEL-[2, 2, 2]: -Sided Materials B Component]

ROOT
└─ Chapter 2 Building Blocks
    ├─ 2.1 A Component
    │   ├─ 2.1.1 A details
    │   └─ 2.1 .2 A details 2 [...]
    └─ 2.2 2-Sided Materials B Component

Key points:

Each outer list entry represents a text line (the first entry is still ROOT).
Each inner list is ordered by detection priority. TreeBuilder prefers a candidate that immediately follows the previous node (is_imm_next); if none does, it falls back to the candidate with the lowest pattern_priority.
TreeExporter.export_chain renders each multi-candidate row in square brackets so you can quickly spot OCR errors or ambiguous headings.
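
The selection rule in the key points can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the Candidate class, the pick function, and the exact meaning of "immediately follows" (first child, next sibling, or next sibling of an ancestor) are all hypothetical, and the priorities 1 and 2 stand for the strict and non-strict numeric patterns from the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """Hypothetical heading candidate; not ArborParser's ChainNode."""
    level_seq: Tuple[int, ...]
    title: str
    pattern_priority: int  # assumed: lower value = earlier pattern in the list


def is_immediate_next(prev: Tuple[int, ...], cur: Tuple[int, ...]) -> bool:
    """Assumed definition: first child, next sibling, or next sibling of an ancestor."""
    if cur == prev + (1,):
        return True
    for depth in range(1, len(prev) + 1):
        if cur == prev[: depth - 1] + (prev[depth - 1] + 1,):
            return True
    return False


def pick(prev: Tuple[int, ...], candidates: List[Candidate]) -> Candidate:
    """Prefer a candidate that continues the numbering, else the lowest priority."""
    for candidate in candidates:
        if is_immediate_next(prev, candidate.level_seq):
            return candidate
    return min(candidates, key=lambda c: c.pattern_priority)


rows = [
    [Candidate((2,), "Building Blocks", 0)],
    [Candidate((2, 1), "A Component", 1), Candidate((2, 1), "A Component", 2)],
    [Candidate((2, 1, 1), "A details", 1), Candidate((2, 1, 1), "A details", 2)],
    [Candidate((2, 1), "2 A details 2 [...]", 1), Candidate((2, 1, 2), "A details 2 [...]", 2)],
    [Candidate((2, 2), "2-Sided Materials B Component", 1),
     Candidate((2, 2, 2), "-Sided Materials B Component", 2)],
]

chosen = []
prev: Tuple[int, ...] = ()  # start from ROOT
for row in rows:
    best = pick(prev, row)
    chosen.append(best.level_seq)
    prev = best.level_seq
print(chosen)
```

Under these assumptions the corrupted "2.1 .2" line resolves to [2, 1, 2] because it continues from [2, 1, 1], while "2.2 2-Sided ..." resolves to [2, 2] rather than [2, 2, 2].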

Key Features in Detail

Built-in & Custom Patterns

Quickly parse common formats using builders like NUMERIC_DOT_PATTERN_BUILDER, CHINESE_CHAPTER_PATTERN_BUILDER, etc., or define your own using PatternBuilder for full control over prefixes, suffixes, number types, and separators.

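# Example: Match "Section A.", "Section B."
# PatternBuilder and NumberType must be imported first; their module path
# is not shown in this README, so check the package before copying.
letter_section_pattern = PatternBuilder(
    prefix_regex=r"Section\s",
    number_type=NumberType.LETTER,
    suffix_regex=r"\.",
).build()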
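
As a rough picture of what such a pattern does, the sketch below combines a prefix, a number, and a suffix into one regular expression and converts the matched letter into a level number. It is standalone Python and does not call ArborParser; make_matcher and letter_to_int are hypothetical helpers, and the real PatternBuilder may assemble its regex differently.

```python
import re
from typing import Callable, Optional, Tuple


def letter_to_int(letter: str) -> int:
    """Map 'A' -> 1, 'B' -> 2, ... (single letters only in this sketch)."""
    return ord(letter.upper()) - ord("A") + 1


def make_matcher(
    prefix_regex: str,
    number_regex: str,
    suffix_regex: str,
    convert: Callable[[str], int],
) -> Callable[[str], Optional[Tuple[int, str]]]:
    """Build a function that returns (level number, title) or None for a line."""
    compiled = re.compile(rf"^{prefix_regex}({number_regex}){suffix_regex}\s*(.*)$")

    def match(line: str) -> Optional[Tuple[int, str]]:
        found = compiled.match(line)
        if found is None:
            return None
        return convert(found.group(1)), found.group(2)

    return match


# Mirrors the "Section A." example: prefix, a letter, then a literal dot.
match_section = make_matcher(r"Section\s", r"[A-Za-z]", r"\.", letter_to_int)

print(match_section("Section A. Scope"))
print(match_section("Section B."))
print(match_section("Sections overview"))
```

Keeping the converter separate from the regex is what lets the same prefix and suffix logic serve Arabic, Roman, or letter numbering.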

Automatic Error Correction (AutoPruneStrategy)

Documents aren't always perfect. AutoPruneStrategy (the default for TreeBuilder) handles common issues such as skipped heading numbers (e.g., 1.1 followed by 1.3) and prunes lines incorrectly matched as headings, which makes parsing more robust than with StrictStrategy.

Real-world documents often contain structural inconsistencies that can challenge parsers. Common issues include:

Skipped Heading Numbers: Authors might jump from 1.1 directly to 1.3, omitting 1.2.
False Positives: Regular text lines might accidentally match a heading pattern (e.g., a sentence mentioning "section 1.1").

The AutoPruneStrategy (used by default in TreeBuilder) is designed to handle these imperfections gracefully. It uses heuristics to identify likely errors and prune the intermediate structure, resulting in a more accurate final tree.

Example: Handling Imperfections

Consider the following text, which has a missing section (1.2) and a line of body text beginning with "1.1." that could be mistaken for a heading:

Input Text:

Chapter 1 The Foundation
    Introductory content for the first chapter.

1.1 Core Concepts
    Explanation of the fundamental ideas.
    This section lays the groundwork.

# NOTE: Heading '1.2 Intermediate Concepts' is MISSING here.

1.3 Advanced Topics
    Discussing more complex subjects. We build upon the ideas from section
    1.1. This section is more advanced and goes into more detail.
    # NOTE: The '1.1.' here is text, not a heading.

Chapter 2 Building Blocks
    Content for the second chapter.

2.1 Component A
    Details about the first component.

2.2 Component B
    Details about the second component. End of document.

Intermediate Chain (Before Pruning):

A naive parsing step might initially produce a chain like this, including the misidentified heading:

LEVEL-[]: ROOT
LEVEL-[1]: The Foundation
LEVEL-[1, 1]: Core Concepts
LEVEL-[1, 3]: Advanced Topics
LEVEL-[1, 1]: This section is more advanced and goes into more detail.  <-- POTENTIAL FALSE POSITIVE
LEVEL-[2]: Building Blocks
LEVEL-[2, 1]: Component A
LEVEL-[2, 2]: Component B

How AutoPrune Works:

When building the tree, AutoPruneStrategy analyzes the sequence:

It recognizes that LEVEL-[1, 3] can logically follow LEVEL-[1, 1] even though [1, 2] is missing (a sibling jump).
It sees that the subsequent LEVEL-[1, 1] node ("This section...") goes backwards in the numbering and is followed by a completely different hierarchy (LEVEL-[2]). This discontinuity strongly suggests that the second LEVEL-[1, 1] node is a false positive.
The strategy "prunes" the misidentified node, merging its content back into the preceding valid node (here LEVEL-[1, 3]; the exact content association depends on the implementation).
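
The effect of the steps above can be approximated with a much simpler rule than the library's heuristics: accept a heading only if its number moves forward relative to the last accepted heading, and otherwise fold its text into that heading. The sketch below is an assumption-laden illustration, not AutoPruneStrategy itself; Item, is_forward, and prune are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Item:
    """Hypothetical chain entry; not ArborParser's ChainNode."""
    level_seq: Tuple[int, ...]
    content: str


def is_forward(prev: Tuple[int, ...], cur: Tuple[int, ...]) -> bool:
    """True if cur is a child of prev or a later sibling of prev or an ancestor."""
    if not cur:
        return False
    if len(cur) == len(prev) + 1 and cur[: len(prev)] == prev:
        return True  # direct child, any number
    depth = len(cur)
    return (
        depth <= len(prev)
        and cur[: depth - 1] == prev[: depth - 1]
        and cur[depth - 1] > prev[depth - 1]  # gaps such as 1.1 -> 1.3 are allowed
    )


def prune(chain: List[Item]) -> List[Item]:
    """Drop headings that go backwards, merging their text into the previous one."""
    kept = [Item(chain[0].level_seq, chain[0].content)]
    for item in chain[1:]:
        if is_forward(kept[-1].level_seq, item.level_seq):
            kept.append(Item(item.level_seq, item.content))
        else:
            kept[-1].content += item.content  # keep the text, lose the heading
    return kept


chain = [
    Item((), ""),
    Item((1,), "Chapter 1 The Foundation\n"),
    Item((1, 1), "1.1 Core Concepts\n"),
    Item((1, 3), "1.3 Advanced Topics\n"),
    Item((1, 1), "1.1. This section is more advanced and goes into more detail.\n"),
    Item((2,), "Chapter 2 Building Blocks\n"),
    Item((2, 1), "2.1 Component A\n"),
    Item((2, 2), "2.2 Component B\n"),
]
kept = prune(chain)
kept_levels = [item.level_seq for item in kept]
print(kept_levels)
```

Because pruned text is appended rather than discarded, joining the kept contents still reproduces the original text.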

Final Tree Structure (After AutoPrune):

The resulting tree correctly reflects the intended document structure:

ROOT
├─ Chapter 1 The Foundation
│   ├─ 1.1 Core Concepts
│   └─ 1.3 Advanced Topics  # Correctly handles the jump & ignored false positive
└─ Chapter 2 Building Blocks
    ├─ 2.1 Component A
    └─ 2.2 Component B

Node Operations & Reversibility

ArborParser works with ChainNode (linear sequence) and TreeNode (hierarchical tree) objects. Both inherit from BaseNode, which stores level_seq, title, and the original content string.

Concatenating Content: You can merge the content of one node into another. This is useful internally for associating non-heading text with its preceding heading or for merging nodes during error correction.
```
# Append node B's content to node A
node_a.concat_node(node_b)
```

Merging Children: A parent node can absorb the content of all its descendants.

# Make node_a contain its own content plus all content from its children/grandchildren...
node_a.merge_all_children()

Reconstructing Original Text: Because each node retains its original text chunk (content), you can reconstruct the entire original document from the root TreeNode. This verifies parsing integrity and allows regeneration after modification.
```
# Get the full text back from the parsed tree structure
reconstructed_text = root_node.get_full_content()
assert reconstructed_text == original_text # Verification
```
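
The three operations above fit together because every node owns a contiguous slice of the original text. The following sketch demonstrates that idea with a stand-in class; SketchNode is hypothetical and only borrows the method names from the README, so the real BaseNode and TreeNode may behave differently in detail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchNode:
    """Hypothetical node holding its own slice of the original text."""
    title: str
    content: str
    children: List["SketchNode"] = field(default_factory=list)

    def concat_node(self, other: "SketchNode") -> None:
        """Append another node's text to this node."""
        self.content += other.content

    def merge_all_children(self) -> None:
        """Absorb the text of all descendants, depth first, in document order."""
        for child in self.children:
            child.merge_all_children()
            self.concat_node(child)
        self.children = []

    def get_full_content(self) -> str:
        """Rebuild the text covered by this node and its subtree."""
        return self.content + "".join(c.get_full_content() for c in self.children)


original_text = (
    "Preface line\n"
    "Chapter 1 Animals\n"
    "Intro to animals.\n"
    "1.1 Mammals\n"
    "Warm-blooded.\n"
    "Chapter 2 Plants\n"
)

root = SketchNode("ROOT", "Preface line\n")
ch1 = SketchNode("Chapter 1 Animals", "Chapter 1 Animals\nIntro to animals.\n")
mammals = SketchNode("1.1 Mammals", "1.1 Mammals\nWarm-blooded.\n")
ch2 = SketchNode("Chapter 2 Plants", "Chapter 2 Plants\n")
ch1.children.append(mammals)
root.children.extend([ch1, ch2])

before = root.get_full_content()
ch1.merge_all_children()  # flatten chapter 1 into a single node
after = root.get_full_content()
print(before == original_text, after == original_text)
```

Merging changes the shape of the tree but not the text it covers, which is why reconstruction still works after post-processing.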

Potential Use Cases

Documentation Parsing
Legal Document Analysis (Laws, Contracts)
Outline Processing & Conversion
Report Structuring & Analysis
Content Management System Import
Data Extraction from Structured Text
Format Conversion (e.g., Text to HTML/XML preserving structure)
Better Chunking Strategies for RAG

Contributing

Contributions (pull requests, issues) are welcome!

License

MIT License.

Release history

0.1.6: Nov 19, 2025
0.1.5: Nov 14, 2025 (this version)
0.1.4: Oct 28, 2025
0.1.3: Apr 29, 2025
0.1.2: Apr 8, 2025
0.1.1: Mar 28, 2025
0.1.0: Mar 24, 2025

Download files

Source Distribution

arborparser-0.1.5.tar.gz (30.6 kB), uploaded Nov 14, 2025

Built Distribution

arborparser-0.1.5-py3-none-any.whl (19.5 kB), uploaded Nov 14, 2025, Python 3

File details

Details for the file arborparser-0.1.5.tar.gz.

File metadata

Download URL: arborparser-0.1.5.tar.gz
Upload date: Nov 14, 2025
Size: 30.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for arborparser-0.1.5.tar.gz
Algorithm	Hash digest
SHA256	`9876a41de027399ef0ee8ef78aeedfb02abcdcc2fea2cae4d16285ea075ec5e7`
MD5	`086940c2cf164dae58164fdd81c1ea2d`
BLAKE2b-256	`d77fb66f45e7812663efa22cf3cda6e4f5dce5a401bd67c1d80adefa59136332`

File details

Details for the file arborparser-0.1.5-py3-none-any.whl.

File metadata

Download URL: arborparser-0.1.5-py3-none-any.whl
Upload date: Nov 14, 2025
Size: 19.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for arborparser-0.1.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f3de31d300c6feeb36005cce3aa028505220346a9f370a180ae89080874cc5ac`
MD5	`32abb52b0338c0e0d708a2478511c026`
BLAKE2b-256	`2d0051f2c6dc5c1b0c0490f28b6fbec7f2a8fcea7b1eeb92997636ddf7412f58`
