C++ chunker

These details have not been verified by PyPI

Project description

Semantic Text Chunker

Advanced semantic text chunking library that preserves meaning and context

Overview

This project provides a C++ library (with Python bindings via pybind11) for fast semantically chunking large bodies of text. It aims to split text into coherent, context-preserving segments using semantic similarity, discourse markers, and section detection, fast.

Features

Semantic chunking of text based on meaning and context
Adjustable chunk size and coherence thresholds
Extraction of chunk details: coherence scores, dominant topics, section types
Python bindings for easy integration
Memory-safe C++ implementation with comprehensive error handling

Installation

Prerequisites

C++17 compiler (GCC 7+, Clang 5+, or MSVC 2017+)
pybind11 (version 2.10.0+)
CMake (version 3.16+)
Python 3.7+

Installing pybind11

Option 1: Using pip (Recommended)

pip install pybind11

Option 2: Using conda

conda install pybind11

Option 3: From source

git clone https://github.com/pybind/pybind11.git
cd pybind11
pip install .

Option 4: System package manager

# Ubuntu/Debian
sudo apt-get install pybind11-dev

# macOS with Homebrew
brew install pybind11

# Arch Linux
sudo pacman -S pybind11

Build Instructions

Method 1: Using CMake (Recommended)

mkdir build
cd build
cmake ..
make

Method 2: Using pip (Automatic build)

pip install .

Method 3: Development installation

pip install -e .

Usage

Basic Usage

Simple Text Chunking

import chunker_cpp

text = """This is a long text for testing the semantic chunker. It contains multiple sentences, some of which are quite lengthy and elaborate, while others are short. The purpose of this text is to simulate a realistic document that might be processed by the chunker.

In addition to regular sentences, this text includes various structures such as lists:
- First item in the list.
- Second item, which is a bit longer and more descriptive.
- Third item.

There are also paragraphs separated by blank lines.

Here is a new paragraph. It discusses a different topic and is intended to test how the chunker handles paragraph boundaries. Sometimes, paragraphs can be very long, spanning several lines and containing a lot of information. Other times, they are short.

Finally, this text ends with a concluding sentence to ensure the chunker can handle the end of input gracefully."""

# Basic chunking with default parameters
chunks = chunker_cpp.chunk_text_semantically(text)
print("Number of chunks:", len(chunks))
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk[:100]}...")

Advanced Usage with Custom Parameters

# Custom chunking parameters
chunks = chunker_cpp.chunk_text_semantically(
    text,
    max_chunk_size=1000,      # Maximum characters per chunk
    min_chunk_size=200,       # Minimum characters per chunk
    min_coherence_threshold=0.5  # Higher threshold for more coherent chunks
)

Getting Detailed Chunk Information

# Get detailed information about each chunk
chunk_details = chunker_cpp.get_chunk_details(text)

for i, detail in enumerate(chunk_details):
    print(f"Chunk {i+1}:")
    print(f"  Text: {detail['text'][:100]}...")
    print(f"  Coherence Score: {detail['coherence_score']:.3f}")
    print(f"  Dominant Topics: {detail['dominant_topics']}")
    print(f"  Sentence Count: {detail['sentence_count']}")
    print(f"  Section Type: {detail['primary_section_type']}")
    print()

Using the Class Interface

# Create a chunker instance
chunker = chunker_cpp.SemanticTextChunker()

# Use the instance methods
chunks = chunker.chunk_text_semantically(text, max_chunk_size=1500)
details = chunker.get_chunk_details(text, min_coherence_threshold=0.4)

Error Handling and Memory Safety

The library includes comprehensive error handling and memory safety features:

import chunker_cpp

# Safe handling of edge cases
try:
    # Empty string
    result = chunker_cpp.chunk_text_semantically("")
    print("Empty string handled:", result)

    # Very large text
    large_text = "This is a test sentence. " * 10000
    result = chunker_cpp.chunk_text_semantically(large_text, 1000, 200)
    print("Large text processed successfully")

    # Special characters
    special_text = "Text with special chars: \x00\x01\x02\n\r\t" + "A" * 1000
    result = chunker_cpp.chunk_text_semantically(special_text)
    print("Special characters handled safely")

except Exception as e:
    print(f"Error: {e}")

Performance Considerations

import time

# Benchmarking chunking performance
text = "Your large text here..." * 100  # Create a large text

start_time = time.time()
chunks = chunker_cpp.chunk_text_semantically(text)
end_time = time.time()

print(f"Processed {len(text)} characters in {end_time - start_time:.3f} seconds")
print(f"Generated {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(chunk) for chunk in chunks) / len(chunks):.0f} characters")

API Reference

Functions

`chunk_text_semantically(text, max_chunk_size=2000, min_chunk_size=500, min_coherence_threshold=0.3)`

Chunk text semantically while preserving meaning and context.

Parameters:

text (str): The input text to be chunked
max_chunk_size (int): Maximum size for each chunk (default: 2000)
min_chunk_size (int): Minimum size for each chunk (default: 500)
min_coherence_threshold (float): Minimum coherence threshold (default: 0.3)

Returns:

List[str]: List of text chunks

`get_chunk_details(text, max_chunk_size=2000, min_chunk_size=500, min_coherence_threshold=0.3)`

Get detailed information about each chunk including coherence scores and topics.

Parameters:

text (str): The input text to be chunked
max_chunk_size (int): Maximum size for each chunk (default: 2000)
min_chunk_size (int): Minimum size for each chunk (default: 500)
min_coherence_threshold (float): Minimum coherence threshold (default: 0.3)

Returns:

List[Dict[str, Any]]: List of dictionaries containing chunk details

Class

`SemanticTextChunker`

Advanced semantic text chunking class that preserves meaning and context.

Methods:

chunk_text_semantically(text, max_chunk_size=2000, min_chunk_size=500, min_coherence_threshold=0.3)
get_chunk_details(text, max_chunk_size=2000, min_chunk_size=500, min_coherence_threshold=0.3)

Development

Building from Source

# Clone the repository
git clone [<repository-url>](https://github.com/Lumen-Labs/cpp-chunker.git)
cd cpp_chunker

# Install dependencies
pip install pybind11

# Build using CMake
mkdir build && cd build
cmake ..
make

# Or build using pip
pip install .

Running Tests

# Run basic functionality test
python test_cunk.py

# Run memory safety tests
python test_memory_safety.py

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.1.4

Sep 20, 2025

0.1.3

Sep 20, 2025

0.1.2

Sep 20, 2025

0.1.1

Sep 18, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cpp_chunker-0.1.4.tar.gz (11.6 kB view details)

Uploaded Sep 20, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

cpp_chunker-0.1.4-cp312-cp312-macosx_11_0_arm64.whl (144.5 kB view details)

Uploaded Sep 20, 2025 CPython 3.12macOS 11.0+ ARM64

File details

Details for the file cpp_chunker-0.1.4.tar.gz.

File metadata

Download URL: cpp_chunker-0.1.4.tar.gz
Upload date: Sep 20, 2025
Size: 11.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.12.4 Darwin/24.6.0

File hashes

Hashes for cpp_chunker-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`9903ff0ca5f0c3eaede4a07db62ac3426f32bd506ed2b0a041157ddecb7bbc67`
MD5	`b02628e6b870b34eddb42af609f08b1f`
BLAKE2b-256	`ad835f9dab86c88ca09df1fb66ae001e42aaf5961011c0517297873712511678`

See more details on using hashes here.

File details

Details for the file cpp_chunker-0.1.4-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

Download URL: cpp_chunker-0.1.4-cp312-cp312-macosx_11_0_arm64.whl
Upload date: Sep 20, 2025
Size: 144.5 kB
Tags: CPython 3.12, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.12.4 Darwin/24.6.0

File hashes

Hashes for cpp_chunker-0.1.4-cp312-cp312-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`e7bfd512d7a5d3147ff006cdd0e514aa1c6866623c46b37b7def286266c7e74c`
MD5	`2dbf8bcc9fc9a35d81927b2dc8e8c0fb`
BLAKE2b-256	`8e5785b3a430db230b88c540b8c6263c5d930597319d4179681e51fd72cf13a0`

See more details on using hashes here.

cpp-chunker 0.1.4

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

Semantic Text Chunker

Overview

Features

Installation

Prerequisites

Installing pybind11

Option 1: Using pip (Recommended)

Option 2: Using conda

Option 3: From source

Option 4: System package manager

Build Instructions

Method 1: Using CMake (Recommended)

Method 2: Using pip (Automatic build)

Method 3: Development installation

Usage

Basic Usage

Simple Text Chunking

Advanced Usage with Custom Parameters

Getting Detailed Chunk Information

Using the Class Interface

Error Handling and Memory Safety

Performance Considerations

API Reference

Functions

chunk_text_semantically(text, max_chunk_size=2000, min_chunk_size=500, min_coherence_threshold=0.3)

get_chunk_details(text, max_chunk_size=2000, min_chunk_size=500, min_coherence_threshold=0.3)

Class

SemanticTextChunker

Development

Building from Source

Running Tests

License

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`chunk_text_semantically(text, max_chunk_size=2000, min_chunk_size=500, min_coherence_threshold=0.3)`

`get_chunk_details(text, max_chunk_size=2000, min_chunk_size=500, min_coherence_threshold=0.3)`

`SemanticTextChunker`