
Fast character-based boundary detection for sentences and paragraphs

Project description

CharBoundary

A modular library for segmenting text into sentences and paragraphs based on character-level features.

Features

  • Character-level text segmentation
  • Support for sentence and paragraph boundaries
  • Customizable window sizes for context
  • Support for abbreviations
  • Optimized for both accuracy and performance
  • Secure model serialization with skops

Installation

pip install charboundary

Or install with NumPy support for faster processing:

pip install charboundary[numpy]

Quick Start

Using the Pre-trained Models

CharBoundary comes with three pre-trained models of different sizes:

  • Small - Fast with a small footprint (5 token context window, 64 trees)
  • Medium - Default, balanced performance (7 token context window, 128 trees)
  • Large - Most accurate but larger and slower (9 token context window, 512 trees)

from charboundary import get_default_segmenter

# Get the pre-trained medium-sized segmenter (default)
segmenter = get_default_segmenter()

# Segment text into sentences and paragraphs
text = "Hello, world! This is a test. This is another sentence."

# Get list of sentences
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ["Hello, world!", "This is a test.", "This is another sentence."]

# Get list of paragraphs
paragraphs = segmenter.segment_to_paragraphs(text)
print(paragraphs)

You can also choose a specific model size based on your needs:

from charboundary import get_small_segmenter, get_large_segmenter

# For faster processing with smaller memory footprint
small_segmenter = get_small_segmenter()

# For highest accuracy (but larger memory footprint)
large_segmenter = get_large_segmenter()

The models are optimized for handling:

  • Quotation marks in the middle or at the end of sentences
  • Common abbreviations (including legal abbreviations)
  • Legal citations (e.g., "Brown v. Board of Education, 347 U.S. 483 (1954)")
  • Multi-line quotes
  • Enumerated lists (partial support)

For example, the models correctly handle quotes in the middle of sentences:

text = 'Creditors may also typically invoke these laws to void "constructive" fraudulent transfers.'
sentences = segmenter.segment_to_sentences(text)
# Output: ['Creditors may also typically invoke these laws to void "constructive" fraudulent transfers.']

Training Your Own Model

from charboundary import TextSegmenter

# Create a segmenter (will be initialized with default parameters)
segmenter = TextSegmenter()

# Train the model on sample data
training_data = [
    "This is a sentence.<|sentence|> This is another sentence.<|sentence|><|paragraph|>",
    "This is a new paragraph.<|sentence|> It has multiple sentences.<|sentence|><|paragraph|>"
]
segmenter.train(data=training_data)

# Segment text into sentences and paragraphs
text = "Hello, world! This is a test. This is another sentence."
segmented_text = segmenter.segment_text(text)
print(segmented_text)
# Output: "Hello, world!<|sentence|> This is a test.<|sentence|> This is another sentence.<|sentence|>"

# Get list of sentences
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ["Hello, world!", "This is a test.", "This is another sentence."]
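The annotated training format above can also be generated programmatically from plain text. A minimal sketch (the `annotate` helper below is hypothetical, not part of the charboundary API; only the `<|sentence|>` and `<|paragraph|>` markers come from the library's documented format):

```python
SENT = "<|sentence|>"
PARA = "<|paragraph|>"

def annotate(paragraphs):
    """Join paragraphs (each a list of sentence strings) into the
    annotated training format: a sentence marker after every
    sentence and a paragraph marker at each paragraph's end."""
    lines = []
    for sentences in paragraphs:
        lines.append(" ".join(s + SENT for s in sentences) + PARA)
    return lines

data = annotate([
    ["This is a sentence.", "This is another sentence."],
    ["A new paragraph.", "It has two sentences."],
])
print(data[0])
# This is a sentence.<|sentence|> This is another sentence.<|sentence|><|paragraph|>
```

Generated lines can then be passed directly to `segmenter.train(data=...)` as shown above.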

Model Serialization with skops

CharBoundary uses skops for secure model serialization. This provides better security than pickle for sharing and loading models.

Saving Models

# Train a model
segmenter = TextSegmenter()
segmenter.train(data=training_data)

# Save the model with skops
segmenter.save("model.skops", format="skops")

Loading Models

# Load a model with security checks (default)
# This will reject loading custom types for security
segmenter = TextSegmenter.load("model.skops", use_skops=True)

# Load a model with trusted types enabled 
# Only use this with models from trusted sources
segmenter = TextSegmenter.load("model.skops", use_skops=True, trust_model=True)

Security Considerations

  • When loading models from untrusted sources, avoid setting trust_model=True
  • When loading fails with untrusted types, skops will list the untrusted types that need to be approved
  • The library will fall back to pickle if skops loading fails, but this is less secure

Configuration

Basic Configuration

You can customize the segmenter with various parameters:

from charboundary.segmenters import TextSegmenter, SegmenterConfig

config = SegmenterConfig(
    left_window=3,             # Size of left context window
    right_window=3,            # Size of right context window
    abbreviations=["Dr.", "Mr.", "Mrs.", "Ms."],  # Custom abbreviations
    model_type="random_forest",  # Type of model to use
    model_params={             # Parameters for the model
        "n_estimators": 100,
        "max_depth": 16,
        "class_weight": "balanced"
    },
    use_numpy=True,            # Use NumPy for faster processing
    cache_size=1024,           # Cache size for character encoding
    num_workers=4              # Number of worker processes
)

segmenter = TextSegmenter(config=config)
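To illustrate what `left_window` and `right_window` control, here is a rough, self-contained sketch of extracting a fixed-size character context around a candidate boundary. The padding scheme and the `context_window` helper are assumptions for illustration, not charboundary's actual implementation:

```python
def context_window(text, index, left=3, right=3, pad=" "):
    """Return the characters around text[index]: `left` characters
    before and `right` after, space-padded at the string edges so
    the window always has length left + 1 + right."""
    lo = index - left
    hi = index + right + 1
    return (pad * max(0, -lo)
            + text[max(0, lo):hi]
            + pad * max(0, hi - len(text)))

text = "Hello. World."
# Window around the first period (index 5)
print(context_window(text, 5))  # "llo. Wo"
```

The model sees one such window per candidate character; larger windows give more context at the cost of more features and slower inference.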

Advanced Features

The CharBoundary library includes sophisticated feature engineering tailored for text segmentation. These features help the model distinguish between actual sentence boundaries and other characters that may appear similar (like periods in abbreviations or quotes in the middle of sentences).

Key features include:

  1. Quotation Handling:

    • Quote balance tracking (detecting matched pairs of quotes)
    • Word completion detection for quotes
    • Multi-line quote recognition
  2. List and Enumeration Detection:

    • Recognition of enumerated list items ((1), (2), (a), (b), etc.)
    • Detection of list introductions (colons, phrases like "as follows:")
    • Special handling for semicolons in list structures
  3. Abbreviation Detection:

    • Comprehensive lists of common and domain-specific abbreviations
    • Legal abbreviations and citations
  4. Contextual Analysis:

    • Distinction between primary terminators (., !, ?) and secondary terminators (", ', :, ;)
    • Detection of lowercase letters following potential terminators
    • Analysis of surrounding context for sentence boundaries

These features enable the model to make intelligent decisions about text segmentation, particularly in hard cases such as legal documents, technical texts, and other documents with intricate structure.
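As a rough illustration of the quote-balance idea: a period can be suppressed as a sentence boundary when it falls inside an open quotation. This is a simplified sketch, not the library's actual feature code, which feeds such signals into a trained model rather than applying a hard rule:

```python
def inside_quote(text, index, quote='"'):
    """True if position `index` sits inside an open quotation,
    i.e. an odd number of straight quotes precede it."""
    return text[:index].count(quote) % 2 == 1

text = 'He said "wait." Then he left.'
# The period at index 13 is inside the quoted span.
print(inside_quote(text, 13))   # True
# The final period (index 28) follows a balanced pair of quotes.
print(inside_quote(text, 28))   # False
```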

Working with Abbreviations

# Get current abbreviations
abbrevs = segmenter.get_abbreviations()

# Add new abbreviations
segmenter.add_abbreviation("Ph.D.")

# Remove abbreviations
segmenter.remove_abbreviation("Dr.")

# Set a new list of abbreviations
segmenter.set_abbreviations(["Dr.", "Mr.", "Prof.", "Ph.D."])
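A simplified sketch of how an abbreviation list can veto false sentence boundaries (illustrative only; charboundary's model learns this from character features rather than applying a lookup rule):

```python
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Ph.D."}

def is_abbreviation_period(text, index):
    """True if the period at text[index] terminates a known
    abbreviation rather than a sentence."""
    start = text.rfind(" ", 0, index) + 1  # start of the word containing the period
    return text[start:index + 1] in ABBREVIATIONS

text = "Dr. Smith arrived."
print(is_abbreviation_period(text, 2))   # True  ("Dr." is an abbreviation)
print(is_abbreviation_period(text, 17))  # False (genuine sentence end)
```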

License

MIT

Download files

Download the file for your platform.

Source Distribution

charboundary-0.1.0.tar.gz (12.6 MB)

Uploaded Source

Built Distribution


charboundary-0.1.0-py3-none-any.whl (12.6 MB)

Uploaded Python 3

File details

Details for the file charboundary-0.1.0.tar.gz.

File metadata

  • Download URL: charboundary-0.1.0.tar.gz
  • Upload date:
  • Size: 12.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.5.23

File hashes

Hashes for charboundary-0.1.0.tar.gz
Algorithm Hash digest
SHA256 9d709a7f9c1adf5deb47672e65db537957cc32d681bdcf6b22d6a394bb1f6596
MD5 46143c83f7e7a13bef5897d6ef846abe
BLAKE2b-256 35d13f3b193e31f990f6f144597bb9ac2e2bf3e3d2f693325ec89628506b1b7c


File details

Details for the file charboundary-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for charboundary-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 fcfa9e96c4ad8a4eb78b937be33cbada3589d8de282fb41eec80a8358c769a68
MD5 7419d5712840835f66f431ce79a305cb
BLAKE2b-256 07adfb39db2a70e80ff55a0d0ab2d51aed9754dcba0cd1a7b6bac5e779792621

