
Fast character-based boundary detection for sentences and paragraphs

Project description

CharBoundary

A modular library for segmenting text into sentences and paragraphs based on character-level features.

Features

  • Character-level text segmentation
  • Support for sentence and paragraph boundaries
  • Customizable window sizes for context
  • Support for abbreviations
  • Optimized for both accuracy and performance
  • Secure model serialization with skops

Installation

pip install charboundary

Or install with NumPy support for faster processing:

pip install charboundary[numpy]

Quick Start

Using the Pre-trained Models

CharBoundary comes with three pre-trained models of different sizes:

  • Small - Fast with a small footprint (5 token context window, 64 trees)
  • Medium - Default, balanced performance (7 token context window, 128 trees)
  • Large - Most accurate but larger and slower (9 token context window, 512 trees)

from charboundary import get_default_segmenter

# Get the pre-trained medium-sized segmenter (default)
segmenter = get_default_segmenter()

# Segment text into sentences and paragraphs
text = "Hello, world! This is a test. This is another sentence."

# Get list of sentences
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ["Hello, world!", "This is a test.", "This is another sentence."]

# Get list of paragraphs
paragraphs = segmenter.segment_to_paragraphs(text)
print(paragraphs)

You can also choose a specific model size based on your needs:

from charboundary import get_small_segmenter, get_large_segmenter

# For faster processing with smaller memory footprint
small_segmenter = get_small_segmenter()

# For highest accuracy (but larger memory footprint)
large_segmenter = get_large_segmenter()

The models are optimized for handling:

  • Quotation marks in the middle or at the end of sentences
  • Common abbreviations (including legal abbreviations)
  • Legal citations (e.g., "Brown v. Board of Education, 347 U.S. 483 (1954)")
  • Multi-line quotes
  • Enumerated lists (partial support)

For example, the models correctly handle quotes in the middle of sentences:

text = 'Creditors may also typically invoke these laws to void "constructive" fraudulent transfers.'
sentences = segmenter.segment_to_sentences(text)
# Output: ['Creditors may also typically invoke these laws to void "constructive" fraudulent transfers.']

Training Your Own Model

from charboundary import TextSegmenter

# Create a segmenter (will be initialized with default parameters)
segmenter = TextSegmenter()

# Train the model on sample data
training_data = [
    "This is a sentence.<|sentence|> This is another sentence.<|sentence|><|paragraph|>",
    "This is a new paragraph.<|sentence|> It has multiple sentences.<|sentence|><|paragraph|>"
]
segmenter.train(data=training_data)

# Segment text into sentences and paragraphs
text = "Hello, world! This is a test. This is another sentence."
segmented_text = segmenter.segment_text(text)
print(segmented_text)
# Output: "Hello, world!<|sentence|> This is a test.<|sentence|> This is another sentence.<|sentence|>"

# Get list of sentences
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ["Hello, world!", "This is a test.", "This is another sentence."]
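The annotated training format above can also be generated programmatically from plain text. A minimal sketch (the `annotate` helper below is hypothetical, not part of the charboundary API; only the `<|sentence|>` and `<|paragraph|>` markers come from the library's documented format):

```python
SENT = "<|sentence|>"
PARA = "<|paragraph|>"

def annotate(paragraphs):
    """Join paragraphs (each a list of sentence strings) into the
    annotated training format: a sentence marker after every
    sentence and a paragraph marker at each paragraph's end."""
    lines = []
    for sentences in paragraphs:
        lines.append(" ".join(s + SENT for s in sentences) + PARA)
    return lines

data = annotate([
    ["This is a sentence.", "This is another sentence."],
    ["A new paragraph.", "It has two sentences."],
])
print(data[0])
# This is a sentence.<|sentence|> This is another sentence.<|sentence|><|paragraph|>
```

Generated lines can then be passed directly to `segmenter.train(data=...)` as shown above.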

Model Serialization with skops

CharBoundary uses skops for secure model serialization. This provides better security than pickle for sharing and loading models.

Saving Models

# Train a model
segmenter = TextSegmenter()
segmenter.train(data=training_data)

# Save the model with skops
segmenter.save("model.skops", format="skops")

Loading Models

# Load a model with security checks (default)
# This will reject loading custom types for security
segmenter = TextSegmenter.load("model.skops", use_skops=True)

# Load a model with trusted types enabled 
# Only use this with models from trusted sources
segmenter = TextSegmenter.load("model.skops", use_skops=True, trust_model=True)

Security Considerations

  • When loading models from untrusted sources, avoid setting trust_model=True
  • When loading fails with untrusted types, skops will list the untrusted types that need to be approved
  • The library will fall back to pickle if skops loading fails, but this is less secure

Configuration

Basic Configuration

You can customize the segmenter with various parameters:

from charboundary.segmenters import TextSegmenter, SegmenterConfig

config = SegmenterConfig(
    left_window=3,             # Size of left context window
    right_window=3,            # Size of right context window
    abbreviations=["Dr.", "Mr.", "Mrs.", "Ms."],  # Custom abbreviations
    model_type="random_forest",  # Type of model to use
    model_params={             # Parameters for the model
        "n_estimators": 100,
        "max_depth": 16,
        "class_weight": "balanced"
    },
    use_numpy=True,            # Use NumPy for faster processing
    cache_size=1024,           # Cache size for character encoding
    num_workers=4              # Number of worker processes
)

segmenter = TextSegmenter(config=config)
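To illustrate what `left_window` and `right_window` control, here is a rough, self-contained sketch of extracting a fixed-size character context around a candidate boundary. The padding scheme and the `context_window` helper are assumptions for illustration, not charboundary's actual implementation:

```python
def context_window(text, index, left=3, right=3, pad=" "):
    """Return the characters around text[index]: `left` characters
    before and `right` after, space-padded at the string edges so
    the window always has length left + 1 + right."""
    lo = index - left
    hi = index + right + 1
    return (pad * max(0, -lo)
            + text[max(0, lo):hi]
            + pad * max(0, hi - len(text)))

text = "Hello. World."
# Window around the first period (index 5)
print(context_window(text, 5))  # "llo. Wo"
```

The model sees one such window per candidate character; larger windows give more context at the cost of more features and slower inference.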

Advanced Features

The CharBoundary library includes sophisticated feature engineering tailored for text segmentation. These features help the model distinguish between actual sentence boundaries and other characters that may appear similar (like periods in abbreviations or quotes in the middle of sentences).

Key features include:

  1. Quotation Handling:

    • Quote balance tracking (detecting matched pairs of quotes)
    • Word completion detection for quotes
    • Multi-line quote recognition
  2. List and Enumeration Detection:

    • Recognition of enumerated list items ((1), (2), (a), (b), etc.)
    • Detection of list introductions (colons, phrases like "as follows:")
    • Special handling for semicolons in list structures
  3. Abbreviation Detection:

    • Comprehensive lists of common and domain-specific abbreviations
    • Legal abbreviations and citations
  4. Contextual Analysis:

    • Distinction between primary terminators (., !, ?) and secondary terminators (", ', :, ;)
    • Detection of lowercase letters following potential terminators
    • Analysis of surrounding context for sentence boundaries

These features enable the model to make intelligent decisions about text segmentation, particularly in hard cases such as legal documents, technical texts, and other documents with intricate structure.
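As a rough illustration of the quote-balance idea: a period can be suppressed as a sentence boundary when it falls inside an open quotation. This is a simplified sketch, not the library's actual feature code, which feeds such signals into a trained model rather than applying a hard rule:

```python
def inside_quote(text, index, quote='"'):
    """True if position `index` sits inside an open quotation,
    i.e. an odd number of straight quotes precede it."""
    return text[:index].count(quote) % 2 == 1

text = 'He said "wait." Then he left.'
# The period at index 13 is inside the quoted span.
print(inside_quote(text, 13))   # True
# The final period (index 28) follows a balanced pair of quotes.
print(inside_quote(text, 28))   # False
```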

Working with Abbreviations

# Get current abbreviations
abbrevs = segmenter.get_abbreviations()

# Add new abbreviations
segmenter.add_abbreviation("Ph.D.")

# Remove abbreviations
segmenter.remove_abbreviation("Dr.")

# Set a new list of abbreviations
segmenter.set_abbreviations(["Dr.", "Mr.", "Prof.", "Ph.D."])
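A simplified sketch of how an abbreviation list can veto false sentence boundaries (illustrative only; charboundary's model learns this from character features rather than applying a lookup rule):

```python
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Ph.D."}

def is_abbreviation_period(text, index):
    """True if the period at text[index] terminates a known
    abbreviation rather than a sentence."""
    start = text.rfind(" ", 0, index) + 1  # start of the word containing the period
    return text[start:index + 1] in ABBREVIATIONS

text = "Dr. Smith arrived."
print(is_abbreviation_period(text, 2))   # True  ("Dr." is an abbreviation)
print(is_abbreviation_period(text, 17))  # False (genuine sentence end)
```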

License

MIT

Download files

Download the file for your platform.

Source Distribution

charboundary-0.1.0.tar.gz (12.6 MB)

Uploaded Source

Built Distribution


charboundary-0.1.0-py3-none-any.whl (12.6 MB)

Uploaded Python 3

File details

Details for the file charboundary-0.1.0.tar.gz.

File metadata

  • Download URL: charboundary-0.1.0.tar.gz
  • Upload date:
  • Size: 12.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.5.23

File hashes

Hashes for charboundary-0.1.0.tar.gz
Algorithm Hash digest
SHA256 9d709a7f9c1adf5deb47672e65db537957cc32d681bdcf6b22d6a394bb1f6596
MD5 46143c83f7e7a13bef5897d6ef846abe
BLAKE2b-256 35d13f3b193e31f990f6f144597bb9ac2e2bf3e3d2f693325ec89628506b1b7c


File details

Details for the file charboundary-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for charboundary-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 fcfa9e96c4ad8a4eb78b937be33cbada3589d8de282fb41eec80a8358c769a68
MD5 7419d5712840835f66f431ce79a305cb
BLAKE2b-256 07adfb39db2a70e80ff55a0d0ab2d51aed9754dcba0cd1a7b6bac5e779792621

