Defines standardized interfaces and data structures for text processing plugins, enabling modular NLP operations like sentence splitting, tokenization, and chunking within the cjm-plugin-system ecosystem.

cjm-text-plugin-system

Install

pip install cjm_text_plugin_system

Project Structure

nbs/
├── core.ipynb             # DTOs for text processing with character-level span tracking
├── plugin_interface.ipynb # Domain-specific plugin interface for text processing operations
└── storage.ipynb          # Standardized SQLite storage for text processing results with content hashing

Total: 3 notebooks

Module Dependencies

graph LR
    core["core<br/>Core Data Structures"]
    plugin_interface["plugin_interface<br/>Text Processing Plugin Interface"]
    storage["storage<br/>Text Processing Storage"]

    plugin_interface --> core

1 cross-module dependency detected

CLI Reference

No CLI commands found in this project.

Module Overview

Detailed documentation for each module in the project:

Core Data Structures (core.ipynb)

DTOs for text processing with character-level span tracking

Import

from cjm_text_plugin_system.core import (
    TextSpan,
    TextProcessResult
)

Classes

@dataclass
class TextSpan:
    "Represents a segment of text with its original character coordinates."
    
    text: str  # The text content of this span
    start_char: int  # 0-indexed start position in original string
    end_char: int  # 0-indexed end position (exclusive)
    label: str = 'sentence'  # Span type: 'sentence', 'token', 'paragraph', etc.
    metadata: Dict[str, Any] = field(...)  # Additional span metadata
    
    def to_dict(self) -> Dict[str, Any]:  # Dictionary representation
        "Convert span to dictionary for serialization."
@dataclass
class TextProcessResult:
    "Container for text processing results."
    
    spans: List[TextSpan]  # List of text spans from processing
    metadata: Dict[str, Any] = field(...)  # Processing metadata
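
As a rough illustration of the span convention described above (0-indexed start, exclusive end), the sketch below uses local stand-in dataclasses that mirror the documented fields. The `default_factory=dict` defaults and the `asdict`-based `to_dict` are assumptions; the real classes in `cjm_text_plugin_system.core` may differ.

```python
from dataclasses import dataclass, field, asdict
from typing import Any, Dict, List


# Local stand-ins mirroring the documented fields (not the package's own code).
@dataclass
class TextSpan:
    text: str
    start_char: int  # 0-indexed start position in the original string
    end_char: int  # 0-indexed end position (exclusive)
    label: str = "sentence"
    metadata: Dict[str, Any] = field(default_factory=dict)  # assumed default

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)  # assumed behavior


@dataclass
class TextProcessResult:
    spans: List[TextSpan]
    metadata: Dict[str, Any] = field(default_factory=dict)  # assumed default


source = "Hello world. How are you?"
spans = [
    TextSpan(text="Hello world.", start_char=0, end_char=12),
    TextSpan(text="How are you?", start_char=13, end_char=25),
]
result = TextProcessResult(spans=spans, metadata={"processor": "example"})

# With an exclusive end offset, slicing the source recovers each span's text.
for span in result.spans:
    assert source[span.start_char:span.end_char] == span.text
```

Because the end offset is exclusive, `end_char - start_char` equals the length of the span text, and adjacent spans can be checked against the original string with a plain slice.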

Text Processing Plugin Interface (plugin_interface.ipynb)

Domain-specific plugin interface for text processing operations

Import

from cjm_text_plugin_system.plugin_interface import (
    TextProcessingPlugin
)

Classes

class TextProcessingPlugin(PluginInterface):
    """
    Abstract base class for plugins that perform NLP operations.
    
    Extends PluginInterface with text processing requirements:
    - `execute`: Dispatch method for different text operations
    - `split_sentences`: Split text into sentence spans with character positions
    """
    
    def execute(
            self,
            action: str = "split_sentences",  # Operation to perform: 'split_sentences', 'tokenize', etc.
            **kwargs
        ) -> Dict[str, Any]:  # JSON-serializable result
        "Execute a text processing operation."
    
    def split_sentences(
            self,
            text: str,  # Input text to split
            **kwargs
        ) -> TextProcessResult:  # Result with TextSpan objects containing character indices
        "Split text into sentence spans with accurate character positions."
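
To make the interface above more concrete, here is a minimal sketch of what a plugin might look like. `RegexSentenceSplitter` is a hypothetical name, the regex is a deliberately naive splitter, and the class does not subclass the real `TextProcessingPlugin` (whose base `PluginInterface` requirements are not shown here). The dataclasses are local stand-ins with assumed defaults.

```python
import re
from dataclasses import dataclass, field, asdict
from typing import Any, Dict, List


@dataclass
class TextSpan:  # stand-in for cjm_text_plugin_system.core.TextSpan
    text: str
    start_char: int
    end_char: int
    label: str = "sentence"
    metadata: Dict[str, Any] = field(default_factory=dict)  # assumed default


@dataclass
class TextProcessResult:  # stand-in for the documented result container
    spans: List[TextSpan]
    metadata: Dict[str, Any] = field(default_factory=dict)  # assumed default


class RegexSentenceSplitter:
    """Hypothetical plugin; a real one would subclass TextProcessingPlugin."""

    # A run of non-terminator characters followed by optional terminators.
    _SENTENCE = re.compile(r"[^.!?]+[.!?]*")

    def split_sentences(self, text: str, **kwargs) -> TextProcessResult:
        spans = []
        for match in self._SENTENCE.finditer(text):
            raw = match.group()
            stripped = raw.strip()
            if not stripped:
                continue
            # Shift the start past leading whitespace so offsets stay exact.
            start = match.start() + (len(raw) - len(raw.lstrip()))
            spans.append(TextSpan(stripped, start, start + len(stripped)))
        return TextProcessResult(spans=spans, metadata={"splitter": "regex"})

    def execute(self, action: str = "split_sentences", **kwargs) -> Dict[str, Any]:
        # Dispatch on the action name and return a JSON-serializable dict.
        if action == "split_sentences":
            result = self.split_sentences(**kwargs)
            return {
                "spans": [asdict(s) for s in result.spans],
                "metadata": result.metadata,
            }
        raise ValueError(f"Unsupported action: {action}")


text = "Hello world. How are you? Fine!"
output = RegexSentenceSplitter().execute(action="split_sentences", text=text)
for span in output["spans"]:
    print(span["start_char"], span["end_char"], span["text"])
```

The offsets are computed from the regex match positions rather than by searching for each sentence afterwards, so repeated sentences in the input still map to the correct character ranges.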

Text Processing Storage (storage.ipynb)

Standardized SQLite storage for text processing results with content hashing

Import

from cjm_text_plugin_system.storage import (
    TextProcessRow,
    TextProcessStorage
)

Classes

@dataclass
class TextProcessRow:
    "A single row from the text_jobs table."
    
    job_id: str  # Unique job identifier
    input_text: str  # Original input text
    input_hash: str  # Hash of input text in "algo:hexdigest" format
    config_hash: str  # Hash of the effective processing config used
    spans: Optional[List[Dict[str, Any]]]  # Processed text spans
    metadata: Optional[Dict[str, Any]]  # Processing metadata
    created_at: Optional[float]  # Unix timestamp
class TextProcessStorage:
    "Standardized SQLite storage for text processing results."
    
    def __init__(
            self,
            db_path: str  # Absolute path to the SQLite database file
        ):
        "Initialize storage, create table, run migrations, and build indexes."
    
    def save(
            self,
            job_id: str,       # Unique job identifier
            input_text: str,   # Original input text
            input_hash: str,   # Hash of input text in "algo:hexdigest" format
            config_hash: str,  # Hash of the effective processing config
            spans: Optional[List[Dict[str, Any]]] = None,  # Processed text spans
            metadata: Optional[Dict[str, Any]] = None       # Processing metadata
        ) -> None:
        "Save or replace a text processing result (upsert by input_hash + config_hash)."
    
    def save_with_logging(
            self,
            *,
            job_id: str,       # Unique job identifier
            input_text: str,   # Original input text
            input_hash: str,   # Hash of input text in "algo:hexdigest" format
            config_hash: str,  # Hash of the effective processing config
            spans: Optional[List[Dict[str, Any]]] = None,  # Processed text spans
            metadata: Optional[Dict[str, Any]] = None,      # Processing metadata
            logger: Optional[logging.Logger] = None          # Optional logger for success/failure messages
        ) -> bool:  # True if saved; False if the save failed (error logged, not raised)
        """Save a result, logging success or failure.

        Failures are logged and swallowed (returns False). Centralizes the
        try/save/log/except block that text-processing plugins otherwise reimplement
        (e.g. NLTK's manual wrap). Returns True on success so callers can gate
        post-save side effects on the result.
        """
    
    def get_cached(
            self,
            input_hash: str,   # Content hash of the input text (the input identity)
            config_hash: str   # Hash of the effective processing config
        ) -> Optional[TextProcessRow]:  # Cached row or None
        """Retrieve a cached text processing result by input_hash + config_hash.

        Content-correct by construction: text is passed by value, so input_hash
        identifies the exact input. Different text or a different config is a cache miss.
        """
    
    def get_by_job_id(
            self,
            job_id: str  # Job identifier to look up
        ) -> Optional[TextProcessRow]:  # Row or None if not found
        "Retrieve a text processing result by job ID."
    
    def list_jobs(
            self,
            limit: int = 100  # Maximum number of rows to return
        ) -> List[TextProcessRow]:  # List of text processing rows
        "List text processing jobs ordered by creation time (newest first)."
    
    def verify_input(
            self,
            job_id: str  # Job identifier to verify
        ) -> Optional[bool]:  # True if input matches, False if changed, None if not found
        "Verify the stored input text still matches its hash."
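
The caching behavior described above (upsert and lookup keyed on `input_hash` + `config_hash`, with hashes in "algo:hexdigest" format) can be sketched with the standard library alone. Everything below is illustrative: `hash_text` and `hash_config` are hypothetical helpers, the SQL schema and the `INSERT OR REPLACE` upsert are assumptions, and only the `text_jobs` table name and the column names come from the documentation. It is not the package's actual implementation.

```python
import hashlib
import json
import sqlite3
import time
from typing import Any, Dict, List, Optional


def hash_text(text: str, algo: str = "sha256") -> str:
    """Hypothetical helper: hash text into the "algo:hexdigest" format."""
    return f"{algo}:{hashlib.new(algo, text.encode('utf-8')).hexdigest()}"


def hash_config(config: Dict[str, Any]) -> str:
    """Hypothetical helper: hash a config via its key-sorted JSON form."""
    return hash_text(json.dumps(config, sort_keys=True))


conn = sqlite3.connect(":memory:")
# Assumed schema; the uniqueness constraint is what makes the upsert work.
conn.execute(
    """
    CREATE TABLE text_jobs (
        job_id TEXT, input_text TEXT, input_hash TEXT, config_hash TEXT,
        spans TEXT, metadata TEXT, created_at REAL,
        UNIQUE (input_hash, config_hash)
    )
    """
)


def save(job_id: str, input_text: str, config: Dict[str, Any],
         spans: List[Dict[str, Any]]) -> None:
    # A row with the same (input_hash, config_hash) is replaced, not duplicated.
    conn.execute(
        "INSERT OR REPLACE INTO text_jobs VALUES (?, ?, ?, ?, ?, ?, ?)",
        (job_id, input_text, hash_text(input_text), hash_config(config),
         json.dumps(spans), json.dumps({}), time.time()),
    )


def get_cached(input_text: str, config: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    row = conn.execute(
        "SELECT job_id, spans FROM text_jobs "
        "WHERE input_hash = ? AND config_hash = ?",
        (hash_text(input_text), hash_config(config)),
    ).fetchone()
    return None if row is None else {"job_id": row[0], "spans": json.loads(row[1])}


def verify_input(job_id: str) -> Optional[bool]:
    row = conn.execute(
        "SELECT input_text, input_hash FROM text_jobs WHERE job_id = ?", (job_id,)
    ).fetchone()
    if row is None:
        return None  # unknown job
    algo = row[1].split(":", 1)[0]  # re-hash with the algorithm named in the prefix
    return hash_text(row[0], algo) == row[1]


text = "Hello world. How are you?"
config = {"language": "english"}
spans = [{"text": "Hello world.", "start_char": 0, "end_char": 12}]

save("job-1", text, config, spans)
save("job-2", text, config, spans)  # same key, so this replaces job-1

hit = get_cached(text, config)
miss = get_cached(text, {"language": "german"})
row_count = conn.execute("SELECT COUNT(*) FROM text_jobs").fetchone()[0]
```

Keeping the algorithm name in the hash prefix lets a verification step re-hash the stored text with the same algorithm that produced the original digest, even if the default algorithm changes later.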

