A local, NLTK-based text processing worker for the cjm-plugin-system that provides sentence splitting and tokenization with character-level span tracking.

cjm-text-plugin-nltk

Install

pip install cjm_text_plugin_nltk

Project Structure

nbs/
├── meta.ipynb   # Metadata introspection for the NLTK text plugin used by cjm-ctl to generate the registration manifest.
└── plugin.ipynb # Plugin implementation for NLTK-based text processing with character-level span tracking

Total: 2 notebooks

Module Dependencies

graph LR
    meta["meta<br/>Metadata"]
    plugin["plugin<br/>NLTK Plugin"]

    plugin --> meta

1 cross-module dependency detected

CLI Reference

No CLI commands found in this project.

Module Overview

Detailed documentation for each module in the project:

Metadata (meta.ipynb)

Metadata introspection for the NLTK text plugin used by cjm-ctl to generate the registration manifest.

Import

from cjm_text_plugin_nltk.meta import (
    get_plugin_metadata
)

Functions

def get_plugin_metadata() -> Dict[str, Any]:  # Plugin metadata for manifest generation
    """Return metadata required to register this plugin with the PluginManager."""
    # Fallback base path (current behavior for backward compatibility)
    base_path = os.path.dirname(os.path.dirname(sys.executable))

    # Use CJM config if available, else fall back to env-relative paths
    cjm_data_dir = os.environ.get("CJM_DATA_DIR")

    # Plugin data directory
    plugin_name = "cjm-text-plugin-nltk"
    if cjm_data_dir:
        ...

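The listing above stops before showing how the plugin data directory is built, so the following is only a sketch of the "use `CJM_DATA_DIR` if set, otherwise fall back to an environment-relative path" pattern described in its comments. The function name `resolve_data_dir`, the `env` parameter, and both directory layouts are hypothetical and are not taken from the package.

```python
import os
import sys
from typing import Mapping, Optional

PLUGIN_NAME = "cjm-text-plugin-nltk"


def resolve_data_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a data directory for the plugin (hypothetical helper, not the package's code)."""
    env = os.environ if env is None else env

    # Same fallback base as in the listing: two levels above the interpreter binary.
    base_path = os.path.dirname(os.path.dirname(sys.executable))

    cjm_data_dir = env.get("CJM_DATA_DIR")
    if cjm_data_dir:
        # ASSUMPTION: one sub-directory per plugin under the shared data dir.
        return os.path.join(cjm_data_dir, PLUGIN_NAME)

    # ASSUMPTION: a "data" folder under the environment prefix.
    return os.path.join(base_path, "data")


configured = resolve_data_dir({"CJM_DATA_DIR": "/tmp/cjm"})
fallback = resolve_data_dir({})
```

Passing the environment as a mapping keeps the helper easy to exercise without touching the real process environment.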
NLTK Plugin (plugin.ipynb)

Plugin implementation for NLTK-based text processing with character-level span tracking
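As an illustration of what character-level span tracking means, the sketch below aligns already-split sentences back to offsets in the original string. It is not the plugin's implementation: the `Span` class is a hypothetical stand-in for `TextSpan` (its fields are assumed), and a toy regex splitter replaces NLTK's Punkt tokenizer so the example runs with the standard library alone.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical stand-in for the plugin's TextSpan; field names are assumed."""
    text: str
    start: int  # inclusive character offset into the original text
    end: int    # exclusive character offset into the original text


def naive_sentences(text: str) -> List[str]:
    """Toy splitter: break on whitespace that follows '.', '!' or '?' (not Punkt)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def align_spans(text: str, sentences: List[str]) -> List[Span]:
    """Locate each sentence in `text`, searching forward from the previous match."""
    spans: List[Span] = []
    cursor = 0
    for sentence in sentences:
        start = text.find(sentence, cursor)
        if start == -1:
            raise ValueError(f"sentence not found in source text: {sentence!r}")
        end = start + len(sentence)
        spans.append(Span(sentence, start, end))
        cursor = end  # never match earlier than the previous sentence's end
    return spans


text = "Hello world.  How are you?\nFine, thanks!"
spans = align_spans(text, naive_sentences(text))
```

Each span satisfies `text[span.start:span.end] == span.text`, so callers can map results back onto the untouched input even when the whitespace between sentences varies.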

Import

from cjm_text_plugin_nltk.plugin import (
    NLTKPluginConfig,
    NLTKPlugin
)

Classes

@dataclass
class NLTKPluginConfig:
    "Configuration for NLTK text processing plugin."
    
    tokenizer: str = field(...)
    language: str = field(...)
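
The `field(...)` arguments are elided in the listing above, so the following sketch only illustrates the general idea of deriving a JSON Schema from a configuration dataclass. The defaults (`"punkt"`, `"english"`), the `description` metadata key, and the helper `sketch_schema` are assumptions for illustration; the real `dataclass_to_jsonschema` helper may produce a different shape.

```python
from dataclasses import MISSING, dataclass, field, fields
from typing import Any, Dict


@dataclass
class SketchConfig:
    """Illustrative config; defaults and metadata are ASSUMED, not the plugin's."""
    tokenizer: str = field(default="punkt", metadata={"description": "Tokenizer name"})
    language: str = field(default="english", metadata={"description": "Language"})


# Minimal mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def sketch_schema(cls: type) -> Dict[str, Any]:
    """Build a small JSON Schema dict from a dataclass (hypothetical helper)."""
    properties: Dict[str, Any] = {}
    for f in fields(cls):
        prop: Dict[str, Any] = {"type": _JSON_TYPES.get(f.type, "string")}
        prop.update(f.metadata)  # copy UI hints such as "description"
        if f.default is not MISSING:
            prop["default"] = f.default
        properties[f.name] = prop
    return {"title": cls.__name__, "type": "object", "properties": properties}


schema = sketch_schema(SketchConfig)
```

Keeping UI hints in `field(metadata=...)` lets a single dataclass serve both as the runtime configuration object and as the source for a generated settings form.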

class NLTKPlugin:
    "NLTK-based text processing plugin with character-level span tracking."

    def __init__(self):
        "Initialize the NLTK plugin."
        self.logger = logging.getLogger(f"{__name__}.{type(self).__name__}")
        self.config: NLTKPluginConfig = None

    def name(self) -> str:  # Plugin name identifier
        "Get the plugin name identifier."
        return "nltk_text"

    @property
    def version(self) -> str:  # Plugin version string
        "Get the plugin version string."
        from cjm_text_plugin_nltk import __version__
        return __version__

    def get_current_config(self) -> Dict[str, Any]:  # Current configuration as dictionary
        "Return current configuration state."
        if not self.config:
            ...

    def get_config_schema(self) -> Dict[str, Any]:  # JSON Schema for configuration
        "Return JSON Schema for UI generation."
        return dataclass_to_jsonschema(NLTKPluginConfig)

    @staticmethod
    def get_config_dataclass() -> NLTKPluginConfig:  # Configuration dataclass
        "Return dataclass describing the plugin's configuration options."
        return NLTKPluginConfig

    def _ensure_nltk_data(self) -> None:
        ...

    def initialize(
        self,
        config: Optional[Any] = None  # Configuration dataclass, dict, or None
    ) -> None:
        """First-time setup.

        CR-4: kept fast (no slow work) so the substrate's load / initialize
        timeout isn't exceeded. The NLTK data download is deferred to the first
        _get_tokenizer() call; a cold-cache punkt/punkt_tab download takes
        several seconds, which previously timed out the load when run here.
        The manual language diff-and-reload is replaced by the declarative
        RELOAD_TRIGGER on `language`; the substrate's reconfigure path fires
        _release_tokenizer, then re-applies config.
        """

    def execute(
        self,
        action: str = "split_sentences",  # Operation: 'split_sentences'
        **kwargs
    ) -> Dict[str, Any]:  # JSON-serializable result
        "Dispatch to the `@plugin_action`-tagged handler for `action` (SG-44)."

    def split_sentences(
        self,
        text: str,  # Input text to split into sentences
        **kwargs
    ) -> TextProcessResult:  # Result with TextSpan objects containing character indices
        "Split text into sentence spans with accurate character positions."

    def cleanup(self) -> None:
        "Release resources on unload."

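The `initialize` and `execute` docstrings describe two lifecycle ideas: keep initialization fast by deferring the slow tokenizer load to first use, and dispatch actions only to explicitly tagged handlers. The sketch below shows both with a stand-in class. `sketch_action`, `SketchPlugin`, `reconfigure`, and `load_count` are hypothetical names; the real `@plugin_action` decorator, the `RELOAD_TRIGGER` mechanism, and the substrate's reconfigure path belong to cjm-plugin-system and are not reproduced here.

```python
from typing import Any, Callable, Dict, List, Optional


def sketch_action(fn: Callable) -> Callable:
    """Hypothetical stand-in for @plugin_action: tag a method as dispatchable."""
    fn._is_action = True
    return fn


class SketchPlugin:
    """Illustrative stand-in; not the NLTKPlugin implementation."""

    def __init__(self) -> None:
        self.language: Optional[str] = None
        self._tokenizer: Optional[Callable[[str], List[str]]] = None
        self.load_count = 0  # counts how often the "slow" load ran

    def initialize(self, config: Optional[Dict[str, Any]] = None) -> None:
        # Fast path only: record the config, do no slow work here.
        self.language = (config or {}).get("language", "english")

    def _get_tokenizer(self) -> Callable[[str], List[str]]:
        if self._tokenizer is None:
            # Stands in for the deferred data download / tokenizer load.
            self.load_count += 1
            self._tokenizer = lambda text: [p.strip() for p in text.split(".") if p.strip()]
        return self._tokenizer

    def _release_tokenizer(self) -> None:
        self._tokenizer = None

    def reconfigure(self, config: Dict[str, Any]) -> None:
        # ASSUMPTION: mimics a reload trigger on `language` by dropping the
        # cached tokenizer whenever the language changes.
        new_language = config.get("language", self.language)
        if new_language != self.language:
            self._release_tokenizer()
        self.language = new_language

    def execute(self, action: str = "split_sentences", **kwargs) -> Dict[str, Any]:
        handler = getattr(self, action, None)
        # Only methods tagged by the decorator may be called by name.
        if not callable(handler) or not getattr(handler, "_is_action", False):
            raise ValueError(f"unknown action: {action!r}")
        return handler(**kwargs)

    @sketch_action
    def split_sentences(self, text: str, **kwargs) -> Dict[str, Any]:
        return {"sentences": self._get_tokenizer()(text)}


plugin = SketchPlugin()
plugin.initialize({"language": "english"})
loads_after_init = plugin.load_count
result = plugin.execute("split_sentences", text="One. Two.")
```

Checking the tag in `execute` means untagged methods such as `initialize` cannot be invoked through the action name, even though they are callable attributes of the object.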

