
wtpsplit with minimal dependencies

Open in Dev Containers Open in GitHub Codespaces

✂️ wtpsplit-lite

🪓 wtpsplit is a Python package that offers training, inference, and evaluation of state-of-the-art Segment any Text (SaT) models for partitioning text into sentences.

✂️ wtpsplit-lite is a lightweight version of wtpsplit that only retains accelerated ONNX inference of SaT models with minimal dependencies:

  1. huggingface-hub to download the model
  2. numpy to process the model input and output
  3. onnxruntime to run the model
  4. tokenizers to tokenize the text for the model

Installing

To install this package, run:

pip install wtpsplit-lite
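
To confirm that the installation pulled in everything, a small check along the following lines can help. It is only a sketch: the import names `wtpsplit_lite`, `huggingface_hub`, `numpy`, `onnxruntime`, and `tokenizers` are assumed to match the packages listed above, and `check_modules` is a hypothetical helper, not part of wtpsplit-lite.

```python
from importlib.util import find_spec

# Assumed import names for the package and its four runtime dependencies.
MODULES = ("wtpsplit_lite", "huggingface_hub", "numpy", "onnxruntime", "tokenizers")


def check_modules(modules=MODULES):
    """Map each module name to True if Python can locate it, else False."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

The check only locates the modules without importing them, so it stays fast and does not load the ONNX runtime.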

Using

[!TIP] For a complete list of Segment any Text (SaT) models and all SaT.split keyword arguments, see the wtpsplit README.

Example usage:

from wtpsplit_lite import SaT

text = """
It is known that Maxwell’s electrodynamics—as usually understood at the
present time—when applied to moving bodies, leads to asymmetries which do
not appear to be inherent in the phenomena. Take, for example, the recipro-
cal electrodynamic action of a magnet and a conductor.
"""

# Fast (~150ms/page), good quality:
sat = SaT("sat-3l-sm")
sentences = sat.split(text, stride=128, block_size=256)

# Slow, highest quality:
sat = SaT("sat-12l-sm")
sentences = sat.split(text)
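
To build some intuition for the stride and block_size arguments, the sketch below lays out overlapping blocks over a token sequence. It is an illustration under assumptions, not the package's code: `block_spans` is a hypothetical helper, and the way the last block is aligned to the end of the input is a guess at one reasonable layout.

```python
def block_spans(num_tokens, block_size=256, stride=128):
    """Return (start, end) token spans of overlapping blocks covering the input."""
    if num_tokens <= block_size:
        return [(0, num_tokens)]
    starts = list(range(0, num_tokens - block_size + 1, stride))
    # Add a final block flush with the end so no trailing tokens are skipped.
    if starts[-1] + block_size < num_tokens:
        starts.append(num_tokens - block_size)
    return [(start, start + block_size) for start in starts]


print(block_spans(600))
# [(0, 256), (128, 384), (256, 512), (344, 600)]
```

In this layout a larger stride yields fewer blocks for the same input, so fewer model runs are needed, at the cost of less overlap between neighbouring blocks.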

This package also contributes a new 'hat' weighting scheme to wtpsplit that improves output quality when using large strides. To enable it, set weighting="hat" as follows:

# Fast (~150ms/page), better quality:
sat = SaT("sat-3l-sm")
sentences = sat.split(text, stride=128, block_size=256, weighting="hat")
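
As a rough illustration of why a hat-shaped weighting can help, the sketch below averages the predictions of overlapping blocks either uniformly or with triangular weights that favour the middle of each block. The functions `hat_weights` and `combine` are hypothetical, the triangular shape is an assumption, and the package's actual weighting function may differ.

```python
def hat_weights(block_size):
    """Triangular weights: smallest at the block edges, largest in the middle."""
    center = (block_size - 1) / 2
    return [1.0 - abs(i - center) / (center + 1) for i in range(block_size)]


def combine(block_logits, starts, num_tokens, weighting="uniform"):
    """Weighted average of per-block logits for every token position.

    Assumes every token position is covered by at least one block.
    """
    totals = [0.0] * num_tokens
    weights = [0.0] * num_tokens
    for logits, start in zip(block_logits, starts):
        w = hat_weights(len(logits)) if weighting == "hat" else [1.0] * len(logits)
        for offset, (logit, wi) in enumerate(zip(logits, w)):
            totals[start + offset] += wi * logit
            weights[start + offset] += wi
    return [t / w for t, w in zip(totals, weights)]


# Two blocks of 4 tokens that overlap on positions 2 and 3.
blocks = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
uniform = combine(blocks, starts=[0, 2], num_tokens=6)
hat = combine(blocks, starts=[0, 2], num_tokens=6, weighting="hat")
```

With uniform weights both overlapping positions end up at 0.5. With the triangular weights, position 2 leans towards the first block (where it sits nearer the middle) and position 3 towards the second, so predictions made at a block's edge with little context count for less.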

[!NOTE] In wtpsplit, the SaT implementation treats newlines as sentence boundaries by default. However, this leads to poor results on text extracted from PDFs, as in the example above. In wtpsplit-lite, newlines are therefore treated as whitespace by default. You can choose which behavior you prefer with the treat_newline_as_space boolean keyword argument of the SaT.split method.
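
To see which newline behavior suits a given document, both settings can be compared side by side. The helper below is a hypothetical sketch: `compare_newline_handling` is not part of the package, and it only assumes that the splitter object exposes a `split` method accepting the `treat_newline_as_space` keyword described above.

```python
def compare_newline_handling(splitter, text):
    """Split `text` once per newline behavior and return both results."""
    return {
        "newline_as_space": splitter.split(text, treat_newline_as_space=True),
        "newline_as_boundary": splitter.split(text, treat_newline_as_space=False),
    }


# Intended usage, assuming the package and a model are available:
# from wtpsplit_lite import SaT
# results = compare_newline_handling(SaT("sat-3l-sm"), text)
```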

Contributing

Prerequisites
  1. Generate an SSH key and add the SSH key to your GitHub account.

  2. Configure SSH to automatically load your SSH keys:

    cat << EOF >> ~/.ssh/config
    
    Host *
      AddKeysToAgent yes
      IgnoreUnknown UseKeychain
      UseKeychain yes
      ForwardAgent yes
    EOF
    
  3. Install Docker Desktop.

  4. Install VS Code and VS Code's Dev Containers extension. Alternatively, install PyCharm.

  5. Optional: install a Nerd Font such as FiraCode Nerd Font and configure VS Code or PyCharm to use it.

Development environments

The following development environments are supported:

  1. ⭐️ GitHub Codespaces: click on Open in GitHub Codespaces to start developing in your browser.

  2. ⭐️ VS Code Dev Container (with container volume): click on Open in Dev Containers to clone this repository in a container volume and create a Dev Container with VS Code.

  3. ⭐️ uv: clone this repository and run the following from the root of the repository:

    # Create and install a virtual environment
    uv sync --python 3.10 --all-extras
    
    # Activate the virtual environment
    source .venv/bin/activate
    
    # Install the pre-commit hooks
    pre-commit install --install-hooks
    
  4. VS Code Dev Container: clone this repository, open it with VS Code, and run Ctrl/⌘ + ⇧ + P → Dev Containers: Reopen in Container.

  5. PyCharm Dev Container: clone this repository, open it with PyCharm, create a Dev Container with Mount Sources, and configure an existing Python interpreter at /opt/venv/bin/python.

Developing
  • This project follows the Conventional Commits standard to automate Semantic Versioning and Keep A Changelog with Commitizen.
  • Run poe from within the development environment to print a list of Poe the Poet tasks available to run on this project.
  • Run uv add {package} from within the development environment to install a runtime dependency and add it to pyproject.toml and uv.lock. Add --dev to install a development dependency.
  • Run uv sync --upgrade from within the development environment to upgrade all dependencies to the latest versions allowed by pyproject.toml. Add --only-dev to upgrade the development dependencies only.
  • Run cz bump to bump the package's version, update the CHANGELOG.md, and create a git tag. Then push the changes and the git tag with git push origin main --tags.
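
To make the link between Conventional Commits and Semantic Versioning concrete, the sketch below parses a commit message header and suggests a version bump using the basic rules of the Conventional Commits specification (fix gives a patch, feat gives a minor, a breaking change gives a major bump). It is an illustration only: `suggested_bump` is a hypothetical helper, and Commitizen's configured rules for this project may differ.

```python
import re

# Header format: <type>[(scope)][!]: <description>
HEADER = re.compile(
    r"^(?P<type>[a-zA-Z]+)(?:\((?P<scope>[^()]+)\))?(?P<breaking>!)?: (?P<description>.+)$"
)


def suggested_bump(message):
    """Return 'major', 'minor', 'patch', or None for a commit message."""
    lines = message.splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        raise ValueError("not a Conventional Commits header")
    # A breaking change is marked with '!' in the header or with a footer.
    footer_breaking = any(
        line.startswith(("BREAKING CHANGE:", "BREAKING-CHANGE:")) for line in lines[1:]
    )
    if match["breaking"] or footer_breaking:
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] == "fix":
        return "patch"
    return None


print(suggested_bump("feat(split): add hat weighting"))  # minor
print(suggested_bump("fix: handle empty input"))  # patch
```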

