Word segmentation models

wordseg

wordseg is a Python package of word segmentation models.

Installation

wordseg is available through pip:

pip install wordseg

To install wordseg from the GitHub source:

git clone https://github.com/jacksonllee/wordseg.git
cd wordseg
pip install -e ".[dev]"

Usage

wordseg implements each word segmentation model as a Python class. A model instance has the following methods (following the scikit-learn-style API for machine learning):

  • fit: Train the model with segmented sentences.
  • predict: Predict the segmented sentences from unsegmented sentences.

The implemented model classes are as follows:

  • RandomSegmenter: A segmentation is predicted at random, independently at each potential word boundary, with a given probability. No training is required.
  • LongestStringMatching: This model constructs predicted words by moving from left to right along an unsegmented sentence and, at each position, taking the longest matching word from the training data, constrained by a maximum word length parameter.
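
To make the random baseline concrete, here is a minimal pure-Python sketch of the idea behind RandomSegmenter. It does not use wordseg itself. The function name `random_segment` and the parameters `prob` and `rng` are hypothetical names chosen for illustration and are not taken from the package.

```python
import random


def random_segment(sentence, prob, rng=None):
    """Split `sentence` by placing a boundary between each pair of
    adjacent characters independently with probability `prob`.

    Illustrative sketch only; not the wordseg implementation.
    """
    rng = rng or random.Random()
    if not sentence:
        return []
    words = []
    current = sentence[0]
    for char in sentence[1:]:
        # Each gap between two characters is a potential word boundary.
        if rng.random() < prob:
            words.append(current)
            current = char
        else:
            current += char
    words.append(current)
    return words


# prob=0 never splits; prob=1 splits at every potential boundary.
print(random_segment("thisisasentence", prob=0.0))
print(random_segment("abc", prob=1.0))
# Intermediate values give a different segmentation per run unless `rng` is seeded.
print(random_segment("thisisasentence", prob=0.3, rng=random.Random(0)))
```

Because every boundary decision is independent of the data, nothing is learned from segmented sentences, which is why no training is required.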

Sample code snippet:

from wordseg import LongestStringMatching

# Initialize a model.
model = LongestStringMatching(max_word_length=4)

# Train the model.
# `fit` takes an iterable of segmented sentences,
# where each sentence is a tuple or list of strings.
model.fit(
    [
        ("this", "is", "a", "sentence"),
        ("that", "is", "not", "a", "sentence"),
    ]
)

# Make some predictions; `predict` gives a generator, which is materialized by list() here.
list(model.predict(["thatisadog", "thisisnotacat"]))
# [['that', 'is', 'a', 'd', 'o', 'g'], ['this', 'is', 'not', 'a', 'c', 'a', 't']]
# We can't get 'dog' and 'cat' because they aren't in the training data.
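
The output above follows from greedy left-to-right matching. Below is a minimal pure-Python sketch of that idea, assuming the vocabulary is simply the set of words seen in training and that an unmatched character becomes a one-character word. The function `longest_string_matching` and its arguments are hypothetical and written for illustration. This is not the wordseg source code.

```python
def longest_string_matching(sentence, vocabulary, max_word_length):
    """Greedily segment `sentence` from left to right.

    Illustrative sketch only; not the wordseg implementation.
    """
    words = []
    i = 0
    while i < len(sentence):
        longest = min(max_word_length, len(sentence) - i)
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(longest, 0, -1):
            candidate = sentence[i:i + length]
            # Assumption: fall back to a single character when nothing matches.
            if length == 1 or candidate in vocabulary:
                words.append(candidate)
                i += length
                break
    return words


# Vocabulary collected from the two training sentences in the snippet above.
vocabulary = {"this", "that", "is", "not", "a", "sentence"}

print(longest_string_matching("thatisadog", vocabulary, max_word_length=4))
# ['that', 'is', 'a', 'd', 'o', 'g']
print(longest_string_matching("thisisnotacat", vocabulary, max_word_length=4))
# ['this', 'is', 'not', 'a', 'c', 'a', 't']
```

Since "dog" and "cat" are not in the vocabulary, no candidate longer than one character matches at those positions, so the sketch emits single characters there.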

License

MIT License. Please see LICENSE.txt.

Changelog

Please see CHANGELOG.md.

Contributing

Please see CONTRIBUTING.md.

Citation

Lee, Jackson L. 2023. wordseg: Word segmentation models in Python. https://doi.org/10.5281/zenodo.4077433

@software{leengrams,
  author       = {Jackson L. Lee},
  title        = {wordseg: Word segmentation models in Python},
  year         = 2023,
  doi          = {10.5281/zenodo.4077433},
  url          = {https://doi.org/10.5281/zenodo.4077433}
}
