wordseg

wordseg is a Python package of word segmentation models.

Installation

wordseg is available through pip:

pip install wordseg

To install wordseg from the GitHub source:

git clone https://github.com/jacksonllee/wordseg.git
cd wordseg
pip install -r dev-requirements.txt  # For running the linter and tests
pip install -e .

Usage

wordseg implements each word segmentation model as a Python class. A model instance has the following methods (following the scikit-learn-style API for machine learning):

  • fit: Train the model with segmented sentences.
  • predict: Predict the segmented sentences from unsegmented sentences.

The implemented model classes are as follows:

  • RandomSegmenter: A segmentation is predicted at random, independently at each potential word boundary, with a given probability. No training is required.
  • LongestStringMatching: This model constructs predicted words by moving from left to right along an unsegmented sentence and finding the longest matching words, constrained by a maximum word length parameter.
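
The two strategies described above can be sketched in plain Python. This is only an illustration of the ideas, not wordseg's actual implementation: the function names `random_segment` and `longest_match_segment`, the `prob` and `rng` parameters, and the single-character fallback are assumptions made for this sketch.

```python
import random


def random_segment(sentence, prob=0.5, rng=None):
    """Cut at each internal boundary independently with probability `prob`.

    Hypothetical helper for illustration; not part of wordseg.
    """
    rng = rng or random.Random()
    if not sentence:
        return []
    words = [sentence[0]]
    for char in sentence[1:]:
        if rng.random() < prob:
            words.append(char)  # Boundary drawn: start a new word.
        else:
            words[-1] += char   # No boundary: extend the current word.
    return words


def longest_match_segment(sentence, vocabulary, max_word_length=4):
    """Greedy left-to-right longest matching against a known vocabulary.

    Hypothetical helper for illustration; not part of wordseg.
    """
    words = []
    i = 0
    while i < len(sentence):
        longest = min(max_word_length, len(sentence) - i)
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(longest, 0, -1):
            candidate = sentence[i:i + length]
            # Assumption: an unknown single character is emitted as its own word.
            if length == 1 or candidate in vocabulary:
                words.append(candidate)
                i += length
                break
    return words


# The "training" step here just collects the words seen in segmented sentences.
training = [
    ("this", "is", "a", "sentence"),
    ("that", "is", "not", "a", "sentence"),
]
vocabulary = {word for sentence in training for word in sentence}

print(longest_match_segment("thatisadog", vocabulary))
# ['that', 'is', 'a', 'd', 'o', 'g']
```

In this sketch, the maximum word length bounds how many candidates are tried at each position, so "sentence" (8 characters) can never be matched when `max_word_length=4`.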

Sample code snippet:

from wordseg import LongestStringMatching

# Initialize a model.
model = LongestStringMatching(max_word_length=4)

# Train the model.
# `fit` takes an iterable of segmented sentences; each sentence is a tuple or list of strings.
model.fit(
  [
    ("this", "is", "a", "sentence"),
    ("that", "is", "not", "a", "sentence"),
  ]
)

# Make some predictions; `predict` returns a generator, which is materialized by list() here.
list(model.predict(["thatisadog", "thisisnotacat"]))
# [['that', 'is', 'a', 'd', 'o', 'g'], ['this', 'is', 'not', 'a', 'c', 'a', 't']]
# We can't get 'dog' and 'cat' because they aren't in the training data.

License

MIT License. Please see LICENSE.txt.

Changelog

Please see CHANGELOG.md.

Contributing

Please see CONTRIBUTING.md.

Citation

Lee, Jackson L. 2022. wordseg: Word segmentation models in Python. https://doi.org/10.5281/zenodo.4077433

@software{leengrams,
  author       = {Jackson L. Lee},
  title        = {wordseg: Word segmentation models in Python},
  year         = 2022,
  doi          = {10.5281/zenodo.4077433},
  url          = {https://doi.org/10.5281/zenodo.4077433}
}
