segments

The segments package provides Unicode Standard tokenization routines and orthography segmentation, implementing the linear algorithm described in the orthography profile specification from The Unicode Cookbook (Moran and Cysouw 2018).

Command line usage

Create a text file:

$ echo "aäaaöaaüaa" > text.txt

Now generate an initial profile from the text:

$ cat text.txt | segments profile
Grapheme        frequency       mapping
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö
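
The profile above is essentially a frequency table of the graphemes found in the input. As a rough illustration only, and not the package's own code, the following standard-library sketch counts graphemes by attaching combining marks to the preceding base character. The helper names simple_graphemes and grapheme_frequencies are hypothetical, and the full Unicode grapheme cluster rules are more involved than this simplification.

```python
import unicodedata
from collections import Counter


def simple_graphemes(text):
    """Split text into rough grapheme clusters (simplified, assumption of this sketch).

    A cluster is a base character plus any combining marks that follow it.
    """
    clusters = []
    # Decompose first so that precomposed and decomposed input behave the same.
    for ch in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the combining mark to its base
        else:
            clusters.append(ch)
    # Recompose each cluster for display.
    return [unicodedata.normalize("NFC", c) for c in clusters]


def grapheme_frequencies(text):
    """Count how often each grapheme cluster occurs in text."""
    return Counter(simple_graphemes(text))


freq = grapheme_frequencies("aäaaöaaüaa")
print("Grapheme", "frequency", "mapping", sep="\t")
for grapheme, count in freq.most_common():
    # In a fresh profile every grapheme maps to itself.
    print(grapheme, count, grapheme, sep="\t")
```

The row order among graphemes with equal frequency may differ from what the command line tool prints.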

Write the profile to a file:

$ cat text.txt | segments profile > profile.prf

Edit the profile, for example by adding a row that maps the grapheme aa to x:

$ more profile.prf
Grapheme        frequency       mapping
aa      0       x
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö
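
To make the file layout concrete, here is a minimal sketch that reads such a profile with Python's csv module. It assumes the columns are tab-separated, as the printed Profile in the API section suggests. The function name read_profile and the inline sample text are hypothetical and not part of the segments package.

```python
import csv
import io

# A shortened copy of the edited profile, with tab-separated columns (assumed).
PROFILE_TEXT = (
    "Grapheme\tfrequency\tmapping\n"
    "aa\t0\tx\n"
    "a\t7\ta\n"
)


def read_profile(stream):
    """Return a dict that maps each grapheme to its full profile row."""
    reader = csv.DictReader(stream, delimiter="\t")
    return {row["Grapheme"]: row for row in reader}


profile = read_profile(io.StringIO(PROFILE_TEXT))
print(profile["aa"]["mapping"])  # x
```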

Now tokenize the text without a profile:

$ cat text.txt | segments tokenize
a ä a a ö a a ü a a

And with the profile:

$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa

And with the profile, printing the values of the mapping column instead of the graphemes:

$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
a ä x ö x ü x
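
The outputs above are consistent with a greedy, longest-match-first reading of the profile: at each position the longest grapheme listed in the profile wins. The sketch below illustrates that idea under stated assumptions and is not the package's implementation. The function tokenize, its use_mapping parameter, and the placeholder used for unmatched characters are all hypothetical choices made for this example.

```python
def tokenize(text, profile, use_mapping=False, missing="\ufffd"):
    """Segment text greedily, preferring the longest grapheme in profile.

    profile maps graphemes to their mapping values. Characters that match
    no grapheme are replaced by `missing` (a choice of this sketch).
    """
    max_len = max(map(len, profile))
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in profile:
                out.append(profile[chunk] if use_mapping else chunk)
                i += size
                break
        else:
            # No grapheme in the profile starts here.
            out.append(missing)
            i += 1
    return " ".join(out)


PROFILE = {"aa": "x", "a": "a", "ä": "ä", "ü": "ü", "ö": "ö"}
TEXT = "aäaaöaaüaa"

print(tokenize(TEXT, PROFILE))                    # a ä aa ö aa ü aa
print(tokenize(TEXT, PROFILE, use_mapping=True))  # a ä x ö x ü x
```

The first a stays a single segment because the following character is ä, so aa does not match at that position.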

API

>>> from segments import Profile, Tokenizer
>>> t = Tokenizer()
>>> t('abcd')
'a b c d'
>>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
>>> print(prf)
Grapheme	mapping
ab	x
cd	y
>>> t = Tokenizer(profile=prf)
>>> t('abcd')
'ab cd'
>>> t('abcd', column='mapping')
'x y'
