Segmentation with orthography profiles

segments

The segments package provides Unicode Standard tokenization routines and orthography segmentation. It implements the linear algorithm described in the orthography profile specification from The Unicode Cookbook (Moran and Cysouw 2018).

Command line usage

Create a text file:

$ echo "aäaaöaaüaa" > text.txt

Now generate a profile from the text:

$ cat text.txt | segments profile
Grapheme        frequency       mapping
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö
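
The profile listed above is essentially a frequency table of the graphemes in the input. The following is a minimal sketch of that idea in plain Python, not the package's implementation: `simple_graphemes` and `draft_profile` are hypothetical helper names, and a grapheme is approximated here as a base character plus any following combining marks, which is simpler than full Unicode grapheme cluster segmentation.

```python
import unicodedata
from collections import Counter


def simple_graphemes(text):
    """Split text into base characters plus any following combining marks.

    This is a simplification of Unicode grapheme clusters, used only
    for illustration.
    """
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    # Recompose so that "a" + combining diaeresis is reported as "ä".
    return [unicodedata.normalize("NFC", c) for c in clusters]


def draft_profile(text):
    """Return (grapheme, frequency, mapping) rows, most frequent first."""
    counts = Counter(g for g in simple_graphemes(text) if not g.isspace())
    # In a fresh profile every grapheme maps to itself.
    return [(g, n, g) for g, n in counts.most_common()]


rows = draft_profile("aäaaöaaüaa")
print("Grapheme\tfrequency\tmapping")
for grapheme, freq, mapping in rows:
    print(f"{grapheme}\t{freq}\t{mapping}")
```

The order of rows with equal frequency may differ from the command line output.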

Write the profile to a file:

$ cat text.txt | segments profile > profile.prf

Edit the profile, here adding a row that maps the grapheme aa to x:

$ more profile.prf
Grapheme        frequency       mapping
aa      0       x
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö
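
A profile like the one above can also be inspected from Python. This sketch assumes the columns are tab-separated and reads the table with the standard `csv` module. `read_profile` is a hypothetical helper for illustration and is not part of the segments API.

```python
import csv
import io

# Contents of the edited profile, assumed to be tab-separated.
PROFILE_TEXT = (
    "Grapheme\tfrequency\tmapping\n"
    "aa\t0\tx\n"
    "a\t7\ta\n"
    "ä\t1\tä\n"
    "ü\t1\tü\n"
    "ö\t1\tö\n"
)


def read_profile(stream):
    """Return a dict from each grapheme to its full profile row."""
    reader = csv.DictReader(stream, delimiter="\t")
    return {row["Grapheme"]: row for row in reader}


profile = read_profile(io.StringIO(PROFILE_TEXT))
print(profile["aa"]["mapping"])  # prints: x
```

To read the file written earlier, the same function could be given `open("profile.prf", encoding="utf-8", newline="")` instead of the in-memory stream.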

Now tokenize the text without a profile:

$ cat text.txt | segments tokenize
a ä a a ö a a ü a a

And with the profile:

$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa

To output the values of the profile's mapping column instead of the graphemes:

$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
a ä x ö x ü x
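
The two outputs above are consistent with a left-to-right segmentation that prefers the longest grapheme listed in the profile. The following is a minimal sketch under that assumption, not the package's implementation, which may differ in details such as normalization and error handling. `segment` is a hypothetical function name, and passing unknown characters through unchanged is a choice made for this sketch only.

```python
def segment(text, profile):
    """Greedy left-to-right, longest-match segmentation.

    `profile` maps each grapheme to its replacement value. Characters
    not covered by the profile are emitted unchanged (an assumption of
    this sketch).
    """
    longest = max((len(g) for g in profile), default=1)
    pos, graphemes = 0, []
    while pos < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for size in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in profile:
                break
        else:
            candidate = text[pos]  # fallback: single unknown character
        graphemes.append(candidate)
        pos += len(candidate)
    return graphemes


profile = {"aa": "x", "a": "a", "ä": "ä", "ü": "ü", "ö": "ö"}
graphemes = segment("aäaaöaaüaa", profile)
print(" ".join(graphemes))                         # a ä aa ö aa ü aa
print(" ".join(profile.get(g, g) for g in graphemes))  # a ä x ö x ü x
```

Because `aa` is tried before `a`, the first `a` stays on its own only where the following character is `ä`, which has no two-character match.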

API

>>> from segments import Profile, Tokenizer
>>> t = Tokenizer()
>>> t('abcd')
'a b c d'
>>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
>>> print(prf)
Grapheme	mapping
ab	x
cd	y
>>> t = Tokenizer(profile=prf)
>>> t('abcd')
'ab cd'
>>> t('abcd', column='mapping')
'x y'
