segments

The segments package provides Unicode Standard tokenization routines and orthography segmentation, implementing the linear algorithm described in the orthography profile specification from The Unicode Cookbook (Moran and Cysouw 2018).

Command line usage

Create a text file:

$ echo "aäaaöaaüaa" > text.txt

Now generate a profile from the text and look at it:

$ cat text.txt | segments profile
Grapheme        frequency       mapping
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö
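
As a rough illustration of what such an initial profile contains, the frequency counts above can be reproduced with a few lines of standard-library Python. This is only a sketch, not the package's implementation: the function names `initial_profile` and `format_profile` are hypothetical, the text is assumed to be NFC-normalized so that each accented letter is a single code point, and the row order for equal frequencies may differ from the tool's output.

```python
import unicodedata
from collections import Counter


def initial_profile(text):
    """Count every non-whitespace character of the NFC-normalized text."""
    normalized = unicodedata.normalize("NFC", text)
    return Counter(ch for ch in normalized if not ch.isspace())


def format_profile(counts):
    """Render the counts as tab-separated rows, most frequent first."""
    rows = ["Grapheme\tfrequency\tmapping"]
    for grapheme, frequency in counts.most_common():
        # In an unedited profile each grapheme simply maps to itself.
        rows.append(f"{grapheme}\t{frequency}\t{grapheme}")
    return "\n".join(rows)


print(format_profile(initial_profile("aäaaöaaüaa")))
```

Counting single code points is a simplification; it is enough for this example text but would split graphemes that consist of a base character plus combining marks.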

Write the profile to a file:

$ cat text.txt | segments profile > profile.prf

Edit the profile, adding a row that maps the grapheme aa to x:

$ more profile.prf
Grapheme        frequency       mapping
aa      0       x
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö

Now tokenize the text without a profile:

$ cat text.txt | segments tokenize
a ä a a ö a a ü a a

And with the profile:

$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa

And with the profile, printing the mapping column instead of the graphemes:

$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
a ä x ö x ü x
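
The two profile-based results above are consistent with a greedy longest-match segmentation, which can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the package's own code: the function `segment` and its `use_mapping` flag are hypothetical, the profile is represented as a simple dict from grapheme to mapping, the text is assumed to be NFC-normalized, and characters missing from the profile are passed through unchanged.

```python
import unicodedata


def segment(text, profile, use_mapping=False):
    """Split text by always taking the longest profile grapheme that matches."""
    text = unicodedata.normalize("NFC", text)
    profile = {unicodedata.normalize("NFC", g): m for g, m in profile.items()}
    longest = max((len(g) for g in profile), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in profile:
                tokens.append(profile[chunk] if use_mapping else chunk)
                i += size
                break
        else:
            # Assumption: a character not in the profile is emitted as-is.
            tokens.append(text[i])
            i += 1
    return " ".join(tokens)


# The edited profile from above, as grapheme -> mapping.
profile = {"aa": "x", "a": "a", "ä": "ä", "ü": "ü", "ö": "ö"}

print(segment("aäaaöaaüaa", profile))                    # a ä aa ö aa ü aa
print(segment("aäaaöaaüaa", profile, use_mapping=True))  # a ä x ö x ü x
```

The first a stays a single token because the two-character candidate aä is not in the profile, while every later aa is matched as a unit.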

API

>>> from __future__ import unicode_literals, print_function
>>> from segments import Profile, Tokenizer
>>> t = Tokenizer()
>>> t('abcd')
'a b c d'
>>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
>>> print(prf)
Grapheme	mapping
ab	x
cd	y
>>> t = Tokenizer(profile=prf)
>>> t('abcd')
'ab cd'
>>> t('abcd', column='mapping')
'x y'
