No project description provided
Project description
segments
The segments package provides Unicode Standard tokenization routines and orthography profile segmentation.
Command line usage
Create a text file:
$ echo "aäaaöaaüaa" > text.txt
Now look at the profile:
$ cat text.txt | segments profile
Grapheme frequency mapping
a 7 a
ä 1 ä
ü 1 ü
ö 1 ö
Write the profile to a file:
$ cat text.txt | segments profile > profile.prf
Edit the profile:
$ more profile.prf
Grapheme frequency mapping
aa 0 x
a 7 a
ä 1 ä
ü 1 ü
ö 1 ö
Now tokenize the text without profile:
$ cat text.txt | segments tokenize
a ä a a ö a a ü a a
And with profile:
$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa
$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
a ä x ö x ü x
API
>>> from __future__ import unicode_literals, print_function
>>> from segments import Profile, Tokenizer
>>> t = Tokenizer()
>>> t('abcd')
'a b c d'
>>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
>>> print(prf)
Grapheme mapping
ab x
cd y
>>> t = Tokenizer(profile=prf)
>>> t('abcd')
'ab cd'
>>> t('abcd', column='mapping')
'x y'
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
segments-1.2.2.tar.gz
(13.7 kB
view hashes)
Built Distribution
Close
Hashes for segments-1.2.2-py2.py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | eedcf972941fd6740031dfa0d93c4b32f00c60db9a3758a4c1b5161e2c552165 |
|
MD5 | a896e2c4b5f92571329aba5d4a400f5c |
|
BLAKE2b-256 | 64f65a23502d06b0fadbb92b3da8fd06ffee18a387d0ec989b8c18c928593d6f |