segments
The segments package provides Unicode Standard tokenization routines and orthography profile segmentation.
Command line usage
Create a text file:
$ echo "aäaaöaaüaa" > text.txt
Now look at the profile generated from it:
$ cat text.txt | segments profile
Grapheme frequency mapping
a 7 a
ä 1 ä
ü 1 ü
ö 1 ö
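The counting step behind such a profile can be sketched in plain Python. This is only an illustration, not the package's implementation: the helper names `graphemes` and `frequency_profile` are hypothetical, and grapheme clusters are approximated by attaching combining marks to the preceding base character, which is simpler than full Unicode grapheme segmentation.

```python
import unicodedata
from collections import Counter


def graphemes(text):
    """Approximate grapheme clusters (assumption: base character plus
    any following combining marks; not full Unicode segmentation)."""
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.combining(ch):
            # Attach the combining mark to the previous base character.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    # Recompose so that e.g. "a" + diaeresis is reported as one code point.
    return [unicodedata.normalize("NFC", c) for c in clusters]


def frequency_profile(text):
    """Return (grapheme, count) pairs, most frequent first, skipping whitespace."""
    counts = Counter(g for g in graphemes(text) if not g.isspace())
    return counts.most_common()


if __name__ == "__main__":
    for grapheme, count in frequency_profile("aäaaöaaüaa\n"):
        # Initial mapping column is the grapheme itself, as in the listing above.
        print(grapheme, count, grapheme)
```

Graphemes with equal counts may come out in a different order than in the command line listing.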
Write the profile to a file:
$ cat text.txt | segments profile > profile.prf
Edit the profile, for example by adding a row that maps the grapheme aa to x:
$ more profile.prf
Grapheme frequency mapping
aa 0 x
a 7 a
ä 1 ä
ü 1 ü
ö 1 ö
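For inspecting an edited profile outside the command line tool, a minimal reader could look like the sketch below. It assumes the columns are tab-separated with a header row, as in the listing above; the function name `read_profile` is hypothetical and not part of the package.

```python
import csv
import io


def read_profile(stream, delimiter="\t"):
    """Read profile rows into a list of dicts keyed by the header row.

    Assumption: tab-separated columns. Quoting is disabled so that
    quote characters can appear as graphemes.
    """
    reader = csv.DictReader(stream, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return [dict(row) for row in reader]


# A two-row sample in the same layout as profile.prf.
sample = "Grapheme\tfrequency\tmapping\naa\t0\tx\na\t7\ta\n"
rows = read_profile(io.StringIO(sample))

if __name__ == "__main__":
    for row in rows:
        print(row["Grapheme"], row["frequency"], row["mapping"])
```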
Now tokenize the text without a profile:
$ cat text.txt | segments tokenize
a ä a a ö a a ü a a
And with the profile:
$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa
And with the profile, emitting the mapping column instead of the graphemes:
$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
a ä x ö x ü x
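The two outputs above are consistent with a greedy longest-match strategy: at each position, the longest grapheme listed in the profile wins. The sketch below illustrates that idea under stated assumptions. It is not the package's code, the function `tokenize` and its `use_mapping` parameter are hypothetical, and passing unknown characters through unchanged is a choice made here for simplicity.

```python
import unicodedata


def tokenize(text, profile, use_mapping=False):
    """Greedy longest-match segmentation against a grapheme -> mapping dict."""
    def nfc(s):
        return unicodedata.normalize("NFC", s)

    text = nfc(text)
    # Drop empty keys (they would never advance the position).
    table = {nfc(k): v for k, v in profile.items() if k}
    keys = sorted(table, key=len, reverse=True)  # longest graphemes first

    out = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1  # whitespace only separates input, it is not a segment
            continue
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key] if use_mapping else key)
                i += len(key)
                break
        else:
            # Assumption: characters missing from the profile pass through as-is.
            out.append(text[i])
            i += 1
    return " ".join(out)


# Same rows as the edited profile.prf: grapheme -> mapping.
profile = {"aa": "x", "a": "a", "ä": "ä", "ü": "ü", "ö": "ö"}

if __name__ == "__main__":
    print(tokenize("aäaaöaaüaa", profile))
    print(tokenize("aäaaöaaüaa", profile, use_mapping=True))
```

Sorting the keys by length is what makes aa win over a wherever both match.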
API
>>> from __future__ import unicode_literals, print_function
>>> from segments import Profile, Tokenizer
>>> t = Tokenizer()
>>> t('abcd')
'a b c d'
>>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
>>> print(prf)
Grapheme mapping
ab x
cd y
>>> t = Tokenizer(profile=prf)
>>> t('abcd')
'ab cd'
>>> t('abcd', column='mapping')
'x y'