Convert character span labels to token-based labels for Japanese text
Project description
noyaki
Converts character span label information to tokenized text-based label information.
Installation
$ pip install noyaki
Usage
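Pass the tokenized text and the label information as arguments to the convert function. Each label is a [start, end, label] list of character offsets into the original text; in the example below, [3, 5, 'PERSON'] covers 田中 (characters 3 and 4), so the end offset is exclusive.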
import noyaki
label_list = noyaki.convert(
    ['明日', 'は', '田中', 'さん', 'に', '会う'],
    [[3, 5, 'PERSON']]
)
print(label_list)
# ['O', 'O', 'U-PERSON', 'O', 'O', 'O']
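To make the mapping from character spans to token labels concrete, here is a minimal sketch of the idea in plain Python. It is not noyaki's actual implementation: the function name `spans_to_bilou` is hypothetical, and it assumes that the tokens concatenate exactly to the original text and that spans use an exclusive end offset.

```python
def spans_to_bilou(tokens, spans):
    """Illustrative sketch only; not noyaki's actual code."""
    # Character offsets of each token, assuming the tokens
    # concatenate exactly to the original text.
    offsets = []
    pos = 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok)

    labels = ["O"] * len(tokens)
    for start, end, name in spans:
        # Indices of tokens that lie entirely inside [start, end).
        inside = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end]
        if not inside:
            continue
        if len(inside) == 1:
            # A single-token entity gets the U (unit) tag.
            labels[inside[0]] = f"U-{name}"
        else:
            # Multi-token entity: B (begin), I (inside), L (last).
            labels[inside[0]] = f"B-{name}"
            for i in inside[1:-1]:
                labels[i] = f"I-{name}"
            labels[inside[-1]] = f"L-{name}"
    return labels


print(spans_to_bilou(['明日', 'は', '田中', 'さん', 'に', '会う'], [[3, 5, 'PERSON']]))
# ['O', 'O', 'U-PERSON', 'O', 'O', 'O']
```

The sketch ignores tokens that only partially overlap a span; how a real converter should treat those cases is a design decision that this example does not cover.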
If the tokens contain a subword symbol (e.g. ##) that should be removed, pass it as the subword argument.
import noyaki
label_list = noyaki.convert(
    ['明日', 'は', '田', '##中', 'さん', 'に', '会う'],
    [[3, 5, 'PERSON']],
    subword="##"
)
print(label_list)
# ['O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O']
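The subword symbol matters because character offsets refer to the original text, where the symbol does not appear. The sketch below shows the kind of normalization involved; `strip_subword` is a hypothetical helper written for illustration, not part of noyaki's API.

```python
def strip_subword(tokens, subword="##"):
    """Remove a leading subword marker from each token (illustrative helper)."""
    return [t[len(subword):] if t.startswith(subword) else t for t in tokens]


tokens = ['明日', 'は', '田', '##中', 'さん', 'に', '会う']
surface = strip_subword(tokens)
print(surface)
# ['明日', 'は', '田', '中', 'さん', 'に', '会う']
print("".join(surface))
# 明日は田中さんに会う
```

Without this step, '##中' would count as three characters instead of one, and every offset after it would be shifted by two.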
To use the IOB2 tag format instead of BILOU, specify the scheme argument.
import noyaki
label_list = noyaki.convert(
    ['明日', 'は', '田', '##中', 'さん', 'に', '会う'],
    [[3, 5, 'PERSON']],
    scheme="IOB2"
)
print(label_list)
# ['O', 'O', 'B-PERSON', 'I-PERSON', 'O', 'O', 'O']
Note
Only Japanese is supported.
The supported tag formats are as follows:
- BILOU
- IOB2
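The two formats differ only in how they mark single-token entities and entity ends: BILOU uses U and L, where IOB2 uses B and I. As a rough illustration, the following sketch converts BILOU labels to IOB2; `bilou_to_iob2` is a hypothetical helper, not a noyaki function.

```python
def bilou_to_iob2(labels):
    """Map BILOU tags onto IOB2 tags: U -> B, L -> I (illustrative helper)."""
    mapping = {"U": "B", "L": "I"}
    converted = []
    for label in labels:
        if label == "O":
            converted.append(label)
            continue
        # Split "B-PERSON" into the prefix "B" and the entity name "PERSON".
        prefix, _, name = label.partition("-")
        converted.append(f"{mapping.get(prefix, prefix)}-{name}")
    return converted


print(bilou_to_iob2(['O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O']))
# ['O', 'O', 'B-PERSON', 'I-PERSON', 'O', 'O', 'O']
```

The reverse direction needs to look at neighboring labels to decide where an entity ends, so it is not a simple per-label mapping.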
Source Distribution
noyaki-0.2.0.tar.gz (3.0 kB)
Built Distribution
Hashes for noyaki-0.2.0-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | f0791462f39346d501687535fc01ee7ed8bde359b7978d31ea542a43f8550b11
MD5 | ac23d282eff87e632e3070af984fa625
BLAKE2b-256 | e05424873254f71f183a580c76927928e6b6ad41649ee1bc3ffaf197278d2fcd