Span to IBO
This script converts a file exported from doccano into a format that is easy to handle with sklearn-crfsuite.
Usage
python doccano.py --input_path <path to doccano exported jsonl file> --output_path <path to output file>
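For illustration, the two flags above could be parsed with argparse roughly as follows. This is only a sketch: the function name build_parser and the help strings are assumptions, not taken from the actual doccano.py.

```python
import argparse


def build_parser():
    """Build a parser for the two flags shown in the usage line (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Convert a doccano JSONL export to token-level labels."
    )
    parser.add_argument("--input_path", required=True,
                        help="path to the doccano exported jsonl file")
    parser.add_argument("--output_path", required=True,
                        help="path to the output file")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--input_path", "export.jsonl", "--output_path", "converted.json"]
)
print(args.input_path, args.output_path)
```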
Input file format
The input file is a JSONL file exported from doccano.
{"text": "東京都渋谷区渋谷 2丁目2−8 渋谷マークシティ", "labels": [[0, 9, "LOC"]]}
{"text": "東京都渋谷区神南 1丁目1−1", "labels": [[0, 7, "LOC"]]}
...
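Each record carries its labels as [start, end, label] character spans. A minimal sketch of turning such spans into B-/I-/O tags is shown below. It assumes that end offsets are exclusive and that the tokens concatenate back to the original text. The token boundaries here are made up for illustration, and the function names are hypothetical. The actual script presumably relies on a morphological analyzer for tokenization, since its output includes part-of-speech tags.

```python
import json


def read_doccano_jsonl(lines):
    """Parse doccano JSONL lines into (text, spans) pairs, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        records.append((record["text"], [tuple(span) for span in record["labels"]]))
    return records


def spans_to_bio(tokens, spans):
    """Tag each token B-/I-/O according to the character span that contains it.

    Assumption: the tokens, joined without separators, equal the original text.
    """
    tags = []
    offset = 0
    for token in tokens:
        start, end = offset, offset + len(token)
        tag = "O"
        for span_start, span_end, name in spans:
            if start >= span_start and end <= span_end:  # token lies inside the span
                tag = ("B-" if start == span_start else "I-") + name
                break
        tags.append(tag)
        offset = end
    return tags


# Hypothetical tokenization of the first example record.
tokens = ["東京都", "渋谷区", "渋谷", " ", "2丁目2−8", " ", "渋谷マークシティ"]
line = json.dumps({"text": "".join(tokens), "labels": [[0, 9, "LOC"]]}, ensure_ascii=False)
text, spans = read_doccano_jsonl([line])[0]
tags = spans_to_bio(tokens, spans)
```

A token that only partly overlaps a span is tagged O in this sketch. A real converter would need to decide how to handle such boundary mismatches.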
Output file format
The output file is a JSON file containing a list of sentences, each of which is a list of token objects in the following format:
[
  [
    {"word": "東京都", "label": "B-LOC", "pos_tag": "名詞", "pos_tag[:2]": "名詞,固有名詞", "pos_tag_all": "名詞,固有名詞,地域,一般,*,*,東京都,トウキョウト,トーキョート", "BOS": true, "EOS": false},
    {"word": "渋谷区", "label": "I-LOC", "pos_tag": "名詞", "pos_tag[:2]": "名詞,固有名詞", "pos_tag_all": "名詞,固有名詞,地域,一般,*,*,渋谷区,シブヤク,シブヤク", "BOS": false, "EOS": false},
    ...
  ],
  ...,
]
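One way to feed this output to sklearn-crfsuite is to split each token object into a feature dict and a label. The sketch below does that for the two tokens shown above. The helper name split_features_and_labels is hypothetical, and treating every key other than "label" as a feature is an assumption about how the file is meant to be used.

```python
def split_features_and_labels(sentences):
    """Separate token dicts into per-sentence feature dicts (X) and label lists (y)."""
    X, y = [], []
    for sentence in sentences:
        X.append([{k: v for k, v in token.items() if k != "label"} for token in sentence])
        y.append([token["label"] for token in sentence])
    return X, y


# The two tokens from the example output, as Python objects.
sentences = [[
    {"word": "東京都", "label": "B-LOC", "pos_tag": "名詞", "pos_tag[:2]": "名詞,固有名詞",
     "pos_tag_all": "名詞,固有名詞,地域,一般,*,*,東京都,トウキョウト,トーキョート",
     "BOS": True, "EOS": False},
    {"word": "渋谷区", "label": "I-LOC", "pos_tag": "名詞", "pos_tag[:2]": "名詞,固有名詞",
     "pos_tag_all": "名詞,固有名詞,地域,一般,*,*,渋谷区,シブヤク,シブヤク",
     "BOS": False, "EOS": False},
]]

X, y = split_features_and_labels(sentences)

# With sklearn-crfsuite installed, X and y could then be passed to a model, e.g.:
#   import sklearn_crfsuite
#   crf = sklearn_crfsuite.CRF()
#   crf.fit(X, y)
```

In practice the sentences would be loaded from the output file with json.load rather than written inline.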
Reference
This program is mainly based on the following repository: https://github.com/ToshihikoSakai/jsontoconll
All mistakes in this script are mine.