A Python library for The Oslo-Bergen Tagger
Project description
This is a Python library for The Oslo-Bergen Tagger, which parses the output of the tagger to a friendly format. Only Python 3 is supported at this time.
The library is in beta. See Roadmap for things that need to get implemented before a v1.0.0 can be released.
Installation
You need to have The Oslo-Bergen Tagger installed, and the environment variable OBT_PATH set to the path of its installation directory. You can use the provided code snippet below, or install it using the instructions in The-Oslo-Bergen-Tagger GitHub repository. The following code snippet installs it in your home directory. If you want to install it somewhere else, you can change the INSTALL_DIR variable on the first line to your preferred installation directory.
INSTALL_DIR=$HOME THIS_DIR=$PWD cd $INSTALL_DIR git clone git@github.com:noklesta/The-Oslo-Bergen-Tagger.git cd The-Oslo-Bergen-Tagger ./bootstrap.sh export OBT_PATH=$INSTALL_DIR/The-Oslo-Bergen-Tagger echo 'export OBT_PATH=$OBT_PATH' >> $HOME/.bashrc cd $THIS_DIR
You can then install this Python library with pip. To install for all users, do:
sudo pip3 install obt
To just install for your user, do:
pip3 install --user obt
And you are good to go!
Usage
First, import the library
import obt
Then, you can tag a string by passing it to the tag_bm function:
my_string = "Jeg er streng." tags = obt.tag_bm(my_string)
Or you can pass a file name using the file keyword argument:
tags = obt.tag_bm(file="my_document.txt")
The resulting tags will be an array of tag objects, like so:
[ { "tall": "ent", "type": "pers hum", "base": "jeg", "person": "1", "word_tag": "<jeg>", "kasus": "nom", "raw_tags": "pron ent pers hum nom 1", "word": "Jeg", "ordklasse": "pron" }, { "word_tag": "<er>", "base": "v\u00e6re", "tilleggstagger": [ "a5", "pr1", "pr2", "<aux1/perf_part>" ], "tid": "pres", "raw_tags": "verb pres a5 pr1 pr2 <aux1/perf_part>", "word": "er", "ordklasse": "verb" }, { "type": "appell", "best": "ub", "base": "streng", "word_tag": "<streng>", "tall": "ent", "ordklasse": "subst", "raw_tags": "subst appell mask ub ent", "word": "streng", "kj\u00f8nn": "mask" }, { "word_tag": "<.>", "base": "$.", "tilleggstagger": [ "<<<", "<punkt>", "<<<" ], "raw_tags": "clb <<< <punkt> <<<", "word": ".", "ordklasse": "clb" } ]
You can easily save this to a JSON file with the obt.save_json function:
obt.save_json(tags, 'my_tags.json')
Format
A documentation of the tag format will come here.
Roadmap
Before a v1.0.0 release, the following boxes should be checked: - [ ] Put “tilleggstagger” in proper items in tags object. - [ ] Implement function for ./tag-nostat-bm.sh from https://github.com/noklesta/The-Oslo-Bergen-Tagger - [ ] Implement function for ./tag-nostat-nn.sh from https://github.com/noklesta/The-Oslo-Bergen-Tagger - [ ] Python 2 support
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Filename, size | File type | Python version | Upload date | Hashes |
---|---|---|---|---|
Filename, size obt-0.1.0.tar.gz (4.3 kB) | File type Source | Python version None | Upload date | Hashes View |