FreeLing plug-in for Sparv (Språkbanken's corpus annotation pipeline)
sparv-sbx-freeling
This is a plugin for the Sparv pipeline containing a wrapper for FreeLing. Please note that this plugin has a more restrictive license than the Sparv pipeline!
This plugin allows you to run the Sparv pipeline and get sentence segmentation, tokenisation, baseform analysis, and part-of-speech annotations for the following languages:
- Asturian
- Catalan
- English
- French
- Galician
- German
- Italian
- Norwegian
- Portuguese
- Russian
- Slovenian
- Spanish
Furthermore, Sparv converts the FreeLing POS tags into Universal POS tags and outputs them as a separate annotation.
Some of these languages (Catalan, English, German, Portuguese and Spanish) also support named-entity recognition.
Prerequisites
Installation
Option 1: Installation from PyPI with pipx:
pipx inject sparv-pipeline sparv-sbx-freeling
Option 2: Installation from GitHub with pipx:
pipx inject sparv-pipeline https://github.com/spraakbanken/sparv-sbx-freeling/archive/latest.tar.gz
Option 3: Manual download of plugin and installation in your sparv-pipeline virtual environment:
source [path to sparv-pipeline virtual environment]/bin/activate
pip install [path to the downloaded sparv-sbx-freeling directory]
Usage
The Sparv pipeline needs a config file describing your corpus and the desired output format. Please refer to the Sparv pipeline user manual for more details on config files and running Sparv.
Example input:
<text title="Example">
This is an example for how to run Sparv.
</text>
Example command for creating XML with annotations:
sparv run
Result file:
<?xml version="1.0" encoding="UTF-8"?>
<text lix="20.00" title="Example">
<sentence>
<token baseform="this" pos="DT" upos="DET">This</token>
<token baseform="be" pos="VBZ" upos="VERB">is</token>
<token baseform="a" pos="DT" upos="DET">an</token>
<token baseform="example" pos="NN" upos="NOUN">example</token>
<token baseform="for" pos="IN" upos="ADP">for</token>
<token baseform="how" pos="WRB" upos="ADV">how</token>
<token baseform="to" pos="TO" upos="PART">to</token>
<token baseform="run" pos="VB" upos="VERB">run</token>
<token baseform="sparv" ne_type="person" pos="NP00SP0" upos="PROPN">Sparv</token>
<token baseform="." pos="Fp" upos="PUNCT">.</token>
</sentence>
</text>
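The annotations in a result file like the one above can be read back with any XML library. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `read_tokens` and the shortened inline sample are illustrative assumptions, not part of Sparv or this plugin. In practice you would load the exported file from disk instead.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example result file (illustrative only).
RESULT_XML = """<text lix="20.00" title="Example">
<sentence>
<token baseform="this" pos="DT" upos="DET">This</token>
<token baseform="be" pos="VBZ" upos="VERB">is</token>
<token baseform="sparv" ne_type="person" pos="NP00SP0" upos="PROPN">Sparv</token>
<token baseform="." pos="Fp" upos="PUNCT">.</token>
</sentence>
</text>"""


def read_tokens(xml_text):
    """Return one dict per <token> element with its text and annotations.

    Hypothetical helper: attributes that are absent on a token
    (for example ne_type) are returned as None.
    """
    root = ET.fromstring(xml_text)
    tokens = []
    for tok in root.iter("token"):
        tokens.append({
            "word": tok.text,
            "baseform": tok.get("baseform"),
            "pos": tok.get("pos"),
            "upos": tok.get("upos"),
            "ne_type": tok.get("ne_type"),
        })
    return tokens


if __name__ == "__main__":
    for token in read_tokens(RESULT_XML):
        print(token["word"], token["baseform"], token["pos"], token["upos"])
```

Returning plain dicts keeps the sketch independent of which annotations were exported; attributes that are missing simply come back as `None`.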
Additional Info about Annotations
A full list of which analyses are supported for which languages can be found here:
https://freeling-user-manual.readthedocs.io/en/latest/basics/#supported-languages
Integrating dependency parsing
FreeLing supports dependency parsing for some languages. The output format is a bit cumbersome though.
Input:
This is a sentence.
Output:
DT/top/(This this DT -) [
vb-be/modnorule/(is be VBZ -)
sn-chunk/modnorule/(sentence sentence NN -) [
DT/det/(a a DT -)
]
st-brk/modnorule/(. . Fp -)
]
It is possible to write a new parser to handle this format, but so far this has not been a priority for us.
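As a rough illustration of what such a parser might involve, here is a minimal Python sketch that reads the bracketed output shown above into a tree. It is not part of this plugin. The names `Node` and `parse_dep_tree` are hypothetical, and the reading of each line as `label/function/(form lemma tag -)` with an optional trailing `[` is an assumption inferred from the single sample, not from the FreeLing specification.

```python
import re
from dataclasses import dataclass, field

# Assumed line shape, inferred from the sample only:
#   label/function/(form lemma tag extra) [
# The trailing "[" opens a list of children; a lone "]" closes it.
NODE_RE = re.compile(
    r"^(?P<label>[^/]+)/(?P<function>[^/]+)/"
    r"\((?P<form>\S+) (?P<lemma>\S+) (?P<tag>\S+) (?P<extra>\S+)\)"
    r"\s*(?P<open>\[)?$"
)

SAMPLE = """DT/top/(This this DT -) [
vb-be/modnorule/(is be VBZ -)
sn-chunk/modnorule/(sentence sentence NN -) [
DT/det/(a a DT -)
]
st-brk/modnorule/(. . Fp -)
]"""


@dataclass
class Node:
    label: str
    function: str
    form: str
    lemma: str
    tag: str
    children: list = field(default_factory=list)


def parse_dep_tree(text):
    """Parse one bracketed tree and return its root Node (hypothetical helper)."""
    root = None
    stack = []  # nodes whose child list is currently open
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "]":
            if not stack:
                raise ValueError("unbalanced closing bracket")
            stack.pop()
            continue
        match = NODE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognised line: {line!r}")
        node = Node(match["label"], match["function"],
                    match["form"], match["lemma"], match["tag"])
        if stack:
            stack[-1].children.append(node)
        elif root is None:
            root = node
        else:
            raise ValueError("more than one root node")
        if match["open"]:
            stack.append(node)
    if stack:
        raise ValueError("unclosed opening bracket")
    return root


if __name__ == "__main__":
    tree = parse_dep_tree(SAMPLE)
    for child in tree.children:
        print(child.form, child.function, "head:", tree.form)
```

The sketch handles one tree at a time and ignores the fourth field inside the parentheses. Turning the tree into Sparv annotations (for example head references per token) would be a further step that is not covered here.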