FreeLing plug-in for Sparv (Språkbanken's corpus annotation pipeline)

sparv-sbx-freeling

This is a plugin for the Sparv pipeline containing a wrapper for FreeLing. Please note that this plugin has a more restrictive license than the Sparv pipeline itself!

This plugin allows you to run the Sparv pipeline and get sentence segmentation, tokenisation, baseform analysis, and part-of-speech annotations for the following languages:

Asturian
Catalan
English
French
Galician
German
Italian
Norwegian
Portuguese
Russian
Slovenian
Spanish

Furthermore, Sparv converts the FreeLing POS tags into Universal POS tags and outputs them as a separate annotation.

Some of these languages (Catalan, English, German, Portuguese and Spanish) also support named-entity recognition.

Prerequisites

Installation

Option 1: Installation from PyPI with pipx:

pipx inject sparv-pipeline sparv-sbx-freeling

Option 2: Installation from GitHub with pipx:

pipx inject sparv-pipeline https://github.com/spraakbanken/sparv-sbx-freeling/archive/latest.tar.gz

Option 3: Manual download of plugin and installation in your sparv-pipeline virtual environment:

source [path to sparv-pipeline virtual environment]/bin/activate
pip install [path to the downloaded sparv-sbx-freeling directory]

Usage

The Sparv pipeline needs a config file describing your corpus and the desired output format. Please refer to the Sparv pipeline user manual for more details on config files and running Sparv.

Example input:

<text title="Example">
  This is an example for how to run Sparv.
</text>

Example command for creating XML with annotations:

sparv run

Result file:

<?xml version="1.0" encoding="UTF-8"?>
<text lix="20.00" title="Example">
  <sentence>
    <token baseform="this" pos="DT" upos="DET">This</token>
    <token baseform="be" pos="VBZ" upos="VERB">is</token>
    <token baseform="a" pos="DT" upos="DET">an</token>
    <token baseform="example" pos="NN" upos="NOUN">example</token>
    <token baseform="for" pos="IN" upos="ADP">for</token>
    <token baseform="how" pos="WRB" upos="ADV">how</token>
    <token baseform="to" pos="TO" upos="PART">to</token>
    <token baseform="run" pos="VB" upos="VERB">run</token>
    <token baseform="sparv" ne_type="person" pos="NP00SP0" upos="PROPN">Sparv</token>
    <token baseform="." pos="Fp" upos="PUNCT">.</token>
  </sentence>
</text>
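
To work with the annotations programmatically, the result file can be read with any XML library. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes the export has the same element and attribute names as the example above (`sentence`, `token`, `baseform`, `pos`, `upos`, `ne_type`); the helper name `read_tokens` and the shortened `SAMPLE` string are illustrative and not part of the plugin.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the result file shown above (illustration only).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<text lix="20.00" title="Example">
  <sentence>
    <token baseform="this" pos="DT" upos="DET">This</token>
    <token baseform="be" pos="VBZ" upos="VERB">is</token>
    <token baseform="sparv" ne_type="person" pos="NP00SP0" upos="PROPN">Sparv</token>
  </sentence>
</text>
"""


def read_tokens(xml_text):
    """Return one dict per <token>: word form, sentence index and attributes."""
    # Parse from bytes so the encoding declaration in the file is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    tokens = []
    for sentence_index, sentence in enumerate(root.iter("sentence")):
        for token in sentence.iter("token"):
            row = {"sentence": sentence_index, "word": token.text}
            row.update(token.attrib)  # baseform, pos, upos, optionally ne_type
            tokens.append(row)
    return tokens


if __name__ == "__main__":
    for tok in read_tokens(SAMPLE):
        # ne_type is only present on tokens inside a named entity.
        print(tok["word"], tok["baseform"], tok["pos"], tok["upos"],
              tok.get("ne_type", "-"), sep="\t")
```

Since `ne_type` only appears on some tokens, it is read with `dict.get` rather than by direct indexing.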

Additional Info about Annotations

A full list of the analyses supported for each language can be found here:

https://freeling-user-manual.readthedocs.io/en/latest/basics/#supported-languages

Integrating dependency parsing

FreeLing supports dependency parsing for some languages. The output format is a bit cumbersome though.

Input:

This is a sentence.

Output:

DT/top/(This this DT -) [
  vb-be/modnorule/(is be VBZ -)
  sn-chunk/modnorule/(sentence sentence NN -) [
    DT/det/(a a DT -)
  ]
  st-brk/modnorule/(. . Fp -)
]

It is possible to write a new parser to handle this format, but so far this has not been a priority for us.
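
As a rough illustration of what such a parser could look like, here is a minimal Python sketch that reads the bracketed tree shown above. It is not part of the plugin and has only been checked against that one example. It assumes every node is written on its own line as `label/function/(form lemma tag extra)`, optionally followed by `[` to open a list of children, and that `]` stands alone on a line. The field names (`label`, `function`, `extra`) and the helpers `parse_tree` and `edges` are our own naming, not FreeLing terminology.

```python
import re
from dataclasses import dataclass, field

# One node per line: label/function/(form lemma tag extra), optional trailing "[".
NODE_RE = re.compile(
    r"^\s*(?P<label>[^/\s]+)/(?P<function>[^/\s]+)/"
    r"\((?P<form>\S+) (?P<lemma>\S+) (?P<tag>\S+) (?P<extra>\S+)\)"
    r"\s*(?P<opens>\[)?\s*$"
)

SAMPLE = """DT/top/(This this DT -) [
  vb-be/modnorule/(is be VBZ -)
  sn-chunk/modnorule/(sentence sentence NN -) [
    DT/det/(a a DT -)
  ]
  st-brk/modnorule/(. . Fp -)
]
"""


@dataclass
class Node:
    label: str
    function: str
    form: str
    lemma: str
    tag: str
    children: list = field(default_factory=list)


def parse_tree(text):
    """Parse one bracketed tree and return its root Node."""
    root = None
    stack = []  # nodes whose "[" has not been closed yet
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped == "]":
            if not stack:
                raise ValueError("unbalanced ']'")
            stack.pop()
            continue
        match = NODE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognised line: {line!r}")
        node = Node(match["label"], match["function"],
                    match["form"], match["lemma"], match["tag"])
        if stack:
            stack[-1].children.append(node)  # attach to the innermost open node
        elif root is None:
            root = node
        else:
            raise ValueError("more than one root node")
        if match["opens"]:
            stack.append(node)
    if stack:
        raise ValueError("missing closing ']'")
    if root is None:
        raise ValueError("empty input")
    return root


def edges(node, head=None):
    """Yield (dependent form, function, head form) triples, depth first."""
    yield (node.form, node.function, head.form if head else None)
    for child in node.children:
        yield from edges(child, node)


if __name__ == "__main__":
    for dependent, function, head in edges(parse_tree(SAMPLE)):
        print(dependent, function, head, sep="\t")
```

Flattening the tree into (dependent, function, head) triples is one possible intermediate step towards a token-level annotation; how the result would map onto Sparv annotations is left open here.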

Project details

Release history

5.2.0

Dec 18, 2023

5.0.0

Aug 10, 2022

Download files

Source Distribution

sparv-sbx-freeling-5.0.0.tar.gz (20.4 kB)

Uploaded Aug 10, 2022 Source

Built Distribution

sparv_sbx_freeling-5.0.0-py3-none-any.whl (19.9 kB)

Uploaded Aug 10, 2022 Python 3

Hashes for sparv-sbx-freeling-5.0.0.tar.gz
Algorithm	Hash digest
SHA256	`72da34308ce37278696183e6cf974a11c780a5c6d19bde46d9460b04767887fe`
MD5	`18c0b9ad56873258821a5277c30ac792`
BLAKE2b-256	`334c7bd14bb729004ee5bd7bb885e2ce631a4e644a4a99969f813c09c492814f`

Hashes for sparv_sbx_freeling-5.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`de4d2130a6d0c8d1248c5733fe6f5a0fcefcb180a9f4e4f48bfd8d1129be0a7f`
MD5	`bb70623a1a83cc8057440ec9564fe1fa`
BLAKE2b-256	`5bbc1818d48c3a8539636d1a25ea1322812d372d977985f49d389b9dba3b50ed`