Natural language processing for Icelandic

GreynirSeq

GreynirSeq is a natural language parsing toolkit for Icelandic focused on sequence modeling with neural networks. It is under active development and is in its early stages.

The modeling part (nicenlp) of GreynirSeq is built on top of the excellent fairseq from Facebook, which in turn is built on PyTorch.

GreynirSeq is licensed under the GNU Affero General Public License v3 (AGPLv3) unless otherwise stated at the top of a file.

What's new?

  • An Icelandic RoBERTa model, IceBERT, fine-tuned for NER and POS tagging.
  • Icelandic-English translation.

What's on the horizon?

  • More fine-tuning tasks for Icelandic: constituency parsing and grammatical error detection.

Be aware that using the CLI, or otherwise downloading model files, will download gigabytes of data. Simply installing greynirseq does not download any models; they are downloaded automatically on demand.

Installation

In a suitable virtual environment:

# From PyPI
$ pip install greynirseq
# or from git main branch
$ pip install git+https://github.com/mideind/greynirseq@main
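
To confirm that the install landed in the active environment without triggering any model download, a small check along these lines may help. It is only a sketch using the standard library's importlib.metadata; the helper name installed_version is hypothetical and not part of GreynirSeq.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "greynirseq") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    Only package metadata is read, so no model files are downloaded.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("greynirseq is not installed in this environment")
    else:
        print(f"greynirseq {version} is installed")
```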

Features

TL;DR give me the CLI

The greynirseq CLI can be used to run pretrained models for various tasks. Run pip install greynirseq && greynirseq -h to see which options are available.

POS

Input is read from a file containing a single tokenized sentence per line, or from stdin.

$ echo "Systurnar Guðrún og Monique átu einar um jólin á McDonalds ." | greynirseq pos --input -

nvfng nven-s c n---s sfg3fþ lvfnsf af nhfog af n----s pl
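
The output above has one tag per input token, in order (11 tokens, 11 tags). Assuming that holds in general, the tags can be paired back with their tokens as in this sketch. The helper pair_tokens_with_tags is hypothetical and only post-processes the printed text; it does not call GreynirSeq.

```python
from typing import List, Tuple


def pair_tokens_with_tags(sentence: str, tag_line: str) -> List[Tuple[str, str]]:
    """Pair each whitespace-separated token with the tag at the same position.

    Assumes the CLI prints exactly one tag per token, as in the example above.
    """
    tokens = sentence.split()
    tags = tag_line.split()
    if len(tokens) != len(tags):
        raise ValueError(
            f"expected one tag per token, got {len(tokens)} tokens and {len(tags)} tags"
        )
    return list(zip(tokens, tags))


if __name__ == "__main__":
    sentence = "Systurnar Guðrún og Monique átu einar um jólin á McDonalds ."
    tag_line = "nvfng nven-s c n---s sfg3fþ lvfnsf af nhfog af n----s pl"
    for token, tag in pair_tokens_with_tags(sentence, tag_line):
        print(f"{token}\t{tag}")
```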

NER

Input is read from a file containing a single tokenized sentence per line, or from stdin.

$ echo "Systurnar Guðrún og Monique átu einar um jólin á McDonalds ." | greynirseq ner --input -

O B-Person O B-Person O O O O O B-Organization O
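
To turn the label line above into a list of named entities, the labels can be grouped into spans. The sketch below assumes a BIO-style scheme: the example only shows B- and O labels, so the handling of I- continuation labels is an assumption, and extract_entities is a hypothetical helper rather than part of GreynirSeq.

```python
from typing import List, Tuple


def extract_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-style labels into (entity_text, entity_type) pairs.

    "B-X" starts an entity of type X. "I-X" (assumed, not shown in the example
    output) continues an open entity of the same type. Anything else, including
    an I- label with no matching open entity, closes the current entity.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = ""

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = label[2:]
        elif label.startswith("I-") and current_tokens and label[2:] == current_type:
            current_tokens.append(token)
        else:
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = []
            current_type = ""

    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


if __name__ == "__main__":
    tokens = "Systurnar Guðrún og Monique átu einar um jólin á McDonalds .".split()
    labels = "O B-Person O B-Person O O O O O B-Organization O".split()
    print(extract_entities(tokens, labels))
```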

Translation

Input is read from a file containing a single untokenized sentence per line, or from stdin.

# For en->is translation
$ echo "This is an awesome test that shows how to use a pretrained translation model." | greynirseq translate --source-lang en --target-lang is

Þetta er æðislegt próf sem sýnir hvernig nota má forprófað þýðingarlíkan.

# For is->en translation
$ echo "Þetta er æðislegt próf sem sýnir hvernig nota má forprófað þýðingarlíkan." | greynirseq translate --source-lang is --target-lang en

This is an awesome test that shows how a pre-tested translation model can be used.
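
The same commands can be driven from a script by piping text to the CLI. The sketch below uses only the subcommand and flags shown above (translate, --source-lang, --target-lang); the wrapper functions are hypothetical, and it does not use GreynirSeq's own Python API. Note that the first call will download the translation model.

```python
import subprocess
from typing import List


def build_translate_command(source_lang: str, target_lang: str) -> List[str]:
    """Build the argument list for the greynirseq translate command shown above."""
    return [
        "greynirseq",
        "translate",
        "--source-lang",
        source_lang,
        "--target-lang",
        target_lang,
    ]


def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Pipe `text` to the CLI on stdin and return whatever it prints on stdout.

    Requires greynirseq to be installed and on PATH. Raises
    subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        build_translate_command(source_lang, target_lang),
        input=text + "\n",  # mirror `echo`, which appends a newline
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.strip()


# Example (downloads the model on first use):
# print(translate("This is a test.", "en", "is"))
```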

Neural Icelandic Language Processing - NIceNLP

IceBERT is an Icelandic BERT-based (RoBERTa) language model that is suitable for fine-tuning on downstream tasks.

The following fine-tuned tasks are available both through the greynirseq CLI and for loading programmatically.

  1. POS tagging
  2. NER tagging

There are also some translation models available. They are Transformer models, either trained from scratch or fine-tuned from mBART25.

  1. Translation

Development

To install GreynirSeq in development mode, we recommend using poetry:

pip install poetry && poetry install

Linting

All code is checked with Super-Linter in a GitHub Action. We recommend running it locally before pushing:

./run_linter.sh

Type annotation

Type annotations will soon be checked with mypy and should be included in new code.

Download files

Source Distribution

greynirseq-0.4.tar.gz (132.6 kB)

Uploaded Source

Built Distribution

greynirseq-0.4-cp39-cp39-manylinux_2_34_x86_64.whl (1.6 MB)

Uploaded CPython 3.9 manylinux: glibc 2.34+ x86-64

File details

Details for the file greynirseq-0.4.tar.gz.

File metadata

  • Download URL: greynirseq-0.4.tar.gz
  • Size: 132.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.5 CPython/3.9.7 Linux/5.13.0-40-generic

File hashes

Hashes for greynirseq-0.4.tar.gz
Algorithm Hash digest
SHA256 409c6675c5c6fe62dfbb37278e31e08468c5ed43550398a94495dd8d07636982
MD5 6539621dce90e5810672e57bbb60fbc1
BLAKE2b-256 0eab28d22de928aefea103fabc420e8312a0696d5f229675ad17b4a4b9a1f87a
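
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; the function names and the local file path are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example, assuming the sdist was saved to the current directory:
# matches_expected(
#     Path("greynirseq-0.4.tar.gz"),
#     "409c6675c5c6fe62dfbb37278e31e08468c5ed43550398a94495dd8d07636982",
# )
```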

File details

Details for the file greynirseq-0.4-cp39-cp39-manylinux_2_34_x86_64.whl.

File hashes

Hashes for greynirseq-0.4-cp39-cp39-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 8a7ef2ade73d3268ddc3817016d9e16d29957518f7aadfbdd7dd82dd1b8b9c82
MD5 0368143d1ff32d1569583c531b81f920
BLAKE2b-256 547a68ccd887f7a2c396ef7d944ebc7b729d8a51446d52cfaae5d0de59cfbe21
