A simple rule-based automatic inflection model for German
DERBI: DEutscher RegelBasierter Inflektor
DERBI (DEutscher RegelBasierter Inflektor) is a simple rule-based automatic inflection model for German based on spaCy.
Applicable regardless of POS!
How It Works
- DERBI gets an input text;
- The text is processed with the given spaCy model;
- For each word to be inflected in the text:
  - The features predicted by spaCy are overridden with the input features (where specified);
  - The word, with the resulting features, passes through the rules and gets inflected;
- The result is assembled into the output.
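The override step above can be pictured as a plain dictionary merge. The following is a minimal sketch, not DERBI's internal code; the function name override_features and the sample feature values are assumptions made for illustration.

```python
def override_features(predicted: dict, target: dict) -> dict:
    """Return the predicted features with every category in target overridden."""
    merged = dict(predicted)  # copy, so the predicted features stay untouched
    merged.update(target)     # target values win wherever both define a category
    return merged


# Hypothetical features predicted for "Entwickler" in some context
predicted = {'Case': 'Nom', 'Gender': 'Masc', 'Number': 'Sing'}
# Features requested by the caller
target = {'Case': 'Dat', 'Number': 'Plur'}

result = override_features(predicted, target)
print(result)  # {'Case': 'Dat', 'Gender': 'Masc', 'Number': 'Plur'}
```

Categories that are absent from the target (here Gender) keep the predicted value, which is why only the differing features need to be passed.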
For the arguments, see below.
Installation
Via pip
pip install DERBI
Via git clone
Install all necessary packages:
pip install -r requirements.txt
Clone DERBI:
git clone https://github.com/maxschmaltz/DERBI
or
from git import Repo
Repo.clone_from('https://github.com/maxschmaltz/DERBI', 'DERBI')
Simple Usage
Note that DERBI works with spaCy. Make sure you have installed one of the spaCy pipelines for German.
Example
# python -m spacy download de_core_news_md
import spacy
nlp = spacy.load('de_core_news_md')
from DERBI.derbi import DERBI
derbi = DERBI(nlp)
derbi(
'DERBI sein machen, damit es all Entwickler ein Möglichkeit geben, jedes deutsche Wort automatisch zu beugen',
[{'Number': 'Sing', 'Person': '3', 'Verbform': 'Fin'}, # sein -> ist
{'Verbform': 'Part'}, # machen -> gemacht
{'Case': 'Dat', 'Number': 'Plur'}, # all -> allen
{'Case': 'Dat', 'Number': 'Plur'}, # Entwickler -> entwicklern
{'Gender': 'Fem'}, # ein -> eine
{'Number': 'Sing', 'Person': '3', 'Verbform': 'Fin'}, # geben -> gibt
{'Case': 'Acc', 'Number': 'Plur'}, # jedes -> jede
{'Case': 'Acc', 'Declination': 'Weak', 'Number': 'Plur'}, # deutsche -> deutschen
{'Case': 'Acc', 'Number': 'Plur'}], # wort -> wörter
[1, 2, 6, 7, 8, 10, 12, 13, 14]
)
# Output:
'derbi ist gemacht , damit es allen entwicklern eine möglichkeit gibt , jede deutschen wörter automatisch zu beugen'
Arguments
__init__() Arguments
- model: spacy.lang.de.German
Any of the spaCy pipelines for German. If model is not of the type spacy.lang.de.German, an exception is raised.
__call__() Arguments
- text: str
Input text, containing the words to be inflected. It is strongly recommended to call DERBI with a text, not a single word, as spaCy predictions vary depending on the context.
- target_tags: dict or list[dict]
Dicts of category-feature values for each word to be inflected. If None, no inflection is performed. Default is None. NB! As these features override the ones predicted by spaCy, only the differing ones need to be specified in target_tags. Note, though, that spaCy predictions are not always correct, so for more precise DERBI output we recommend specifying the desired features in full. Note also that if no tags are provided for an obligatory category (neither by spaCy nor in target_tags), DERBI restores them as defaults; default feature values are available in ValidFeatures (the first element for every category).
- indices: int or list[int]
Indices of the words to be inflected. Default is 0. NB! The order of the indices must correspond to the order of the target tags. Note also that the input text is tokenized with the given spaCy model's tokenizer, so the indices index a spacy.tokens.Doc instance.
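Since target_tags and indices are matched by position, it can help to check the pairing before calling DERBI. Below is a minimal sketch of such a check; pair_tags_with_indices is a hypothetical helper, not part of DERBI, and DERBI's own validation may differ.

```python
def pair_tags_with_indices(target_tags, indices=0):
    """Pair each token index with its tag dict, accepting single values or lists."""
    if target_tags is None:
        return []  # nothing to inflect
    tags = [target_tags] if isinstance(target_tags, dict) else list(target_tags)
    idxs = [indices] if isinstance(indices, int) else list(indices)
    if len(tags) != len(idxs):
        raise ValueError(f'{len(tags)} tag dicts given for {len(idxs)} indices')
    return list(zip(idxs, tags))


pairs = pair_tags_with_indices(
    [{'Number': 'Sing', 'Person': '3', 'Verbform': 'Fin'}, {'Verbform': 'Part'}],
    [1, 2],
)
for index, tags in pairs:
    print(index, tags)
```

A mismatch between the two lists raises a ValueError here, which makes an off-by-one in the index list visible before any inflection is attempted.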
Output
Returns str: the input text, where the specified words are replaced with the inflection results. The output is normalized.
Tags
DERBI uses Universal POS tags and Universal Features (so does spaCy) with some extensions of features (not POSs). See LabelScheme and ValidFeatures for more details.
The following category-feature values can be used in target_tags:
Category (explanation) | Valid Features (explanation) | In Universal Features |
---|---|---|
Case | Acc (Accusative), Dat (Dative), Gen (Genitive), Nom (Nominative) | Yes |
Declination (Applicable to words with adjectival declension. In German, such words are declined differently depending on the left context.) | Mixed, Strong, Weak | No |
Definite (Definiteness) | Def (Definite), Ind (Indefinite) | Yes |
Degree (Degree of comparison) | Cmp (Comparative), Pos (Positive), Sup (Superlative) | Yes |
Foreign (Whether the word is foreign. Applies to POS X.) | Yes | Yes |
Gender | Fem (Feminine), Masc (Masculine), Neut (Neuter) | Yes |
Mood | Imp (Imperative), Ind (Indicative), Sub (Subjunctive; NB! Sub is for Konjunktiv I when Tense=Pres and for Konjunktiv II when Tense=Past) | Yes |
Number | Plur (Plural), Sing (Singular) | Yes |
Person | 1, 2, 3 | Yes |
Poss (Whether the word is possessive. Applies to pronouns and determiners.) | Yes | Yes |
Prontype (Type of a pronoun, a determiner, a quantifier or a pronominal adverb.) | Art (Article), Dem (Demonstrative), Ind (Indefinite), Int (Interrogative), Prs (Personal), Rel (Relative) | Yes |
Reflex (Whether the word is reflexive. Applies to pronouns and determiners.) | Yes | Yes |
Tense | Past, Pres (Present) | Yes |
Verbform (Form of a verb) | Fin (Finite), Inf (Infinitive), Part (Participle; NB! Part is for Partizip I when Tense=Pres and for Partizip II when Tense=Past) | Yes |
Note, though, that the categories Definite, Foreign, Poss, Prontype and Reflex cannot be altered by DERBI, so there is no need to specify them.
NB! DERBI accepts tags only in capitalized form (first letter uppercase, the rest lowercase). For example, use Prontype, not PronType.
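Universal Features write some category names in camel case (PronType, VerbForm), so tags copied from there need converting first. Here is a small sketch, assuming the features arrive as a Universal Features string such as Case=Dat|Number=Plur; to_derbi_tags is a hypothetical helper, not part of DERBI.

```python
def to_derbi_tags(ud_feats: str) -> dict:
    """Convert a 'Cat=Val|Cat=Val' string into a dict with capitalized categories."""
    tags = {}
    for pair in ud_feats.split('|'):
        if not pair or pair == '_':  # skip empty input and the CoNLL-U "no features" mark
            continue
        category, _, value = pair.partition('=')
        # str.capitalize() uppercases the first letter and lowercases the rest,
        # turning e.g. 'VerbForm' into 'Verbform'
        tags[category.capitalize()] = value
    return tags


print(to_derbi_tags('Number=Sing|Person=3|VerbForm=Fin'))
# {'Number': 'Sing', 'Person': '3', 'Verbform': 'Fin'}
```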
Performance
Disclaimer
For evaluation we used the Universal Dependencies German treebanks. Unfortunately, their GitHub repositories contain only .conllu files, so we had to download some of the .txt datasets and add them to our repository. We do not distribute these datasets, though; it is your responsibility to determine whether you have permission to use them.
Evaluation
Evaluation was conducted on the dataset de_lit-ud-test.txt from the Universal Dependencies German LIT treebank (≈31k tokens). Accuracy:
POS | de_core_news_md | de_core_news_sm | de_core_news_lg |
---|---|---|---|
Overall | 0.947 | 0.949 | 0.95 |
ADJ | 0.81 | 0.847 | 0.841 |
ADP | 0.998 | 0.998 | 0.998 |
ADV | 0.969 | 0.972 | 0.968 |
AUX | 0.915 | 0.921 | 0.912 |
CCONJ | 1.0 | 1.0 | 1.0 |
DET | 0.988 | 0.992 | 0.988 |
INTJ | 1.0 | 1.0 | 1.0 |
NOUN | 0.958 | 0.959 | 0.962 |
NUM | 0.935 | 0.987 | 0.914 |
PART | 1.0 | 1.0 | 1.0 |
PRON | 0.921 | 0.929 | 0.928 |
PROPN | 0.941 | 0.926 | 0.916 |
SCONJ | 0.999 | 0.999 | 0.996 |
VERB | 0.813 | 0.792 | 0.824 |
X | 1.0 | 1.0 | 1.0 |
If you are interested in the way we obtained the results, please refer to test0.py.
Or you could check it with the following code:
from DERBI.test import test0
test0.main()
Note that performance may vary depending on the dataset. Also remember that spaCy may make mistakes in its predictions (that is, in some cases the DERBI inflection is correct but does not correspond to spaCy's tags), which also affects the evaluation.
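For orientation, per-POS accuracy figures like those in the table can be computed as shown below. This is an illustrative sketch with made-up records, not the code in test0.py; the record layout (POS, predicted form, gold form) is an assumption.

```python
from collections import defaultdict


def accuracy_by_pos(records):
    """Compute accuracy per POS tag and overall from (pos, predicted, gold) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pos, predicted, gold in records:
        total[pos] += 1
        correct[pos] += predicted == gold  # True counts as 1, False as 0
    scores = {pos: correct[pos] / total[pos] for pos in total}
    n = sum(total.values())
    scores['Overall'] = sum(correct.values()) / n if n else 0.0
    return scores


# Made-up sample records, for illustration only
records = [
    ('VERB', 'gibt', 'gibt'),
    ('VERB', 'gebt', 'gibt'),
    ('NOUN', 'wörter', 'wörter'),
    ('NOUN', 'entwicklern', 'entwicklern'),
]
print(accuracy_by_pos(records))
# {'VERB': 0.5, 'NOUN': 1.0, 'Overall': 0.75}
```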
License
Copyright 2022 Max Schmaltz: @maxschmaltz
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.