SFST/SMOR/DWDS-based German morphology
DWDSmor – German Morphology
DWDSmor implements the lemmatisation and morphosyntactic analysis of word forms as well as the generation of paradigms of lexical words in written German. DWDSmor’s finite-state transducers (the DWDSmor automata) map word forms to specifications of corresponding lexical words and morphosyntactic categories. By traversing such transducers,
- a given word form can be analysed and lemmatised, or
- a lexical word together with a set of morphosyntactic tags will generate corresponding inflected word forms.
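The two traversal directions can be pictured with a toy relation between word forms and analysis strings. This is only an illustrative sketch: the pairs are taken from the example output further below, the names `RELATION`, `analyse` and `generate` are hypothetical, and real DWDSmor automata are compiled transducers rather than Python lists.

```python
# Toy stand-in for a transducer: a finite relation between surface forms
# and analysis strings. It only mimics the two traversal directions.
RELATION = [
    ("gebildet", "bild<~>en<+V><Part><Perf><haben>"),
    ("gebildet", "ge<~>bild<~>et<+ADJ><Pos><Pred/Adv>"),
    ("bilde", "bild<~>en<+V><1><Sg><Pres><Ind>"),
    ("bilde", "bild<~>en<+V><1><Sg><Pres><Subj>"),
]


def analyse(word_form):
    """Return every analysis string paired with the given word form."""
    return [analysis for form, analysis in RELATION if form == word_form]


def generate(analysis):
    """Return every word form paired with the given analysis string."""
    return [form for form, a in RELATION if a == analysis]


if __name__ == "__main__":
    print(analyse("gebildet"))
    print(generate("bild<~>en<+V><1><Sg><Pres><Ind>"))
```

A word form may have several analyses, which is why both functions return lists.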
The DWDSmor automata are compiled and traversed via SFST, a C++ library and toolbox for finite-state transducers (FSTs). In addition, a DWDSmor Python library is provided, using the SFST Python bindings.
How much of the German language the DWDSmor automata cover depends on
- a DWDSmor lexicon, which declares lexical entries with lemmas, stem forms, word classes, inflection classes, etc., and
- the DWDSmor grammar, which defines lemmatisation, inflection, and word-formation rules for written German.
While the DWDSmor grammar for word formation is still a work in progress, its inflection grammar is fairly comprehensive. The inflection grammar as well as the lexicon format are based on (heavily modified) code from SMORLemma, which in turn is derived from the Stuttgart Morphology (SMOR).
As a rule, the entries in a DWDSmor lexicon are extracted from a source lexicon comprising a set of XML files in the format of the DWDS dictionary.
From a DWDSmor lexicon and the DWDSmor grammar, a DWDSmor edition with several automata types can be compiled:
- lemma: automaton with inflection and word-formation components, for lemmatisation and morphosyntactic analysis of word forms in terms of grammatical categories.
- lemma2: variant of lemma, for the generation of morphologically segmented word forms.
- finite: variant of lemma with a finite word-formation component, for testing purposes.
- root: automaton with inflection and word-formation components, for lexical analysis of word forms in terms of root lemmas (i.e., lemmas of ultimate word-formation bases), word-formation process, word-formation means, and grammatical categories in terms of the Pattern-and-Restriction Theory of word formation (Nolda 2022).
- root2: variant of root, for the generation of morphologically segmented word forms.
- index: automaton with an inflection component only and with DWDS homographic lemma indices, for paradigm generation.
Automata are built in two formats: in standard format (with file extension .a)
for generation and in compact format (with file extension .ca) for analysis.
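The relation between automaton type, traversal direction and file extension can be sketched as a small helper. The file naming scheme `<type>.a` / `<type>.ca` and the function `automaton_path` are assumptions made for illustration; only the type names and the two extensions come from the description above.

```python
from pathlib import Path

# Automaton types named in this README.
AUTOMATON_TYPES = ("lemma", "lemma2", "finite", "root", "root2", "index")

# Standard format (.a) for generation, compact format (.ca) for analysis.
SUFFIX_BY_DIRECTION = {"generation": ".a", "analysis": ".ca"}


def automaton_path(directory, automaton_type, direction):
    """Build a hypothetical file path for an automaton of the given type."""
    if automaton_type not in AUTOMATON_TYPES:
        raise ValueError(f"unknown automaton type: {automaton_type!r}")
    if direction not in SUFFIX_BY_DIRECTION:
        raise ValueError(f"unknown direction: {direction!r}")
    return Path(directory) / (automaton_type + SUFFIX_BY_DIRECTION[direction])


if __name__ == "__main__":
    print(automaton_path("build", "lemma", "analysis"))
    print(automaton_path("build", "index", "generation"))
```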
DWDSmor is released in two editions:
- the Open Edition, based on a sample selection of DWDS lemmas and their grammatical specifications, and
- the DWDS Edition, derived from the complete lexical dataset of the DWDS dictionary.
The DWDS Edition is only available upon request for research purposes while the Open Edition is released freely for general use and experiments.
The coverage of DWDSmor is benchmarked regularly against the German Universal Dependencies HDT treebank. In the DWDS Edition, the coverage ratios typically range from 95 % to 100 % for most word classes; notable exceptions include foreign-language words and named entities, which are barely part of the underlying DWDS dictionary and thus poorly covered by DWDSmor. In the Open Edition, the coverage ratios of open word classes are lower, due to the limited size of the sample source lexicon.
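A coverage ratio per word class can be computed along the following lines. This is a minimal sketch, not the project's benchmark code: the function `coverage_by_pos`, the token list and the set of known forms are invented toy data and do not reflect actual benchmark results.

```python
from collections import Counter


def coverage_by_pos(tokens, is_covered):
    """Return the share of covered tokens for each word class.

    tokens: iterable of (word_form, pos) pairs, e.g. read from a treebank.
    is_covered: callable telling whether a (word_form, pos) pair is analysable.
    """
    total = Counter()
    covered = Counter()
    for form, pos in tokens:
        total[pos] += 1
        if is_covered(form, pos):
            covered[pos] += 1
    return {pos: covered[pos] / total[pos] for pos in total}


if __name__ == "__main__":
    # Toy data for illustration only.
    toy_tokens = [("Haus", "NOUN"), ("Xyzzy", "NOUN"), ("geht", "VERB")]
    toy_known = {"Haus", "geht"}
    print(coverage_by_pos(toy_tokens, lambda form, pos: form in toy_known))
```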
Usage
The DWDSmor Open Edition is available via the Python Package Index (PyPI):
pip install dwdsmor
The library can be used for lemmatisation:
>>> import dwdsmor
>>> lemmatizer = dwdsmor.lemmatizer()
>>> assert lemmatizer("getestet", pos={"V"}).analysis == "testen"
>>> assert lemmatizer("getestet", pos={"ADJ"}).analysis == "getestet"
There is also integration with spaCy:
pip install spacy de_zdl_lg --extra-index-url https://gitup.uni-potsdam.de/api/v4/projects/21461/packages/pypi/simple
>>> import spacy
>>> import dwdsmor.spacy
>>> nlp = spacy.load("de_zdl_lg")
>>> nlp.add_pipe("dwdsmor")
<dwdsmor.spacy.Component object at 0x7f99e634f220>
>>> tuple(t._.dwdsmor.analysis for t in nlp("Man sah neben diversen ICEs auch viele schöne Altbauten."))
('man', 'sehen', 'neben', 'divers', 'ICE', 'auch', 'viel', 'schön', 'Altbau', '.')
In addition to the Python API, the package provides a simple
command-line interface named dwdsmor. To analyze a word form, pass it
as an argument:
$ dwdsmor gebildet
Wordform Lemma Analysis POS Degree Function Nonfinite Tense Auxiliary
gebildet bilden bild<~>en<+V><Part><Perf><haben> V Part Perf haben
gebildet gebildet ge<~>bild<~>et<+ADJ><Pos><Pred/Adv> ADJ Pos Pred/Adv
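The strings in the Analysis column can be split into their parts with a few lines of Python. The sketch below assumes that `<~>` marks a segment boundary inside the lemma and that the first tag, introduced by `<+`, names the word class; `parse_analysis` is a hypothetical helper, not part of the DWDSmor library.

```python
import re

# A tag is any run of characters other than angle brackets, enclosed in <...>.
TAG = re.compile(r"<([^<>]+)>")


def parse_analysis(analysis):
    """Split an analysis string into lemma, segments, word class and features."""
    head, sep, tail = analysis.partition("<+")
    if not sep:
        raise ValueError(f"no word-class tag in {analysis!r}")
    # Restore the opening bracket that partition() consumed, then collect tags.
    tags = TAG.findall("<" + tail)
    segments = head.split("<~>")
    return {
        "lemma": "".join(segments),
        "segments": segments,
        "pos": tags[0],
        "features": tags[1:],
    }


if __name__ == "__main__":
    print(parse_analysis("bild<~>en<+V><Part><Perf><haben>"))
    print(parse_analysis("ge<~>bild<~>et<+ADJ><Pos><Pred/Adv>"))
```

For the two analyses of gebildet shown above, the joined segments reproduce the values of the Lemma column (bilden and gebildet).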
To generate all word forms for a lexical word, pass it (or a form
which can be analyzed as the lexical word) as an argument together
with the option -g:
$ dwdsmor -g gebildet
[…]
Wordform Lemma Analysis POS Degree Function Person Gender Case Number Nonfinite Tense Mood Auxiliary Inflection
[…]
gebildete gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><St> ADJ Pos Attr/Subst Fem Acc Sg St
gebildete gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><Wk> ADJ Pos Attr/Subst Fem Acc Sg Wk
gebildeter gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><St> ADJ Pos Attr/Subst Fem Dat Sg St
gebildeten gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><Wk> ADJ Pos Attr/Subst Fem Dat Sg Wk
gebildeter gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><St> ADJ Pos Attr/Subst Fem Gen Sg St
gebildeten gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><Wk> ADJ Pos Attr/Subst Fem Gen Sg Wk
gebildete gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Nom><Sg><St> ADJ Pos Attr/Subst Fem Nom Sg St
gebildete gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Nom><Sg><Wk> ADJ Pos Attr/Subst Fem Nom Sg Wk
gebildeten gebildet gebildet<+ADJ><Pos><Attr/Subst><Masc><Acc><Sg><St> ADJ Pos Attr/Subst Masc Acc Sg St
[…]
bildeten bilden bild<~>en<+V><1><Pl><Past><Ind> V 1 Pl Past Ind
bildeten bilden bild<~>en<+V><1><Pl><Past><Subj> V 1 Pl Past Subj
bilden bilden bild<~>en<+V><1><Pl><Pres><Ind> V 1 Pl Pres Ind
bilden bilden bild<~>en<+V><1><Pl><Pres><Subj> V 1 Pl Pres Subj
bildete bilden bild<~>en<+V><1><Sg><Past><Ind> V 1 Sg Past Ind
bildete bilden bild<~>en<+V><1><Sg><Past><Subj> V 1 Sg Past Subj
bilde bilden bild<~>en<+V><1><Sg><Pres><Ind> V 1 Sg Pres Ind
bilde bilden bild<~>en<+V><1><Sg><Pres><Subj> V 1 Sg Pres Subj
bildetet bilden bild<~>en<+V><2><Pl><Past><Ind> V 2 Pl Past Ind
[…]
More sophisticated tools for analysis and paradigm generation with the
DWDSmor Python library are provided by the Python commands dwdsmor-analysis
and dwdsmor-paradigm:
$ echo gebildet | dwdsmor-analysis
Wordform Lemma POS Auxiliary Degree Nonfinite Function Tense
gebildet bilden V haben Part Perf
gebildet gebildet ADJ Pos Pred/Adv
$ dwdsmor-paradigm gebildet
Lemma POS Degree Gender Case Number Inflection Function Paradigm Forms
gebildet ADJ Pos Pred/Adv gebildet
gebildet ADJ Pos Masc Nom Sg St Attr/Subst gebildeter
gebildet ADJ Pos Masc Nom Sg Wk Attr/Subst gebildete
[…]
gebildet ADJ Pos Neut Nom Sg St Attr/Subst gebildetes
gebildet ADJ Pos Neut Nom Sg Wk Attr/Subst gebildete
[…]
gebildet ADJ Pos Fem Nom Sg St Attr/Subst gebildete
gebildet ADJ Pos Fem Nom Sg Wk Attr/Subst gebildete
[…]
gebildet ADJ Pos UnmGend Nom Pl St Attr/Subst gebildete
gebildet ADJ Pos UnmGend Nom Pl Wk Attr/Subst gebildeten
[…]
For more options, cf. the output of dwdsmor-analysis -h and
dwdsmor-paradigm -h.
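Output of this kind is convenient to post-process as a table. The following sketch assumes that the commands emit tab-separated values with a header row, in which case Python's `csv` module is sufficient. The sample text mirrors the dwdsmor-analysis output shown above, with the placement of empty cells reconstructed by hand; `read_rows` is a hypothetical helper.

```python
import csv
import io


def read_rows(text):
    """Parse tab-separated output with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return list(reader)


# Hand-made sample; empty cells are placed according to the header names.
SAMPLE = (
    "Wordform\tLemma\tPOS\tAuxiliary\tDegree\tNonfinite\tFunction\tTense\n"
    "gebildet\tbilden\tV\thaben\t\tPart\t\tPerf\n"
    "gebildet\tgebildet\tADJ\t\tPos\t\tPred/Adv\t\n"
)

if __name__ == "__main__":
    for row in read_rows(SAMPLE):
        # Show only the cells that are actually filled.
        print({key: value for key, value in row.items() if value})
```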
Installing the DWDS Edition
Should you have been granted access to the DWDS Edition, please add
your private access token to $HOME/.netrc:
machine gitup.uni-potsdam.de login gitlab-ci-token password glpat-…
Then install the edition:
pip install dwdsmor-dwds --index-url https://gitup.uni-potsdam.de/api/v4/projects/21585/packages/pypi/simple
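If the installation fails with an authentication error, it may help to check that the .netrc entry can be read at all. The sketch below uses the `netrc` module from Python's standard library; the helper `has_credentials` is hypothetical, and the demo writes a throw-away file with a placeholder password instead of touching your real $HOME/.netrc.

```python
import netrc
import os
import tempfile

HOST = "gitup.uni-potsdam.de"


def has_credentials(path, host=HOST):
    """Return True if the netrc file at path has login and password for host."""
    try:
        entry = netrc.netrc(path).authenticators(host)
    except (OSError, netrc.NetrcParseError):
        return False
    # authenticators() returns (login, account, password) or None.
    return entry is not None and bool(entry[0]) and bool(entry[2])


# Demo against a temporary file; PLACEHOLDER stands in for a real token.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, "netrc")
    with open(demo_file, "w", encoding="utf-8") as handle:
        handle.write(f"machine {HOST} login gitlab-ci-token password PLACEHOLDER\n")
    demo_ok = has_credentials(demo_file)
    demo_other_host = has_credentials(demo_file, host="example.org")

if __name__ == "__main__":
    print(demo_ok, demo_other_host)
```

To check the real file, call `has_credentials(os.path.expanduser("~/.netrc"))`.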
Development
DWDSmor is in active development. In its current stage, it supports all major inflection classes and some productive word-formation patterns of written German.
Prerequisites
- GNU/Linux: Development, builds and tests of DWDSmor are performed on Debian GNU/Linux. Other UNIX-like operating systems such as macOS should work as well, but they are not actively supported.
- Python ≥ v3.12: DWDSmor provides a Python interface for building DWDSmor lexica from source lexica in the DWDS XML format and for compiling DWDSmor automata from the resulting DWDSmor lexica and the DWDSmor grammar.
- Saxon-HE: The entries in DWDSmor lexica are extracted from source lexica in the DWDS XML format by means of XSLT 2 stylesheets, using Saxon-HE as an XSLT processor. Saxon requires a Java runtime environment.
- SFST: The DWDSmor automata are compiled using the SFST C++ library and toolbox for finite-state transducers (FSTs).
On a Debian-based distribution, the following command installs the required software:
sudo apt install python3 default-jdk libsaxonhe-java sfst
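Whether the prerequisites are in place can be checked with a short script. This is a rough sketch under stated assumptions: it looks for a `java` executable and for `fst-compiler-utf8`, one of the command-line tools shipped with SFST, on the PATH. It does not verify the Saxon-HE installation, and DWDSmor itself may rely on the SFST Python bindings rather than on these executables. The function `check_prerequisites` is hypothetical.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 12)
# Executables used as indicators for a Java runtime and an SFST installation.
REQUIRED_TOOLS = ("java", "fst-compiler-utf8")


def check_prerequisites():
    """Return a dict mapping each prerequisite to True (found) or False."""
    report = {"python": sys.version_info >= REQUIRED_PYTHON}
    for tool in REQUIRED_TOOLS:
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```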
Project setup
Optionally, set up a Python virtual environment for project builds,
e.g. via Python’s venv:
python3 -m venv .venv
source .venv/bin/activate
Then install DWDSmor, including development dependencies:
pip install -U pip setuptools && pip install -r requirements.dev.txt
Building the DWDS Edition
Install additional dependencies:
pip install -U pip setuptools && pip install -r requirements.dwds.txt
Download the lexicon:
GITUP_PRIVATE_TOKEN="…" python -m dwdsmor.build.dwdswb
Building lexica and automata
Building different editions and automata is facilitated via the
dwdsmor.build module. To build the default Open Edition, simply run:
python -m dwdsmor.build
For more build options, run:
python -m dwdsmor.build -h
Testing
In order to test for basic automata functionality and potential regressions, run
pytest
License
Like the original SMOR and SMORLemma grammars, the DWDSmor grammar and the DWDSmor Python library are licensed under the GNU General Public License v2.0. The same applies to the sample source lexicon and the automata of the Open Edition.
For the DWDS Edition based on the complete DWDS dictionary, all rights are reserved and individual license terms apply. If you are interested in the automata of the DWDS Edition, please contact us.
Contact
Feel free to contact Andreas Nolda for any question about this project.
Credits
DWDSmor is based on the following software and datasets:
- SFST, a C++ library and toolbox for finite-state transducers (FSTs) (Schmid 2006).
- SMORLemma (Sennrich and Kunz 2014), a modified version of the Stuttgart Morphology (SMOR) (Schmid, Fitschen, and Heid 2004) with an alternative lemmatisation component.
- The DWDS dictionary (BBAW n.d.), which replaces IMSLex (Fitschen 2004) as the lexical data source for German words, their grammatical categories, and their morphosyntactic properties.
References
- Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.). DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur deutschen Sprache in Geschichte und Gegenwart. Online.
- Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes System. Ph.D. thesis, Universität Stuttgart. PDF.
- Nolda, Andreas (2022). Headedness as an epiphenomenon: Case studies on compounding and blending in German. In Headedness and/or Grammatical Anarchy?, ed. by Ulrike Freywald, Horst Simon, and Stefan Müller, Empirically Oriented Theoretical Morphology and Syntax 11, Berlin: Language Science Press, 343–376. PDF.
- Schmid, Helmut (2006). A programming language for finite state transducers. In Finite-State Methods and Natural Language Processing: 5th International Workshop, FSMNLP 2005, Helsinki, Finland, September 1–2, 2005, ed. by Anssi Yli-Jyrä, Lauri Karttunen, and Juhani Karhumäki, Lecture Notes in Artificial Intelligence 4002, Berlin: Springer, 1263–1266. PDF.
- Schmid, Helmut, Arne Fitschen, and Ulrich Heid (2004). SMOR: A German computational morphology covering derivation, composition, and inflection. In LREC 2004: Fourth International Conference on Language Resources and Evaluation, ed. by Maria T. Lino et al., European Language Resources Association, 1263–1266. PDF.
- Sennrich, Rico and Beat Kunz (2014). Zmorge: A German morphological lexicon extracted from Wiktionary. In LREC 2014: Ninth International Conference on Language Resources and Evaluation, ed. by Nicoletta Calzolari et al., European Language Resources Association, 1063–1067. PDF.