SFST/SMOR/DWDS-based German morphology

DWDSmor – German Morphology

DWDSmor implements the lemmatisation and morphosyntactic analysis of word forms as well as the generation of paradigms of lexical words in written German. DWDSmor’s finite state transducers (the DWDSmor automata) map word forms to specifications of corresponding lexical words and morphosyntactic categories. By traversing such transducers

  1. a given word form can be analysed and lemmatised, or
  2. a lexical word together with a set of morphosyntactic tags will generate corresponding inflected word forms.
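
The two traversal directions can be illustrated with a toy relation. The following is a minimal sketch, not DWDSmor's implementation: it stores a handful of (word form, analysis) pairs, copied from the command-line examples in the Usage section, in a plain Python list and looks them up in either direction. The names `PAIRS`, `analyse` and `generate` are hypothetical. A real transducer encodes such a relation as a finite-state network instead of enumerating it.

```python
# Toy stand-in for a transducer: an explicit relation between word forms
# and analysis strings (pairs copied from the CLI examples in this README).
PAIRS = [
    ("gebildet", "bild<~>en<+V><Part><Perf><haben>"),
    ("gebildet", "ge<~>bild<~>et<+ADJ><Pos><Pred/Adv>"),
    ("bilde", "bild<~>en<+V><1><Sg><Pres><Ind>"),
    ("bilde", "bild<~>en<+V><1><Sg><Pres><Subj>"),
    ("bildete", "bild<~>en<+V><1><Sg><Past><Ind>"),
]


def analyse(word_form):
    """Direction 1: map a word form to all of its analyses."""
    return [analysis for form, analysis in PAIRS if form == word_form]


def generate(analysis):
    """Direction 2: map an analysis to all matching word forms."""
    return [form for form, candidate in PAIRS if candidate == analysis]


print(analyse("gebildet"))
print(generate("bild<~>en<+V><1><Sg><Past><Ind>"))
```

Both functions return lists because the relation is many-to-many: one word form may have several analyses, and one analysis may be realised by several word forms.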

The DWDSmor automata are compiled and traversed via SFST, a C++ library and toolbox for finite-state transducers (FSTs). In addition, a DWDSmor Python library is provided, using the SFST Python bindings.

The coverage of the DWDSmor automata of the German language depends on

  1. a DWDSmor lexicon, which declares lexical entries with lemmas, stem forms, word classes, inflection classes, etc., and
  2. the DWDSmor grammar, which defines lemmatisation, inflection, and word-formation rules for written German.

While the DWDSmor grammar for word-formation is still a work in progress, its inflection grammar is largely comprehensive. The inflection grammar as well as the lexicon format are based on (heavily modified) code from SMORLemma, which in turn is derived from the Stuttgart Morphology (SMOR).

As a rule, the entries in a DWDSmor lexicon are extracted from a source lexicon comprising a set of XML files in the format of the DWDS dictionary.

From a DWDSmor lexicon and the DWDSmor grammar, a DWDSmor edition with several automata types can be compiled:

  • lemma: automaton with inflection and word-formation components, for lemmatisation and morphosyntactic analysis of word forms in terms of grammatical categories.
  • lemma2: variant of lemma, for the generation of morphologically segmented word forms.
  • finite: variant of lemma with a finite word-formation component, for testing purposes.
  • root: automaton with inflection and word-formation components, for lexical analysis of word forms in terms of root lemmas (i.e., lemmas of ultimate word-formation bases), word-formation process, word-formation means, and grammatical categories in terms of the Pattern-and-Restriction Theory of word formation (Nolda 2022).
  • root2: variant of root, for the generation of morphologically segmented word forms.
  • index: automaton with an inflection component only with DWDS homographic lemma indices, for paradigm generation.

Automata are built in two formats: in standard format (with file extension .a) for generation and in compact format (with file extension .ca) for analysis.
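
As a minimal sketch of how the two formats relate to their purposes, the helper below picks the file extension for a given task. Only the extensions `.a` and `.ca` come from the description above; the function name `automaton_path`, the directory argument and the assumption that a file is named after its automaton type (e.g. `lemma.ca`) are hypothetical.

```python
from pathlib import Path

# Extensions as stated above: standard format for generation,
# compact format for analysis.
EXTENSIONS = {"generation": ".a", "analysis": ".ca"}


def automaton_path(directory, automaton_type, purpose):
    """Build a path such as <directory>/lemma.ca (file naming is an assumption)."""
    if purpose not in EXTENSIONS:
        raise ValueError(f"unknown purpose: {purpose!r}")
    return Path(directory) / f"{automaton_type}{EXTENSIONS[purpose]}"


print(automaton_path("automata", "lemma", "analysis"))
print(automaton_path("automata", "index", "generation"))
```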

DWDSmor is released in two editions:

  1. the Open Edition, based on a sample selection of DWDS lemmas and their grammatical specifications, and
  2. the DWDS Edition, derived from the complete lexical dataset of the DWDS dictionary.

The DWDS Edition is only available upon request for research purposes while the Open Edition is released freely for general use and experiments.

The coverage of DWDSmor is benchmarked regularly against the German Universal Dependencies HDT treebank. In the DWDS Edition, the coverage ratios typically range from 95 % to 100 % for most word classes; notable exceptions include foreign-language words and named entities, which are barely part of the underlying DWDS dictionary and thus poorly covered by DWDSmor. In the Open Edition, the coverage ratios of open word classes are lower, due to the limited size of the sample source lexicon.
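
A coverage ratio per word class can be computed along the following lines. This is a sketch under assumptions, not the project's benchmark code: it assumes that coverage is the share of treebank tokens of a word class for which an automaton yields a matching analysis, and the function name `coverage_by_pos` and the sample tokens are invented for illustration only.

```python
from collections import Counter


def coverage_by_pos(tokens):
    """Return {pos: covered / total} for an iterable of (pos, covered) pairs."""
    total = Counter()
    covered = Counter()
    for pos, is_covered in tokens:
        total[pos] += 1
        if is_covered:
            covered[pos] += 1
    return {pos: covered[pos] / total[pos] for pos in total}


# Invented toy data, not benchmark results.
sample = [
    ("NOUN", True),
    ("NOUN", True),
    ("NOUN", False),
    ("VERB", True),
    ("PROPN", False),
]
print(coverage_by_pos(sample))
```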

Usage

The DWDSmor Open Edition is available via the Python Package Index (PyPI):

pip install dwdsmor

The library can be used for lemmatisation:

>>> import dwdsmor
>>> lemmatizer = dwdsmor.lemmatizer()
>>> assert lemmatizer("getestet", pos={"V"}).analysis == "testen"
>>> assert lemmatizer("getestet", pos={"ADJ"}).analysis == "getestet"

There is also integration with spaCy:

pip install spacy de_zdl_lg --extra-index-url https://gitup.uni-potsdam.de/api/v4/projects/21461/packages/pypi/simple
>>> import spacy
>>> import dwdsmor.spacy
>>> nlp = spacy.load("de_zdl_lg")
>>> nlp.add_pipe("dwdsmor")
<dwdsmor.spacy.Component object at 0x7f99e634f220>
>>> tuple(t._.dwdsmor.analysis for t in nlp("Man sah neben diversen ICEs auch viele schöne Altbauten."))
('man', 'sehen', 'neben', 'divers', 'ICE', 'auch', 'viel', 'schön', 'Altbau', '.')

In addition to the Python API, the package provides a simple command-line interface named dwdsmor. To analyze a word form, pass it as an argument:

$ dwdsmor gebildet
Wordform  	Lemma   	Analysis                           	POS  	Degree  	Function  	Nonfinite  	Tense  	Auxiliary
gebildet  	bilden  	bild<~>en<+V><Part><Perf><haben>   	V   	        	          	Part       	Perf   	haben
gebildet  	gebildet	ge<~>bild<~>et<+ADJ><Pos><Pred/Adv>	ADJ 	Pos     	Pred/Adv
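
The strings in the Analysis column can be taken apart with a few lines of Python. The sketch below is not part of the DWDSmor API; the function name `parse_analysis` is hypothetical, and it assumes that the only markup before the `<+POS>` tag is the segment boundary `<~>`, as in the two rows above.

```python
import re

TAG = re.compile(r"<([^<>]*)>")


def parse_analysis(analysis):
    """Split an analysis string into segments, POS and remaining tags."""
    head, sep, tail = analysis.partition("<+")
    if not sep:
        raise ValueError(f"no POS tag in {analysis!r}")
    segments = head.split("<~>")
    # Restore the "<" consumed by partition, then collect all tag names.
    tags = TAG.findall("<" + tail)
    return {
        "lemma": "".join(segments),
        "segments": segments,
        "pos": tags[0],
        "tags": tags[1:],
    }


print(parse_analysis("bild<~>en<+V><Part><Perf><haben>"))
print(parse_analysis("ge<~>bild<~>et<+ADJ><Pos><Pred/Adv>"))
```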

To generate all word forms for a lexical word, pass it (or a form which can be analyzed as the lexical word) as an argument together with the option -g:

$ dwdsmor -g gebildet
[…]
Wordform  	Lemma   	Analysis                                             	POS  	Degree  	Function  	  Person	Gender  	Case  	Number  	Nonfinite  	Tense  	Mood  	Auxiliary  	Inflection
[…]
gebildete 	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><St>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Acc   	Sg      	           	       	      	           	St
gebildete 	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><Wk>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Acc   	Sg      	           	       	      	           	Wk
gebildeter	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><St>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Dat   	Sg      	           	       	      	           	St
gebildeten	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><Wk>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Dat   	Sg      	           	       	      	           	Wk
gebildeter	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><St>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Gen   	Sg      	           	       	      	           	St
gebildeten	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><Wk>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Gen   	Sg      	           	       	      	           	Wk
gebildete 	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Nom><Sg><St>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Nom   	Sg      	           	       	      	           	St
gebildete 	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Nom><Sg><Wk>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Nom   	Sg      	           	       	      	           	Wk
gebildeten	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Masc><Acc><Sg><St>   	ADJ 	Pos     	Attr/Subst	        	Masc    	Acc   	Sg      	           	       	      	           	St
[…]
bildeten  	bilden  	bild<~>en<+V><1><Pl><Past><Ind>                      	V   	        	          	       1	        	      	Pl      	           	Past   	Ind
bildeten  	bilden  	bild<~>en<+V><1><Pl><Past><Subj>                     	V   	        	          	       1	        	      	Pl      	           	Past   	Subj
bilden    	bilden  	bild<~>en<+V><1><Pl><Pres><Ind>                      	V   	        	          	       1	        	      	Pl      	           	Pres   	Ind
bilden    	bilden  	bild<~>en<+V><1><Pl><Pres><Subj>                     	V   	        	          	       1	        	      	Pl      	           	Pres   	Subj
bildete   	bilden  	bild<~>en<+V><1><Sg><Past><Ind>                      	V   	        	          	       1	        	      	Sg      	           	Past   	Ind
bildete   	bilden  	bild<~>en<+V><1><Sg><Past><Subj>                     	V   	        	          	       1	        	      	Sg      	           	Past   	Subj
bilde     	bilden  	bild<~>en<+V><1><Sg><Pres><Ind>                      	V   	        	          	       1	        	      	Sg      	           	Pres   	Ind
bilde     	bilden  	bild<~>en<+V><1><Sg><Pres><Subj>                     	V   	        	          	       1	        	      	Sg      	           	Pres   	Subj
bildetet  	bilden  	bild<~>en<+V><2><Pl><Past><Ind>                      	V   	        	          	       2	        	      	Pl      	           	Past   	Ind
[…]
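
For further processing, such output can be read into Python dictionaries. The following sketch assumes that the columns are tab-separated and padded with spaces, as they appear above, and that trailing empty columns may be missing from a row. The function name `parse_rows` is hypothetical, and the sample rows are copied from the output above with the padding shortened.

```python
# Sample rows copied from the output above; padding spaces are shortened,
# the tab separators are kept.
OUTPUT = "\n".join([
    "Wordform\tLemma\tAnalysis\tPOS\tDegree\tFunction\t  Person\tGender\tCase"
    "\tNumber\tNonfinite\tTense\tMood\tAuxiliary\tInflection",
    "[…]",
    "gebildete \tgebildet\tgebildet<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><St>"
    "\tADJ \tPos\tAttr/Subst\t\tFem\tAcc\tSg\t\t\t\t\tSt",
    "bilde \tbilden\tbild<~>en<+V><1><Sg><Pres><Ind>"
    "\tV \t\t\t1\t\t\tSg\t\tPres\tInd",
])


def parse_rows(output):
    """Turn tab-separated CLI output into a list of dicts keyed by column."""
    lines = [
        line for line in output.splitlines()
        if line.strip() and line.strip() != "[…]"
    ]
    header = [cell.strip() for cell in lines[0].split("\t")]
    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split("\t")]
        # Trailing empty columns may be missing; pad them.
        cells += [""] * (len(header) - len(cells))
        rows.append(dict(zip(header, cells)))
    return rows


rows = parse_rows(OUTPUT)
for row in rows:
    print(row["Wordform"], row["POS"], row["Number"])
```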

More sophisticated tools for analysis and paradigm generation with the DWDSmor Python library are provided by the Python commands dwdsmor-analysis and dwdsmor-paradigm:

$ echo gebildet | dwdsmor-analysis
Wordform  Lemma     POS  Auxiliary  Degree  Nonfinite  Function  Tense
gebildet  bilden    V    haben              Part                 Perf
gebildet  gebildet  ADJ             Pos                Pred/Adv
$ dwdsmor-paradigm gebildet
Lemma     POS  Degree  Gender   Case  Number  Inflection  Function    Paradigm Forms
gebildet  ADJ  Pos                                        Pred/Adv    gebildet
gebildet  ADJ  Pos     Masc     Nom   Sg      St          Attr/Subst  gebildeter
gebildet  ADJ  Pos     Masc     Nom   Sg      Wk          Attr/Subst  gebildete
[…]
gebildet  ADJ  Pos     Neut     Nom   Sg      St          Attr/Subst  gebildetes
gebildet  ADJ  Pos     Neut     Nom   Sg      Wk          Attr/Subst  gebildete
[…]
gebildet  ADJ  Pos     Fem      Nom   Sg      St          Attr/Subst  gebildete
gebildet  ADJ  Pos     Fem      Nom   Sg      Wk          Attr/Subst  gebildete
[…]
gebildet  ADJ  Pos     UnmGend  Nom   Pl      St          Attr/Subst  gebildete
gebildet  ADJ  Pos     UnmGend  Nom   Pl      Wk          Attr/Subst  gebildeten
[…]

For more options, cf. the output of dwdsmor-analysis -h and dwdsmor-paradigm -h.

Installing the DWDS Edition

If you have been granted access to the DWDS Edition, add your private access token to $HOME/.netrc:

machine gitup.uni-potsdam.de login gitlab-ci-token password glpat-…
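
Whether a netrc file contains an entry for the host can be checked with Python's standard `netrc` module before running pip. This is a small sketch; the function name `has_credentials` is hypothetical, and the demonstration writes a temporary file with a placeholder instead of a real token.

```python
import netrc
import os
import tempfile

HOST = "gitup.uni-potsdam.de"


def has_credentials(netrc_path, host=HOST):
    """Return True if netrc_path holds a login and a password for host."""
    try:
        entry = netrc.netrc(netrc_path).authenticators(host)
    except (FileNotFoundError, netrc.NetrcParseError):
        return False
    # entry is a (login, account, password) tuple or None.
    return entry is not None and bool(entry[0]) and bool(entry[2])


# Demonstration with a temporary file and a placeholder token.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "netrc")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(f"machine {HOST} login gitlab-ci-token password PLACEHOLDER\n")
    print(has_credentials(path))
```

To check the real file, call `has_credentials(os.path.expanduser("~/.netrc"))`.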

Then install the edition:

pip install dwdsmor-dwds --index-url https://gitup.uni-potsdam.de/api/v4/projects/21585/packages/pypi/simple

Development

DWDSmor is in active development. In its current stage, it supports all major inflection classes and some productive word-formation patterns of written German.

Prerequisites

  • GNU/Linux: Development, builds and tests of DWDSmor are performed on Debian GNU/Linux. While other UNIX-like operating systems such as macOS should work, too, they are not actively supported.
  • Python ≥ v3.12: DWDSmor provides a Python interface for building DWDSmor lexica from source lexica in the DWDS XML format and for compiling DWDSmor automata from the resulting DWDSmor lexica and the DWDSmor grammar.
  • Saxon-HE: The entries in DWDSmor lexica are extracted from source lexica in the DWDS XML format by means of XSLT 2 stylesheets, using Saxon-HE as an XSLT processor. Saxon requires a Java runtime environment.
  • SFST: The DWDSmor automata are compiled using the SFST C++ library and toolbox for finite-state transducers (FSTs).

On a Debian-based distribution, the following command installs the required software:

sudo apt install python3 default-jdk libsaxonhe-java sfst
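
A quick way to verify the prerequisites is to look up the relevant executables on the PATH. The sketch below is not part of DWDSmor; the function name `check_prerequisites` is hypothetical, and the executable names are assumptions: `java` for the Java runtime required by Saxon and `fst-compiler-utf8` for the SFST compiler. It does not check for the Saxon-HE library itself.

```python
import shutil
import sys

# Executable names are assumptions: "java" for the Java runtime needed by
# Saxon-HE and "fst-compiler-utf8" for the SFST compiler.
REQUIRED_COMMANDS = ("java", "fst-compiler-utf8")


def check_prerequisites(commands=REQUIRED_COMMANDS, min_python=(3, 12)):
    """Return a dict mapping each prerequisite to True (found) or False."""
    report = {"python": sys.version_info >= min_python}
    for command in commands:
        report[command] = shutil.which(command) is not None
    return report


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```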

Project setup

Optionally, set up a Python virtual environment for project builds, e.g. via Python’s venv:

python3 -m venv .venv
source .venv/bin/activate

Then install DWDSmor, including development dependencies:

pip install -U pip setuptools && pip install -r requirements.dev.txt

Building the DWDS Edition

Install additional dependencies:

pip install -U pip setuptools && pip install -r requirements.dwds.txt

Download the lexicon:

GITUP_PRIVATE_TOKEN="…"  python -m dwdsmor.build.dwdswb

Building lexica and automata

Building different editions and automata is facilitated via the dwdsmor.build module. To build the default Open Edition, simply run:

python -m dwdsmor.build

For more build options, run:

python -m dwdsmor.build -h

Testing

In order to test for basic automata functionality and potential regressions, run

pytest

License

Like the original SMOR and SMORLemma grammars, the DWDSmor grammar and the DWDSmor Python library are licensed under the GNU General Public License v2.0. The same applies to the sample source lexicon and the automata of the Open Edition.

For the DWDS Edition based on the complete DWDS dictionary, all rights are reserved and individual license terms apply. If you are interested in the automata of the DWDS Edition, please contact us.

Contact

Feel free to contact Andreas Nolda with any questions about this project.

Credits

DWDSmor is based on the following software and datasets:

  1. SFST, a C++ library and toolbox for finite-state transducers (FSTs) (Schmid 2006).
  2. SMORLemma (Sennrich and Kunz 2014), a modified version of the Stuttgart Morphology (SMOR) (Schmid, Fitschen, and Heid 2004) with an alternative lemmatisation component.
  3. The DWDS dictionary (BBAW n.d.), which replaces IMSLex (Fitschen 2004) as the lexical data source for German words, their grammatical categories, and their morphosyntactic properties.

References

  • Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.). DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur deutschen Sprache in Geschichte und Gegenwart. Online
  • Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes System. Ph.D. thesis, Universität Stuttgart. PDF
  • Nolda, Andreas (2022). Headedness as an epiphenomenon: Case studies on compounding and blending in German. In Headedness and/or Grammatical Anarchy?, ed. by Ulrike Freywald, Horst Simon, and Stefan Müller, Empirically Oriented Theoretical Morphology and Syntax 11, Berlin: Language Science Press, 343–376. PDF.
  • Schmid, Helmut (2006). A programming language for finite state transducers. In Finite-State Methods and Natural Language Processing: 5th International Workshop, FSMNLP 2005, Helsinki, Finland, September 1–2, 2005, ed. by Anssi Yli-Jyrä, Lauri Karttunen, and Juhani Karhumäki, Lecture Notes in Artificial Intelligence 4002, Berlin: Springer, 308–309. PDF.
  • Schmid, Helmut, Arne Fitschen, and Ulrich Heid (2004). SMOR: A German computational morphology covering derivation, composition, and inflection. In LREC 2004: Fourth International Conference on Language Resources and Evaluation, ed. by Maria T. Lino et al., European Language Resources Association, 1263–1266. PDF
  • Sennrich, Rico and Beat Kunz (2014). Zmorge: A German morphological lexicon extracted from Wiktionary. In LREC 2014: Ninth International Conference on Language Resources and Evaluation, ed. by Nicoletta Calzolari et al., European Language Resources Association, 1063–1067. PDF.
