Extracts data from German Wiktionary dump files.

wiktionary-de-parser

A Python library (requires Python 3.13+) that extracts structured data from German Wiktionary XML dumps: IPA transcriptions, hyphenation, inflection tables, part-of-speech tags, lemma references, rhymes, and meanings.

Features

Streams compressed XML dumps instead of loading them into memory at once.
Yields one structured entry per language and part of speech (a single Wiktionary page often holds several).
Offers an optional multiprocessing mode for faster full-dump runs.

Installation

pip install wiktionary-de-parser

The project uses uv for development, but users of the library can install it with any standard pip/PyPI workflow.

Usage

Locating the dump file

from wiktionary_de_parser import WiktionaryDump

# Either point at an existing local file.
dump = WiktionaryDump(
    dump_file_path="path/to/dewiktionary-latest-pages-articles-multistream.xml.bz2"
)

# Or download into a directory on first call.
dump = WiktionaryDump(dump_dir_path="dumps/")
dump.download_dump()

Parsing entries (serial)

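from wiktionary_de_parser import WiktionaryParser

parser = WiktionaryParser()

# `dump` is the WiktionaryDump instance created in the previous step.
for page in dump.pages():
    # Skip redirects and pages without wikitext.
    if page.redirect_to or not page.wikitext:
        continue
    # A page yields one entry per language and part of speech.
    for entry in parser.entries(page):
        parsed = parser.parse(entry)
        if parsed.page_name == "Abend":
            print(parsed)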
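
The loop above can be wrapped in a small generator so that filtering and parsing live in one place. This is a minimal sketch, not part of the library: iter_parsed_entries is a hypothetical helper name, and it relies only on the pages(), entries() and parse() calls and the redirect_to and wikitext attributes shown above.

```python
from collections.abc import Iterator
from typing import Any


def iter_parsed_entries(dump: Any, parser: Any) -> Iterator[Any]:
    """Yield one parsed entry per language/part-of-speech section.

    Hypothetical helper: `dump` and `parser` are duck-typed, so any objects
    exposing the methods used in the README loop will work.
    """
    for page in dump.pages():
        # Redirects and pages without wikitext have nothing to parse.
        if page.redirect_to or not page.wikitext:
            continue
        for entry in parser.entries(page):
            yield parser.parse(entry)
```

With the objects from the earlier steps, this would be consumed as `for parsed in iter_parsed_entries(dump, WiktionaryParser()): ...`.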

Parsing entries (parallel)

For full-dump runs, use iter_parsed. XML iteration stays in the main process, while parsing is distributed across a worker pool.

for parsed in dump.iter_parsed(workers=15):
    ...  # ParsedEntry instances yielded across all workers

workers defaults to os.cpu_count() - 1. Pass workers=1 to skip multiprocessing entirely (useful when debugging with pdb).
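
Because iter_parsed yields entries one at a time, a full-dump run can be summarised without keeping every ParsedEntry in memory. The following is a sketch under that assumption; count_by_language is a hypothetical helper, and it only reads the language_code attribute shown in the output schema.

```python
from collections import Counter
from collections.abc import Iterable
from typing import Any


def count_by_language(entries: Iterable[Any]) -> Counter[str]:
    """Count entries per language_code in a single pass.

    Hypothetical helper: accepts any iterable of objects that carry a
    `language_code` attribute, such as the stream from dump.iter_parsed().
    """
    counts: Counter[str] = Counter()
    for entry in entries:
        counts[entry.language_code] += 1
    return counts
```

It could be called as `count_by_language(dump.iter_parsed())`, or with workers=1 while debugging.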

Output schema

ParsedEntry(
    page_name="Abend",
    page_id=2742,
    entry_index=0,
    language="Deutsch",
    language_code="de",
    lemma="Abend",
    reference=None,                          # LemmaReference if the page is an inflected/variant form
    pos=[PosTag(pos="Substantiv", subtypes=())],
    inflection={
        "gender": "m",
        "nominative_singular": "Abend",
        "nominative_plural": "Abende",
        "genitive_singular": "Abends",
        "genitive_plural": "Abende",
        "dative_singular": "Abend",
        "dative_plural": "Abenden",
        "accusative_singular": "Abend",
        "accusative_plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    hyphenation=["Abend"],
    rhymes=["aːbn̩t"],
    meanings=[Meaning(text="…", tags=["Astronomie"], raw_tags=[])],
)

All result containers are @dataclass(slots=True). The full schema lives in wiktionary_de_parser/models.py.

Inflection keys

Inflection-table parameter names are translated token by token into lowercase English, joined by underscores: "Nominativ Singular" → "nominative_singular", "Präsens_er, sie, es" → "present_3sg". Unknown tokens are kept verbatim (lowercased).
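
The token-by-token idea can be illustrated as follows. This is a sketch, not the library's implementation: TOKEN_MAP is an illustrative subset invented for this example, translate_key is a hypothetical name, and pronoun groups such as "er, sie, es" (which the library maps to "3sg") are not handled here.

```python
import re

# Illustrative subset only; the library's real token table is not shown here.
TOKEN_MAP = {
    "nominativ": "nominative",
    "genitiv": "genitive",
    "dativ": "dative",
    "akkusativ": "accusative",
    "singular": "singular",
    "plural": "plural",
}


def translate_key(raw: str) -> str:
    """Lowercase the name, split it into tokens, translate the known ones."""
    tokens = [t for t in re.split(r"[\s_]+", raw.strip().lower()) if t]
    # Unknown tokens fall through unchanged (already lowercased).
    return "_".join(TOKEN_MAP.get(t, t) for t in tokens)
```

For instance, translate_key("Nominativ Singular") gives "nominative_singular", and a name with an unmapped token such as "Nominativ Dual" gives "nominative_dual".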

Lemma references

If the entry is an inflected form or alternative spelling, lemma holds the canonical target and reference records the type:

# "gehörte" → "gehören"
parsed.lemma == "gehören"
parsed.reference == LemmaReference(target="gehören", type=ReferenceType.INFLECTED)

# "Geografie" → "Geographie"
parsed.lemma == "Geographie"
parsed.reference == LemmaReference(target="Geographie", type=ReferenceType.VARIANT)

Development

uv sync                 # install dependencies
uv run pytest           # run the test suite
uv run ruff format      # format the code
uv run ruff check       # lint the code

License

MIT © Gregor Weichbrodt

Release history

Version	Release date
0.16.0	May 29, 2026 (this version)
0.15.0	May 27, 2026
0.14.2	May 22, 2026
0.14.1	May 22, 2026
0.14.0	May 21, 2026
0.13.1	Nov 16, 2025
0.13.0	Nov 16, 2025
0.12.15	Nov 16, 2025
0.12.14	Nov 16, 2025
0.12.13	Jan 2, 2025
0.12.12	Jan 1, 2025
0.12.11	Jan 1, 2025
0.12.10	Jan 1, 2025
0.12.9	Dec 31, 2024
0.12.8	Dec 31, 2024
0.12.7	Dec 31, 2024
0.12.6	Dec 31, 2024
0.12.5	Dec 30, 2024
0.12.4	Dec 30, 2024
0.12.3	Dec 30, 2024
0.12.2	Dec 29, 2024
0.12.1	Dec 29, 2024
0.12.0	Jul 29, 2024
0.11.5	Feb 10, 2024
0.11.4	Feb 10, 2024
0.11.3	Feb 10, 2024
0.11.2	Feb 9, 2024
0.11.1	Feb 5, 2024
0.11.0	Feb 4, 2024
0.10.1	Jan 29, 2024
0.10.0	Jan 29, 2024
0.9.5	Jul 26, 2022
0.9.4	Jul 18, 2022
0.9.3	Jul 18, 2022
0.9.2	Jul 17, 2022
0.9.1	Jul 15, 2022
0.9.0	Jul 15, 2022
0.8.9	Nov 13, 2021
0.8.8	Nov 12, 2021
0.8.7	Nov 12, 2021
0.8.6	Nov 12, 2021
0.8.5	Nov 12, 2021
0.8.4	Nov 12, 2021
0.8.3	Nov 12, 2021
0.8.2	Nov 10, 2021
0.8.1	Jul 9, 2020
0.8.0	Dec 1, 2019
0.7.9	Dec 1, 2019
0.7.8	Dec 1, 2019
0.7.7	Jul 16, 2019
0.7.6	Jul 13, 2019
0.7.5	Jul 13, 2019
0.7.4	Jul 13, 2019
0.7.3	May 29, 2019
0.7.2	May 29, 2019
0.7.1	May 27, 2019

Download files

Source Distribution

wiktionary_de_parser-0.16.0.tar.gz (39.3 kB view details)

Uploaded May 29, 2026 Source

Built Distribution

wiktionary_de_parser-0.16.0-py3-none-any.whl (39.5 kB view details)

Uploaded May 29, 2026 Python 3

File details

Details for the file wiktionary_de_parser-0.16.0.tar.gz.

File metadata

Download URL: wiktionary_de_parser-0.16.0.tar.gz
Upload date: May 29, 2026
Size: 39.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.16 (uv publish, Ubuntu 26.04)

File hashes

Hashes for wiktionary_de_parser-0.16.0.tar.gz
Algorithm	Hash digest
SHA256	`477c27d8381e424b88361110b8036ef7ba49b1777adcd74bcdeabb302c4b5589`
MD5	`772ab1f2e70305957405d476619c7129`
BLAKE2b-256	`bdba23defd9a23815fe539dbd3e5fc1b994e6ccae5cb7ef19e4230fccfda89f9`
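
A downloaded file can be checked against the SHA256 digest listed above with the standard library alone. This is a minimal sketch: sha256_of and matches_sha256 are hypothetical helper names, and the file path passed in is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the call would look like `matches_sha256(Path("wiktionary_de_parser-0.16.0.tar.gz"), "477c27d8381e424b88361110b8036ef7ba49b1777adcd74bcdeabb302c4b5589")`. pip can also enforce hashes itself through its hash-checking mode.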

File details

Details for the file wiktionary_de_parser-0.16.0-py3-none-any.whl.

File metadata

Download URL: wiktionary_de_parser-0.16.0-py3-none-any.whl
Upload date: May 29, 2026
Size: 39.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.16 (uv publish, Ubuntu 26.04)

File hashes

Hashes for wiktionary_de_parser-0.16.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`eb1abe33e49f57fc2af9dbbc353fc198920bb4d9c1dcaea1fb527fa766d18997`
MD5	`07156ab8a2da5ba22aaebbb7c7d177c9`
BLAKE2b-256	`06b131224449b0ef9cf173a2e9514c660fa4ff9ef52f2960e867f835791156b4`
