Skip to main content

Extracts data from German Wiktionary dump files.

Project description

wiktionary-de-parser

A Python module to extract data from German Wiktionary XML files (for Python 3.11+).

Features

  • Extracts IPA transcriptions, hyphenation, language, part of speech information (basic), genus and flexion tables of a word.
  • Yields per entry, not per page (a page can have multiple entries/ words can have different meanings)

Installation

pip install wiktionary-de-parser

Or with Poetry:

poetry add wiktionary-de-parser

Usage

Loading the XML dump file

from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump

# To download the dump file, specify the directory where the
# dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")

# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to
# the directory specified in `dump_dir_path`.
dump.download_dump()

# Alternatively you can specify a different dump file to download.
dump = WiktionaryDump(
    dump_dir_path="directory-of-dump-file",
    dump_download_url="url-to-dump-file.xml.bz2",
)
dump.download_dump()

# If you already have the dump file locally, specify the path to the file.
dump = WiktionaryDump(dump_file_path="path-to-dump-file.xml.bz2")
dump.download_dump()

Parsing the dump file

from pprint import pprint
from wiktionary_de_parser import WiktionaryParser

# ... (see above)

parser = WiktionaryParser()

for page in dump.pages():
    # Skip redirects
    if page.redirect_to:
        continue

    if page.name == "Abend":
        # Parse all entries for "Abend"
        for entry in parser.entries_from_page(page):
            results = parser.parse_entry(entry)
            pprint(results)
        break

Output

All page entries for "Abend":

ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", reference_type=<ReferenceType.NONE: 'none'>),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", reference_type=<ReferenceType.NONE: 'none'>),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", reference_type=<ReferenceType.NONE: 'none'>),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
)

Development

This project uses Poetry.

  1. Install Poetry.
  2. Clone this repository
  3. Run poetry install inside of the project folder to install dependencies.
  4. There is a notebook.ipynb to test the parser.
  5. Run poetry run pytest to run tests.

License

MIT © Gregor Weichbrodt

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wiktionary_de_parser-0.14.2.tar.gz (36.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

wiktionary_de_parser-0.14.2-py3-none-any.whl (39.7 kB view details)

Uploaded Python 3

File details

Details for the file wiktionary_de_parser-0.14.2.tar.gz.

File metadata

  • Download URL: wiktionary_de_parser-0.14.2.tar.gz
  • Upload date:
  • Size: 36.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.16 {"installer":{"name":"uv","version":"0.9.16","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for wiktionary_de_parser-0.14.2.tar.gz
Algorithm Hash digest
SHA256 eca6d411c36e4f7f2932de48b447fe8adbe5fa20570715e463ce50f55274b53d
MD5 41c1cd185b9d45f9417fb7c6bbc79018
BLAKE2b-256 31d544a4e6ab95f7681a3ef86aa793da0102443d3e3a24a44aa75cb602856739

See more details on using hashes here.

File details

Details for the file wiktionary_de_parser-0.14.2-py3-none-any.whl.

File metadata

  • Download URL: wiktionary_de_parser-0.14.2-py3-none-any.whl
  • Upload date:
  • Size: 39.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.16 {"installer":{"name":"uv","version":"0.9.16","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for wiktionary_de_parser-0.14.2-py3-none-any.whl
Algorithm Hash digest
SHA256 dedc6ee8d4041c565170c7c74ab8217d2d302fb255331a62509efeb6c82816e6
MD5 6f0cd982f509c0d990e489c9812fc162
BLAKE2b-256 dd03ac7ba08453dd3e86f106cb3ee97478a2babaa557db13e600f3eb29c98483

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page