Extracts data from German Wiktionary dump files.
wiktionary-de-parser
A Python module to extract data from German Wiktionary XML files (for Python 3.11+).
Features
- Extracts flexion tables, IPA transcriptions, language, genus, lemma, basic part-of-speech information, rhymes and syllables of a word.
- Yields one result per entry, not per page: a page can contain multiple entries, because a word can have several meanings.
Installation
pip install wiktionary-de-parser
Or with Poetry:
poetry add wiktionary-de-parser
Usage
The following example downloads the latest Wiktionary dump file and parses all German entries.
from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump

# Specify the directory where the dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")

# Download the latest Wiktionary XML dump file.
# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to the directory specified in `dump_dir_path`.
dump.download_latest_dump()

parser = WiktionaryParser()

for page in dump.pages():
    # Skip redirects
    if page.redirect_to:
        continue

    for entry in parser.entries_from_page(page):
        parsed = parser.parse_entry(entry)

        # Ignore non-German entries
        if parsed.language.lang_code != "de":
            continue

        # Do something with "parsed"
        ...
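As one idea for the "do something" step, the sketch below collects the genus of each German headword. It does not import the library: `Language` and `Entry` are hypothetical stand-ins that mirror only the `name`, `language` and `flexion` fields shown in the output section, and the sample data (including the English entry) is made up for illustration. In a real run you would pass the objects returned by `parser.parse_entry(entry)` instead.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


# Hypothetical stand-ins mirroring only the fields used below.
# The real objects come from parser.parse_entry(entry).
@dataclass
class Language:
    lang: str
    lang_code: str


@dataclass
class Entry:
    name: str
    language: Language
    flexion: Optional[dict] = None


def collect_genus(entries: Iterable[Entry]) -> dict[str, set[str]]:
    """Map each German headword to the genus values found in its flexion tables."""
    result: dict[str, set[str]] = {}
    for parsed in entries:
        # Ignore non-German entries, as in the loop above.
        if parsed.language.lang_code != "de":
            continue
        # flexion can be None when an entry has no inflection table.
        genus = (parsed.flexion or {}).get("Genus")
        if genus:
            result.setdefault(parsed.name, set()).add(genus)
    return result


# Made-up sample data for illustration only.
sample = [
    Entry("Abend", Language("Deutsch", "de"), {"Genus": "m", "Nominativ Plural": "Abende"}),
    Entry("Abend", Language("Deutsch", "de"), None),
    Entry("evening", Language("Englisch", "en"), None),
]

genus_by_word = collect_genus(sample)
print(genus_by_word)  # {'Abend': {'m'}}
```

Using a set per headword keeps the result stable when a page yields several entries for the same word, some of which have no flexion table.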
Output
All entries for "Abend":
ParsedWiktionaryPageEntry(
    name="Abend",
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
    syllables=["Abend"],
)

ParsedWiktionaryPageEntry(
    name="Abend",
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
    syllables=["Abend"],
)

ParsedWiktionaryPageEntry(
    name="Abend",
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
    syllables=["Abend"],
)
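The `flexion` field is a plain dictionary from grammatical label to word form, so it can be inverted to look up which cases a given form may represent. The following is a minimal sketch using the table from the first "Abend" entry above. The helper name `forms_to_cases` is hypothetical and not part of the library.

```python
from collections import defaultdict

# Flexion table copied from the first "Abend" entry shown above.
flexion = {
    "Genus": "m",
    "Nominativ Singular": "Abend",
    "Nominativ Plural": "Abende",
    "Genitiv Singular": "Abends",
    "Genitiv Plural": "Abende",
    "Dativ Singular": "Abend",
    "Dativ Plural": "Abenden",
    "Akkusativ Singular": "Abend",
    "Akkusativ Plural": "Abende",
}


def forms_to_cases(flexion: dict[str, str]) -> dict[str, list[str]]:
    """Invert a flexion table: word form -> list of case/number labels."""
    index: defaultdict[str, list[str]] = defaultdict(list)
    for label, form in flexion.items():
        # "Genus" holds the gender, not an inflected form, so skip it.
        if label == "Genus":
            continue
        index[form].append(label)
    return dict(index)


index = forms_to_cases(flexion)
print(index["Abenden"])  # ['Dativ Plural']
print(index["Abends"])   # ['Genitiv Singular']
```

Entries whose `flexion` is `None`, such as the surname and toponym entries above, would need to be skipped before calling such a helper.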
Development
This project uses Poetry.
- Install Poetry.
- Clone this repository.
- Run `poetry install` inside of the project folder to install dependencies.
- There is a `notebook.ipynb` to test the parser.
- Run `poetry run pytest` to run tests.
License
MIT © Gregor Weichbrodt