Extracts data from German Wiktionary dump files.
wiktionary-de-parser
A Python module to extract data from German Wiktionary XML files (for Python 3.11+).
Features
- Extracts a word's IPA transcriptions, hyphenation, language, basic part-of-speech information, grammatical gender (genus) and inflection (flexion) tables.
- Yields results per entry rather than per page: a single page can contain multiple entries, because a word can have several meanings.
Installation
pip install wiktionary-de-parser
Or with Poetry:
poetry add wiktionary-de-parser
Usage
from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump
# Specify the directory where the dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")
# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to
# the directory specified in `dump_dir_path`.
dump.download_dump()
# Alternatively you can also specify a different dump file to download.
dump = WiktionaryDump(
    dump_dir_path="directory-of-dump-file",
    dump_download_url="url-to-dump-file.xml.bz2",
)
dump.download_dump()
# If you already have the dump file, you can specify the path to the file.
dump = WiktionaryDump(dump_file_path="path-to-dump-file.xml.bz2")
dump.download_dump()
# Next, we can parse the dump file.
parser = WiktionaryParser()
for page in dump.pages():
    # Skip redirects
    if page.redirect_to:
        continue
    for entry in parser.entries_from_page(page):
        parsed = parser.parse_entry(entry)
        # Ignore non-German entries
        if parsed.language.lang_code != "de":
            continue
        # do something with "parsed"
        ...
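As one illustration of the "do something" step, the sketch below flattens a parsed entry into a JSON-serialisable record. It assumes the attribute names shown in this README's sample output (`name`, `language.lang_code`, `lemma.lemma`, `pos`, `ipa`, `hyphenation`) can be read as plain attributes. `entry_to_record` and `to_json_lines` are hypothetical helpers, not part of the library, and the `SimpleNamespace` object only stands in for a real parsed entry so the snippet runs without a dump file.

```python
import json
from types import SimpleNamespace


def entry_to_record(parsed):
    """Flatten one parsed entry into a JSON-serialisable dict (hypothetical helper)."""
    return {
        "name": parsed.name,
        "lang_code": parsed.language.lang_code,
        "lemma": parsed.lemma.lemma,
        # `pos` maps a part of speech to a list of sub-categories; keep only the keys.
        "pos": sorted(parsed.pos or {}),
        # Fall back to empty lists in case a field is missing (None).
        "ipa": list(parsed.ipa or []),
        "hyphenation": list(parsed.hyphenation or []),
    }


def to_json_lines(entries):
    """Serialise entries as JSON Lines, keeping IPA characters readable."""
    return "\n".join(
        json.dumps(entry_to_record(entry), ensure_ascii=False) for entry in entries
    )


# Stand-in for a real parsed entry, mirroring the first "Abend" sample.
example = SimpleNamespace(
    name="Abend",
    language=SimpleNamespace(lang="Deutsch", lang_code="de"),
    lemma=SimpleNamespace(lemma="Abend", inflected=False),
    pos={"Substantiv": []},
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    hyphenation=["Abend"],
)

print(to_json_lines([example]))
```

In the parsing loop, `entry_to_record(parsed)` could replace the `...` placeholder, with each record appended to a list or written to a file.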
Output
All entries for "Abend":
ParsedWiktionaryPageEntry(
    name="Abend",
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
    hyphenation=["Abend"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
    hyphenation=["Abend"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
    hyphenation=["Abend"],
)
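The `flexion` field in the first entry is a flat dict whose keys combine case and number. The following sketch regroups such a dict by number. `split_flexion` is a hypothetical helper, not part of the library, and it assumes keys of the form "<case> <number>" as in the sample above. Keys with any other shape are skipped rather than guessed at.

```python
def split_flexion(flexion):
    """Return (genus, forms), with forms grouped as forms[number][case] (hypothetical helper)."""
    forms = {"Singular": {}, "Plural": {}}
    genus = None
    # Entries without an inflection table have flexion=None.
    if not flexion:
        return genus, forms
    for key, value in flexion.items():
        if key == "Genus":
            genus = value
            continue
        # Split "Dativ Plural" into case "Dativ" and number "Plural".
        case, _, number = key.rpartition(" ")
        if case and number in forms:
            forms[number][case] = value
    return genus, forms


# Flexion table copied from the first "Abend" entry above.
abend_flexion = {
    "Genus": "m",
    "Nominativ Singular": "Abend",
    "Nominativ Plural": "Abende",
    "Genitiv Singular": "Abends",
    "Genitiv Plural": "Abende",
    "Dativ Singular": "Abend",
    "Dativ Plural": "Abenden",
    "Akkusativ Singular": "Abend",
    "Akkusativ Plural": "Abende",
}

genus, forms = split_flexion(abend_flexion)
print(genus, forms["Plural"])
```

Grouping by number first makes it straightforward to list, for example, all plural forms of a noun. Because the second and third sample entries have `flexion=None`, the helper returns empty groups for them instead of raising.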
Development
This project uses Poetry.
- Install Poetry.
- Clone this repository.
- Run `poetry install` inside the project folder to install dependencies.
- There is a `notebook.ipynb` to test the parser.
- Run `poetry run pytest` to run tests.
License
MIT © Gregor Weichbrodt