Extracts data from German Wiktionary dump files.
wiktionary-de-parser
A Python module to extract data from German Wiktionary XML files (for Python 3.11+).
Features
- Extracts a word's IPA transcriptions, hyphenation, language, basic part-of-speech information, grammatical gender (Genus), and inflection (Flexion) tables.
- Yields one result per entry rather than per page: a page can contain multiple entries, because a word can have several meanings.
Installation
pip install wiktionary-de-parser
Or with Poetry:
poetry add wiktionary-de-parser
Usage
Loading the XML dump file
from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump
# To download the dump file, specify the directory where the
# dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")
# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to
# the directory specified in `dump_dir_path`.
dump.download_dump()
# Alternatively you can specify a different dump file to download.
dump = WiktionaryDump(
    dump_dir_path="directory-of-dump-file",
    dump_download_url="url-to-dump-file.xml.bz2",
)
dump.download_dump()
# If you already have the dump file locally, specify the path to the file.
dump = WiktionaryDump(dump_file_path="path-to-dump-file.xml.bz2")
dump.download_dump()
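For readers curious about what reading such a dump involves, here is a minimal sketch that streams pages from a bz2-compressed MediaWiki XML export using only the standard library. It is not how `WiktionaryDump` is implemented; `iter_pages` and `local_name` are hypothetical helper names, and the tiny in-memory sample stands in for a real dump file.

```python
import bz2
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def local_name(tag: str) -> str:
    """Strip an XML namespace, e.g. '{http://...}page' becomes 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (title, wikitext) pairs from a bz2-compressed MediaWiki export.

    Hypothetical helper for illustration only, not part of wiktionary-de-parser.
    """
    with bz2.open(stream, "rb") as xml_file:
        # iterparse reads incrementally, so the whole dump never sits in memory.
        for _event, elem in ET.iterparse(xml_file, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            title, text = "", ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text or ""
                elif name == "text":
                    text = child.text or ""
            yield title, text
            elem.clear()  # release the children of the page just processed


# A tiny stand-in for a real dump, compressed in memory.
sample_xml = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Abend</title>
    <revision><text>== Abend ==</text></revision>
  </page>
  <page>
    <title>Morgen</title>
    <revision><text>== Morgen ==</text></revision>
  </page>
</mediawiki>"""

compressed = io.BytesIO(bz2.compress(sample_xml))
pages = list(iter_pages(compressed))
```

Streaming matters here because the uncompressed dump is large; processing one page at a time keeps memory use roughly constant.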
Parsing the dump file
from pprint import pprint
from wiktionary_de_parser import WiktionaryParser
# ... (see above)
parser = WiktionaryParser()
for page in dump.pages():
    # Skip redirects
    if page.redirect_to:
        continue

    if page.name == "Abend":
        # Parse all entries for "Abend"
        for entry in parser.entries_from_page(page):
            results = parser.parse_entry(entry)
            pprint(results)
        break
Output
All page entries for "Abend":
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
)
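The three entries above differ mainly in `pos` and `flexion`, so a common next step is to filter on those fields. The sketch below shows one way to do that. It uses a hypothetical stand-in dataclass (`EntryStandIn`) that mirrors only a few of the fields shown in the output, so it runs without a dump file; it assumes the real result objects expose the same fields as attributes, which the output suggests but does not guarantee. The helper names `is_common_noun` and `plural_of` are likewise made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EntryStandIn:
    """Hypothetical stand-in mirroring a few fields of the output above."""

    name: str
    flexion: Optional[Dict[str, str]]
    pos: Dict[str, List[str]]


def is_common_noun(entry: EntryStandIn) -> bool:
    """True for a 'Substantiv' entry without sub-categories such as 'Nachname'."""
    return entry.pos.get("Substantiv") == []


def plural_of(entry: EntryStandIn) -> Optional[str]:
    """Return the nominative plural, or None if the entry has no flexion table."""
    if not entry.flexion:
        return None
    return entry.flexion.get("Nominativ Plural")


# Abbreviated copies of the three "Abend" entries shown above.
entries = [
    EntryStandIn(
        name="Abend",
        flexion={
            "Genus": "m",
            "Nominativ Singular": "Abend",
            "Nominativ Plural": "Abende",
        },
        pos={"Substantiv": []},
    ),
    EntryStandIn(name="Abend", flexion=None, pos={"Substantiv": ["Nachname"]}),
    EntryStandIn(name="Abend", flexion=None, pos={"Substantiv": ["Toponym"]}),
]

common_nouns = [entry for entry in entries if is_common_noun(entry)]
plurals = [plural_of(entry) for entry in entries]
```

Checking `flexion` for `None` before reading from it matters because, as the output shows, surname and toponym entries carry no flexion table.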
Development
This project uses Poetry.
- Install Poetry.
- Clone this repository.
- Run `poetry install` inside the project folder to install dependencies.
- There is a `notebook.ipynb` to test the parser.
- Run `poetry run pytest` to run tests.
License
MIT © Gregor Weichbrodt