Extracts data from German Wiktionary dump files.
wiktionary-de-parser
A Python module to extract data from German Wiktionary XML files (for Python 3.11+).
Features
- Extracts inflection (flexion) tables, IPA transcriptions, language, grammatical gender (genus), lemma, basic part-of-speech information, and syllables of a word.
- Yields one result per entry, not per page: a page can contain multiple entries, since a word can have several meanings.
Installation
pip install wiktionary-de-parser
Or with Poetry:
poetry add wiktionary-de-parser
Usage
The following example downloads the latest German Wiktionary dump file and parses all German entries.
from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump
# Specify the directory where the dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")
# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to
# the directory specified in `dump_dir_path`.
dump.download_dump()
# Alternatively you can also specify a different dump file to download.
dump = WiktionaryDump(
    dump_dir_path="directory-of-dump-file",
    dump_download_url="url-to-dump-file.xml.bz2",
)
dump.download_dump()
# If you already have the dump file, you can also specify the path to the file.
dump = WiktionaryDump(dump_file_path="path-to-dump-file.xml.bz2")
dump.download_dump()
# Next, we can parse the dump file.
parser = WiktionaryParser()
for page in dump.pages():
    # Skip redirects
    if page.redirect_to:
        continue
    for entry in parser.entries_from_page(page):
        parsed = parser.parse_entry(entry)
        # Ignore non-German entries
        if parsed.language.lang_code != "de":
            continue
        # Do something with "parsed"
        ...
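As one possible way to fill in the "do something" step, the sketch below builds a singular-to-plural lookup from parsed noun entries. It relies only on the `name`, `flexion` and `language.lang_code` attributes shown in this README. The helper name `collect_plurals` is hypothetical, and `SimpleNamespace` objects stand in for real parsed entries so that the snippet runs without a dump file.

```python
from types import SimpleNamespace


def collect_plurals(entries):
    """Map each German noun's name to its nominative plural, if it has one."""
    plurals = {}
    for entry in entries:
        # Same language filter as in the usage example.
        if entry.language.lang_code != "de":
            continue
        # `flexion` is None for entries without an inflection table.
        if not entry.flexion:
            continue
        plural = entry.flexion.get("Nominativ Plural")
        if plural:
            # Keep the first plural seen for a given name.
            plurals.setdefault(entry.name, plural)
    return plurals


# Stand-in objects (assumption): they mimic only the attributes used above.
sample_entries = [
    SimpleNamespace(
        name="Abend",
        language=SimpleNamespace(lang_code="de"),
        flexion={"Nominativ Singular": "Abend", "Nominativ Plural": "Abende"},
    ),
    SimpleNamespace(
        name="Abend",
        language=SimpleNamespace(lang_code="de"),
        flexion=None,
    ),
    SimpleNamespace(
        name="evening",
        language=SimpleNamespace(lang_code="en"),
        flexion=None,
    ),
]

print(collect_plurals(sample_entries))
```

In a real run, the same function could be fed the `parsed` objects produced inside the loop, for example by appending them to a list first.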
Output
All entries for "Abend":
ParsedWiktionaryPageEntry(
    name="Abend",
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
    syllables=["Abend"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
    syllables=["Abend"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
    syllables=["Abend"],
)
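Because one page can yield several entries, it can help to group results by page name. The following is a minimal sketch, assuming entries expose `name` and `pos` as in the output above. `group_pos_by_name` is a hypothetical helper, and the sample objects are stand-ins rather than real parser output.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_pos_by_name(entries):
    """Collect the part-of-speech dicts of all entries sharing a page name."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.name].append(entry.pos)
    return dict(grouped)


# Stand-ins (assumption) mirroring `name` and `pos` of the three "Abend" entries.
abend_entries = [
    SimpleNamespace(name="Abend", pos={"Substantiv": []}),
    SimpleNamespace(name="Abend", pos={"Substantiv": ["Nachname"]}),
    SimpleNamespace(name="Abend", pos={"Substantiv": ["Toponym"]}),
]

for name, pos_list in group_pos_by_name(abend_entries).items():
    print(name, pos_list)
```

Grouping this way keeps the common noun, the surname, and the toponym readings of "Abend" together without merging them.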
Development
This project uses Poetry.
- Install Poetry.
- Clone this repository.
- Run `poetry install` inside the project folder to install dependencies.
- There is a `notebook.ipynb` to test the parser.
- Run `poetry run pytest` to run tests.
License
MIT © Gregor Weichbrodt