Extracts data from German Wiktionary dump files.
wiktionary-de-parser
A Python module to extract data from German Wiktionary XML files (for Python 3.11+).
Features
- Extracts IPA transcriptions, hyphenation, language, basic part-of-speech information, grammatical gender (genus), and inflection (flexion) tables for each word.
- Yields results per entry, not per page: a page can contain multiple entries because a word can have several meanings.
Installation
pip install wiktionary-de-parser
Or with Poetry:
poetry add wiktionary-de-parser
Usage
Loading the XML dump file
from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump
# To download the dump file, specify the directory where the
# dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")
# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to
# the directory specified in `dump_dir_path`.
dump.download_dump()
# Alternatively you can specify a different dump file to download.
dump = WiktionaryDump(
    dump_dir_path="directory-of-dump-file",
    dump_download_url="url-to-dump-file.xml.bz2",
)
dump.download_dump()
# If you already have the dump file locally, specify the path to the file.
dump = WiktionaryDump(dump_file_path="path-to-dump-file.xml.bz2")
dump.download_dump()
Parsing the dump file
from pprint import pprint
from wiktionary_de_parser import WiktionaryParser
# ... (see above)
parser = WiktionaryParser()
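for page in dump.pages():
    # Skip redirect pages; they carry no entries of their own.
    if page.redirect_to:
        continue
    if page.name == "Abend":
        for entry in parser.entries_from_page(page):
            results = parser.parse_entry(entry)
            pprint(results)
        # Stop scanning the dump once the page has been handled.
        break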
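The loop above can be generalized to look up several words in one pass over the dump. The following is a minimal sketch, not part of the library: `collect_entries` is a hypothetical helper, and it assumes only the attributes and methods used in the example above (`page.redirect_to`, `page.name`, `parser.entries_from_page`, `parser.parse_entry`).

```python
from typing import Any, Iterable


def collect_entries(
    pages: Iterable[Any], parser: Any, wanted: set[str]
) -> dict[str, list[Any]]:
    """Return the parsed entries for every page whose name is in `wanted`.

    Hypothetical helper: `pages` is expected to behave like `dump.pages()`
    and `parser` like a `WiktionaryParser` instance.
    """
    found: dict[str, list[Any]] = {}
    for page in pages:
        # Skip redirects, as in the loop above.
        if page.redirect_to:
            continue
        if page.name in wanted:
            found[page.name] = [
                parser.parse_entry(entry)
                for entry in parser.entries_from_page(page)
            ]
            # Stop reading the dump once every requested word was seen.
            if len(found) == len(wanted):
                break
    return found
```

With the objects from the example above, this could be called as `collect_entries(dump.pages(), parser, {"Abend", "Morgen"})`. Words that never appear in the dump are simply missing from the returned dictionary.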
Output
All page entries for "Abend":
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
)
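As the output shows, `flexion` can be `None` and `pos` maps a part of speech to a list of subtypes, so code that consumes these results should handle both cases. The following is a minimal sketch under those assumptions. `plural_forms` and `has_subtype` are hypothetical helpers, and the `SimpleNamespace` objects are stand-ins that mirror only the `flexion` and `pos` fields shown above, not the library's real classes.

```python
from types import SimpleNamespace
from typing import Any


def plural_forms(entry: Any) -> set[str]:
    """Collect the distinct plural forms from an entry's flexion table."""
    flexion = entry.flexion or {}  # flexion is None for some entries
    return {form for case, form in flexion.items() if case.endswith("Plural")}


def has_subtype(entry: Any, pos: str, subtype: str) -> bool:
    """Check whether `subtype` is listed under the part of speech `pos`."""
    return subtype in (entry.pos or {}).get(pos, [])


# Stand-ins for two of the parsed "Abend" entries (assumed shape).
common_noun = SimpleNamespace(
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Plural": "Abende",
        "Dativ Plural": "Abenden",
        "Akkusativ Plural": "Abende",
    },
    pos={"Substantiv": []},
)
surname = SimpleNamespace(flexion=None, pos={"Substantiv": ["Nachname"]})

print(sorted(plural_forms(common_noun)))  # ['Abende', 'Abenden']
print(sorted(plural_forms(surname)))  # []
print(has_subtype(surname, "Substantiv", "Nachname"))  # True
```

The same helpers should work on real parsed entries as long as they expose `flexion` and `pos` in the shape printed above.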
Development
This project uses Poetry.
- Install Poetry.
- Clone this repository.
- Run `poetry install` inside the project folder to install dependencies.
- There is a `notebook.ipynb` to test the parser.
- Run `poetry run pytest` to run tests.
License
MIT © Gregor Weichbrodt