Extracts data from German Wiktionary dump files.
wiktionary-de-parser
A Python module to extract data from German Wiktionary XML files (for Python 3.11+).
Features
- Extracts flexion tables, IPA transcriptions, language, genus, lemma, basic part-of-speech information and syllables of a word.
- Yields one record per entry, not per page (a page can contain multiple entries, because a word can have several meanings).
Installation
pip install wiktionary-de-parser
Or with Poetry:
poetry add wiktionary-de-parser
Usage
from bz2 import BZ2File
from wiktionary_de_parser import Parser
bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz_file = BZ2File(bzfile_path)
for record in Parser(bz_file):
    if record.lang_code != 'de':
        continue
    # do stuff with 'record'
Note: This example loads a compressed Wiktionary dump file (.xml.bz2) directly, without unpacking it first. Dump files are published on the Wikimedia dumps site.
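As a minimal sketch of what "do stuff with 'record'" could look like, the helper below collects German noun lemmas together with their grammatical gender. It relies only on the record fields shown in the Output section (lang_code, pos, flexion, lemma). The function name iter_german_nouns is hypothetical and not part of the package; it accepts any iterable of such records, so Parser(bz_file) can be passed in directly.

```python
from typing import Iterable, Iterator, Tuple


def iter_german_nouns(records: Iterable) -> Iterator[Tuple[str, str]]:
    """Yield (lemma, genus) for German noun records that carry a flexion table.

    Hypothetical helper, not part of wiktionary-de-parser. It assumes the
    attribute names shown in the example output: lang_code, pos, flexion, lemma.
    """
    for record in records:
        # Skip entries for other languages, as in the usage example.
        if record.lang_code != 'de':
            continue
        # pos is a dict keyed by part of speech, e.g. {'Substantiv': []}.
        if 'Substantiv' not in (record.pos or {}):
            continue
        # flexion is None for entries without an inflection table.
        genus = (record.flexion or {}).get('Genus')
        if genus is not None:
            yield record.lemma, genus
```

Because the helper is a generator, it processes the dump one record at a time and does not hold the whole file in memory.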
Output
Example output for the page "Abend":
Record(lemma='Abend',
       inflected=False,
       syllables=['Abend'],
       ipa=['ˈaːbn̩t', 'ˈaːbm̩t'],
       rhymes=['aːbn̩t'],
       pos={'Substantiv': []},
       lang='Deutsch',
       lang_code='de',
       flexion={'Akkusativ Plural': 'Abende',
                'Akkusativ Singular': 'Abend',
                'Dativ Plural': 'Abenden',
                'Dativ Singular': 'Abend',
                'Genitiv Plural': 'Abende',
                'Genitiv Singular': 'Abends',
                'Genus': 'm',
                'Nominativ Plural': 'Abende',
                'Nominativ Singular': 'Abend'},
       page_id=5719,
       index=0,
       title='Abend',
       wikitext=None)
Record(lemma='Abend',
       inflected=False,
       syllables=['Abend'],
       ipa=['ˈaːbn̩t'],
       rhymes=['aːbn̩t'],
       pos={'Substantiv': ['Nachname']},
       lang='Deutsch',
       lang_code='de',
       flexion=None,
       page_id=5719,
       index=1,
       title='Abend',
       wikitext=None)
Record(lemma='Abend',
       inflected=False,
       syllables=['Abend'],
       ipa=['ˈaːbn̩t', 'ˈaːbm̩t'],
       rhymes=['aːbn̩t'],
       pos={'Substantiv': ['Toponym']},
       lang='Deutsch',
       lang_code='de',
       flexion=None,
       page_id=5719,
       index=2,
       title='Abend',
       wikitext=None)
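The three records for "Abend" share the same page_id and differ only in index, since the parser yields one record per entry. If a per-page view is needed, entries can be regrouped afterwards. The following is a sketch that assumes only the page_id and index fields shown in the example output; group_by_page is a hypothetical helper, not part of the package.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_page(records: Iterable) -> Dict[int, List]:
    """Map each page_id to its entries, ordered by their index on the page.

    Hypothetical helper, not part of wiktionary-de-parser.
    """
    pages: Dict[int, List] = defaultdict(list)
    for record in records:
        pages[record.page_id].append(record)
    # Sort explicitly instead of relying on the order the records arrive in.
    for entries in pages.values():
        entries.sort(key=lambda r: r.index)
    return dict(pages)
```

This collects every record it is given in memory, so it is better suited to a filtered subset (for example, German entries only) than to a complete dump.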
Development
This project uses Poetry.
- Install Poetry.
- Clone this repository.
- Run poetry install inside the project folder to install dependencies.
- Change wiktionary_de_parser/run.py to your needs.
- Run poetry run python wiktionary_de_parser/run.py to run the parser, or poetry run pytest to run the tests.
License
MIT © Gregor Weichbrodt