Extracts data from German Wiktionary dump files.
wiktionary-de-parser
A Python module to extract data from German Wiktionary XML files (for Python 3.11+).
Features
- Extracts flexion (inflection) tables, IPA transcriptions, language, genus (grammatical gender), lemma, basic part-of-speech information, and syllables of a word.
- Yields one record per entry, not per page. A page can contain multiple entries, because a word can have several meanings.
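Because records arrive per entry, entries that belong to the same page have to be regrouped by the caller if a page-level view is needed. The following is a minimal sketch of that idea, not part of the library: `group_by_page` is a hypothetical helper, and the stand-in objects only mimic the `page_id` and `index` fields shown in the example output of this README.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_page(records):
    """Collect entries that share a page_id, ordered by their index.

    Hypothetical helper: it only assumes each record has the
    `page_id` and `index` attributes seen in the example output.
    """
    pages = defaultdict(list)
    for record in records:
        pages[record.page_id].append(record)
    for entries in pages.values():
        entries.sort(key=lambda r: r.index)
    return dict(pages)


# Stand-in objects for illustration; real records come from Parser.
sample = [
    SimpleNamespace(page_id=5719, index=1, lemma='Abend'),
    SimpleNamespace(page_id=5719, index=0, lemma='Abend'),
    SimpleNamespace(page_id=5719, index=2, lemma='Abend'),
]

pages = group_by_page(sample)
print(len(pages[5719]))
```

Note that this keeps every record in memory, so for a full dump it is probably better to group only the pages of interest.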
Installation
pip install wiktionary-de-parser
Or with Poetry:
poetry add wiktionary-de-parser
Usage
from bz2 import BZ2File
from wiktionary_de_parser import Parser
bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz_file = BZ2File(bzfile_path)
for record in Parser(bz_file):
    if record.lang_code != 'de':
        continue
    # do stuff with 'record'
Note: This example loads a compressed German Wiktionary dump file downloaded from the Wikimedia dumps site.
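The filtering step in the loop above can be wrapped in a small generator when more than one condition is needed. This is a sketch under assumptions: `iter_nouns` is a hypothetical helper, the `pos` field is assumed to be a dict keyed by part-of-speech name (as in the example output of this README), and the guard against a missing `pos` is a precaution rather than documented behavior.

```python
from types import SimpleNamespace


def iter_nouns(records, lang_code='de'):
    """Yield records in the given language whose pos contains 'Substantiv'.

    Hypothetical helper; works on any iterable of objects that have
    `lang_code` and `pos` attributes.
    """
    for record in records:
        if record.lang_code != lang_code:
            continue
        # Assumption: pos may be missing or empty for some entries.
        if 'Substantiv' in (record.pos or {}):
            yield record


# Stand-in objects for illustration; in real use, pass Parser(bz_file).
sample = [
    SimpleNamespace(lemma='Abend', lang_code='de', pos={'Substantiv': []}),
    SimpleNamespace(lemma='evening', lang_code='en', pos={'Substantiv': []}),
    SimpleNamespace(lemma='laufen', lang_code='de', pos={'Verb': []}),
]

for record in iter_nouns(sample):
    print(record.lemma)
```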
Output
Example output for the page "Abend":
Record(lemma='Abend',
       inflected=False,
       syllables=['Abend'],
       ipa=['ˈaːbn̩t', 'ˈaːbm̩t'],
       rhymes=['aːbn̩t'],
       pos={'Substantiv': []},
       lang='Deutsch',
       lang_code='de',
       flexion={'Akkusativ Plural': 'Abende',
                'Akkusativ Singular': 'Abend',
                'Dativ Plural': 'Abenden',
                'Dativ Singular': 'Abend',
                'Genitiv Plural': 'Abende',
                'Genitiv Singular': 'Abends',
                'Genus': 'm',
                'Nominativ Plural': 'Abende',
                'Nominativ Singular': 'Abend'},
       page_id=5719,
       index=0,
       title='Abend',
       wikitext=None)
Record(lemma='Abend',
       inflected=False,
       syllables=['Abend'],
       ipa=['ˈaːbn̩t'],
       rhymes=['aːbn̩t'],
       pos={'Substantiv': ['Nachname']},
       lang='Deutsch',
       lang_code='de',
       flexion=None,
       page_id=5719,
       index=1,
       title='Abend',
       wikitext=None)
Record(lemma='Abend',
       inflected=False,
       syllables=['Abend'],
       ipa=['ˈaːbn̩t', 'ˈaːbm̩t'],
       rhymes=['aːbn̩t'],
       pos={'Substantiv': ['Toponym']},
       lang='Deutsch',
       lang_code='de',
       flexion=None,
       page_id=5719,
       index=2,
       title='Abend',
       wikitext=None)
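As the output shows, `flexion` is either a dict that maps a case and number label to a word form, or `None` when the entry has no flexion table. A minimal sketch for collecting the distinct word forms follows; `inflected_forms` is a hypothetical helper, and skipping only the `'Genus'` key is an assumption based on this one example (other parts of speech may carry additional non-form keys).

```python
def inflected_forms(flexion):
    """Return the distinct word forms from a flexion dict.

    Hypothetical helper: skips the 'Genus' key, which holds the
    grammatical gender rather than a word form, and tolerates None.
    """
    if not flexion:
        return set()
    return {form for key, form in flexion.items() if key != 'Genus'}


# The flexion dict of the first 'Abend' record shown above.
abend_flexion = {
    'Akkusativ Plural': 'Abende',
    'Akkusativ Singular': 'Abend',
    'Dativ Plural': 'Abenden',
    'Dativ Singular': 'Abend',
    'Genitiv Plural': 'Abende',
    'Genitiv Singular': 'Abends',
    'Genus': 'm',
    'Nominativ Plural': 'Abende',
    'Nominativ Singular': 'Abend',
}

print(sorted(inflected_forms(abend_flexion)))
# ['Abend', 'Abende', 'Abenden', 'Abends']
```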
Development
This project uses Poetry.
- Install Poetry.
- Clone this repository.
- Run
    poetry install
  inside the project folder to install dependencies.
- Change
    wiktionary_de_parser/run.py
  to your needs.
- Run
    poetry run python wiktionary_de_parser/run.py
  to run the parser, or
    poetry run pytest
  to run the tests.
License
MIT © Gregor Weichbrodt