Extracts data from German Wiktionary dump files. Allows you to add your own extraction methods 🚀
wiktionary-de-parser
This is a Python module (for Python 3.7+) that extracts data from German Wiktionary XML dump files. It allows you to add your own extraction methods.
Installation
pip install wiktionary-de-parser
Features
- Extracts flexion tables, genus, IPA, language, lemma, part of speech (basic), syllables, raw Wikitext
- Allows you to add your own extraction methods (pass them as argument)
- Yields per section, not per page (a word can have multiple meanings, and therefore multiple sections on a Wiktionary page)
Usage
from bz2 import BZ2File
from wiktionary_de_parser import Parser

bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz_file = BZ2File(bzfile_path)

for record in Parser(bz_file):
    if 'lang_code' not in record or record['lang_code'] != 'de':
        continue
    # do stuff with 'record'
Note: In this example we load a compressed Wiktionary dump file that was obtained from here.
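The language filter from the loop above can also be factored into a small generator. This is only a sketch: the helper name german_records is hypothetical and not part of wiktionary-de-parser, and it assumes nothing about the records beyond them being dictionaries that may contain a 'lang_code' key.

```python
def german_records(records):
    """Yield only the records whose 'lang_code' is 'de'.

    Hypothetical helper, not part of wiktionary-de-parser.
    Works on any iterable of dictionaries.
    """
    for record in records:
        # .get() returns None when the key is missing, so such records are skipped
        if record.get('lang_code') == 'de':
            yield record


# Stand-in data shaped like parser records (illustrative only)
sample = [
    {'title': 'Abend', 'lang_code': 'de'},
    {'title': 'evening', 'lang_code': 'en'},
    {'title': 'untagged'},
]
filtered = list(german_records(sample))
```

With the real parser, the same helper could presumably wrap the iterator directly, as in german_records(Parser(bz_file)).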
Adding new extraction methods
An extraction method takes the following arguments:
- title (string): The title of the current Wiktionary page
- text (string): The Wikitext of the current word entry/section
- current_record (Dict): A dictionary with all values of the current iteration (e.g. current_record['lang_code'])
It must return a Dict with the results, or False if the record was processed unsuccessfully.
# Create a new extraction method
def my_method(title, text, current_record):
    # do stuff
    return {'my_field': my_data} if my_data else False

# Pass a list with all extraction methods to the class constructor:
for record in Parser(bz_file, custom_methods=[my_method]):
    print(record['my_field'])
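As a more concrete illustration, here is a minimal sketch of an extraction method that follows the signature described above. The function name extract_meanings and the 'meanings' field are hypothetical. It also assumes that the section's Wikitext lists meanings as lines starting with ":" directly after a {{Bedeutungen}} template, which may not hold for every entry.

```python
import re

# Assumption: meanings follow a {{Bedeutungen}} template as ':'-prefixed lines,
# and the block ends at a blank line, at the next template, or at end of text.
MEANINGS_BLOCK = re.compile(
    r"\{\{Bedeutungen\}\}\n(.*?)(?:\n\n|\n\{\{|\Z)", re.DOTALL
)


def extract_meanings(title, text, current_record):
    """Hypothetical extraction method: collect the meaning lines of a section."""
    match = MEANINGS_BLOCK.search(text)
    if not match:
        return False  # no {{Bedeutungen}} block found in this section
    meanings = [
        line.lstrip(":").strip()
        for line in match.group(1).splitlines()
        if line.startswith(":")
    ]
    # Return a dict on success and False otherwise, as the parser expects
    return {"meanings": meanings} if meanings else False
```

It would be passed to the parser like any other custom method, for example Parser(bz_file, custom_methods=[extract_meanings]).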
Output
Example output for the word "Abend":
{'flexion': {'Akkusativ Plural': 'Abende',
             'Akkusativ Singular': 'Abend',
             'Dativ Plural': 'Abenden',
             'Dativ Singular': 'Abend',
             'Genitiv Plural': 'Abende',
             'Genitiv Singular': 'Abends',
             'Genus': 'm',
             'Nominativ Plural': 'Abende',
             'Nominativ Singular': 'Abend'},
 'inflected': False,
 'ipa': ['ˈaːbn̩t', 'ˈaːbm̩t'],
 'lang': 'Deutsch',
 'lang_code': 'de',
 'lemma': 'Abend',
 'pos': {'Substantiv': []},
 'rhymes': ['aːbn̩t'],
 'syllables': ['Abend'],
 'title': 'Abend'}
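A record of this shape can be post-processed with plain dictionary operations. The sketch below collects the distinct inflected forms from the 'flexion' table. The helper name inflected_forms is hypothetical, and it assumes that 'Genus' is the only non-form key in the table, as in the example above. Other entries may contain further keys that need to be excluded.

```python
def inflected_forms(record):
    """Return the sorted, distinct word forms from a record's flexion table.

    Hypothetical helper, not part of wiktionary-de-parser.
    """
    flexion = record.get('flexion') or {}
    # 'Genus' holds the grammatical gender, not a word form, so it is skipped
    return sorted({form for key, form in flexion.items() if key != 'Genus'})


# Abbreviated copy of the example record shown above
abend = {
    'lemma': 'Abend',
    'flexion': {
        'Akkusativ Plural': 'Abende',
        'Dativ Plural': 'Abenden',
        'Genitiv Singular': 'Abends',
        'Genus': 'm',
        'Nominativ Singular': 'Abend',
    },
}
forms = inflected_forms(abend)
```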
Development
This project uses Poetry.
- Install Poetry.
- Clone this repository.
- Run `poetry install` inside of the project folder to install dependencies.
- Change `wiktionary_de_parser/run.py` to your needs.
- Run `poetry run python wiktionary_de_parser/run.py` to run the parser, or `poetry run pytest` to run tests.
License
MIT © Gregor Weichbrodt