Extracts data from German Wiktionary dump files. Allows you to add your own extraction methods 🚀
wiktionary-de-parser
This is a Python module (for Python 3.7+) that extracts data from German Wiktionary XML dump files. It allows you to add your own extraction methods.
Installation
pip install wiktionary-de-parser
Features
- Extracts flexion tables, genus, IPA, language, lemma, part of speech (basic), syllables, raw Wikitext
- Allows you to add your own extraction methods (pass them as argument)
- Yields per section, not per page (a word can have multiple meanings and therefore multiple sections on a Wiktionary page)
Usage
from bz2 import BZ2File

from wiktionary_de_parser import Parser

bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz_file = BZ2File(bzfile_path)

for record in Parser(bz_file):
    if 'lang_code' not in record or record['lang_code'] != 'de':
        continue
    # do stuff with 'record'
Note: In this example we load a compressed Wiktionary dump file that was obtained from here.
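As one possible way to fill in the "do stuff" step, the sketch below writes every German record to a JSON Lines stream. The helper name dump_german_records and the sample records are made up for illustration and are not part of the package. It also assumes that records contain only JSON-serializable values, as in the example output shown in this README.

```python
import io
import json


def dump_german_records(records, out):
    """Write each record whose lang_code is 'de' to `out`, one JSON object per line.

    `records` can be any iterable of dicts, for example Parser(bz_file).
    Returns the number of records written.
    """
    count = 0
    for record in records:
        # .get() also covers records that have no 'lang_code' key at all
        if record.get('lang_code') != 'de':
            continue
        # ensure_ascii=False keeps umlauts and IPA symbols readable
        out.write(json.dumps(record, ensure_ascii=False) + '\n')
        count += 1
    return count


# Stand-in records (hypothetical); in real use pass Parser(bz_file) instead
sample = [
    {'title': 'Abend', 'lang_code': 'de', 'lemma': 'Abend'},
    {'title': 'evening', 'lang_code': 'en', 'lemma': 'evening'},
    {'title': 'Foo'},
]
buffer = io.StringIO()
written = dump_german_records(sample, buffer)
```

Because the helper accepts any iterable, it can be tried on a handful of hand-written dicts before running it against a full dump.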
Adding new extraction methods
An extraction method takes the following arguments:
- title (string): The title of the current Wiktionary page
- text (string): The Wikitext of the current word entry/section
- current_record (Dict): A dictionary with all values of the current iteration (e.g. current_record['lang_code'])
It must return a Dict with the results, or False if the record was processed unsuccessfully.
# Create a new extraction method
def my_method(title, text, current_record):
    # do stuff
    return {'my_field': my_data} if my_data else False

# Pass a list with all extraction methods to the class constructor:
for record in Parser(bz_file, custom_methods=[my_method]):
    print(record['my_field'])
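To make the contract above concrete, here is a minimal sketch of an extraction method that collects numbered meaning lines. The function name extract_meanings, the result key 'meanings' and the sample text are hypothetical. It assumes that a section lists its meanings as lines of the form ":[1] ..." directly below a {{Bedeutungen}} marker; real entries may use nested or differently numbered lines that this sketch ignores.

```python
import re

# Assumed line layout: ':[1] some definition'
MEANING_LINE = re.compile(r'^:\[(\d+)\]\s*(.+)$')


def extract_meanings(title, text, current_record):
    """Hypothetical extraction method: collect numbered lines under {{Bedeutungen}}.

    Follows the documented contract: returns a dict with the results,
    or False if nothing could be extracted.
    """
    meanings = []
    in_section = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == '{{Bedeutungen}}':
            in_section = True
            continue
        if in_section:
            match = MEANING_LINE.match(stripped)
            if not match:
                # The first line that is not a numbered meaning ends the block
                break
            meanings.append(match.group(2))
    return {'meanings': meanings} if meanings else False


# Invented sample text for illustration, not copied from Wiktionary
sample_text = '\n'.join([
    '{{Bedeutungen}}',
    ':[1] Tageszeit vor der Nacht',
    ':[2] Veranstaltung am Abend',
    '',
    '{{Beispiele}}',
    ':[1] Es wird Abend.',
])
result = extract_meanings('Abend', sample_text, {})
empty = extract_meanings('Abend', 'no meanings here', {})
```

A method like this would be passed to the parser through custom_methods in the same way as my_method.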
Output
Example output for the word "Abend":
{'flexion': {'Akkusativ Plural': 'Abende',
             'Akkusativ Singular': 'Abend',
             'Dativ Plural': 'Abenden',
             'Dativ Singular': 'Abend',
             'Genitiv Plural': 'Abende',
             'Genitiv Singular': 'Abends',
             'Genus': 'm',
             'Nominativ Plural': 'Abende',
             'Nominativ Singular': 'Abend'},
 'inflected': False,
 'ipa': ['ˈaːbn̩t', 'ˈaːbm̩t'],
 'lang': 'Deutsch',
 'lang_code': 'de',
 'lemma': 'Abend',
 'pos': {'Substantiv': []},
 'rhymes': ['aːbn̩t'],
 'syllables': ['Abend'],
 'title': 'Abend'}
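A record like the one above can be post-processed with plain dictionary operations. The sketch below gathers the distinct inflected forms from the 'flexion' table. The helper name inflected_forms is hypothetical, and skipping every key that starts with 'Genus' is an assumption based on the example output, where 'Genus' holds the grammatical gender and not a word form.

```python
def inflected_forms(record):
    """Return the sorted, de-duplicated word forms found in record['flexion'].

    Keys starting with 'Genus' are skipped because they are assumed to
    carry the grammatical gender rather than an inflected form.
    """
    flexion = record.get('flexion') or {}
    return sorted({
        form
        for key, form in flexion.items()
        if not key.startswith('Genus')
    })


# Trimmed copy of the example record for "Abend"
abend = {
    'flexion': {
        'Akkusativ Plural': 'Abende',
        'Akkusativ Singular': 'Abend',
        'Dativ Plural': 'Abenden',
        'Dativ Singular': 'Abend',
        'Genitiv Plural': 'Abende',
        'Genitiv Singular': 'Abends',
        'Genus': 'm',
        'Nominativ Plural': 'Abende',
        'Nominativ Singular': 'Abend',
    },
    'lemma': 'Abend',
}
forms = inflected_forms(abend)
no_forms = inflected_forms({'lemma': 'und'})
```

Records without a 'flexion' table simply yield an empty list, so the helper can be applied to every record without an extra check.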
Development
This project uses Poetry.
- Install Poetry.
- Clone this repository.
- Run poetry install inside of the project folder to install dependencies.
- Change wiktionary_de_parser/run.py to your needs.
- Run poetry run python wiktionary_de_parser/run.py to run the parser, or poetry run pytest to run tests.
License
MIT © Gregor Weichbrodt