Extracts data from German Wiktionary dump files. Allows you to add your own extraction methods 🚀
wiktionary-de-parser
This is a Python module to extract data from German Wiktionary XML files (for Python 3.7+). It allows you to add your own extraction methods.
Installation
pip install wiktionary-de-parser
Features
- Extracts flexion tables, genus, IPA, language, lemma, part of speech (basic), syllables, raw Wikitext
- Allows you to add your own extraction methods (pass them as argument)
- Yields per section, not per page (a word can have multiple meanings, and therefore multiple sections on one Wiktionary page)
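Because records are yielded per section, the same page title can show up more than once. The following is a minimal sketch of regrouping records by page, assuming every record carries a 'title' key. The `group_by_title` helper and the sample records are hypothetical and not part of the library.

```python
from collections import defaultdict


def group_by_title(records):
    """Collect per-section records under their page title."""
    grouped = defaultdict(list)
    for record in records:
        # Assumption: every record has a 'title' key
        grouped[record['title']].append(record)
    return dict(grouped)


# Hypothetical sample records standing in for Parser output
sample_records = [
    {'title': 'Bank', 'lang_code': 'de', 'pos': {'Substantiv': []}},
    {'title': 'Bank', 'lang_code': 'de', 'pos': {'Substantiv': []}},
    {'title': 'laufen', 'lang_code': 'de', 'pos': {'Verb': []}},
]

pages = group_by_title(sample_records)
print({title: len(sections) for title, sections in pages.items()})
```

Grouping like this keeps every record in memory, so for a full dump it is probably better to filter records first or to process them as they are yielded.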
Usage
from bz2 import BZ2File
from wiktionary_de_parser import Parser

bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz_file = BZ2File(bzfile_path)

for record in Parser(bz_file):
    if 'lang_code' not in record or record['lang_code'] != 'de':
        continue
    # do stuff with 'record'
Note: In this example we load a compressed Wiktionary dump file that was obtained from here.
Adding new extraction methods
An extraction method takes the following arguments:
- title (string): The title of the current Wiktionary page
- text (string): The Wikitext of the current word entry/section
- current_record (Dict): A dictionary with all values of the current iteration (e.g. current_record['lang_code'])
It must return a Dict with the results, or False if the record was processed unsuccessfully.
# Create a new extraction method
def my_method(title, text, current_record):
    # do stuff
    return {'my_field': my_data} if my_data else False

# Pass a list with all extraction methods to the class constructor:
for record in Parser(bz_file, custom_methods=[my_method]):
    print(record['my_field'])
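As a more concrete illustration of the signature described above, here is a minimal sketch of an extraction method that collects wikilink targets from the section text. The `extract_links` name, the 'links' field, the regular expression, and the sample Wikitext are all assumptions made for this example; the function is called directly so that it runs without a dump file.

```python
import re

# Matches the target of a wikilink such as [[Haus]] or [[Haus|Häuser]]
WIKILINK_RE = re.compile(r'\[\[([^\]|#]+)')


def extract_links(title, text, current_record):
    """Hypothetical extraction method: collect wikilink targets from the text."""
    links = [target.strip() for target in WIKILINK_RE.findall(text)]
    # Follow the documented contract: a dict with results, or False
    return {'links': links} if links else False


# Made-up Wikitext, used only to exercise the function
sample_text = 'Person, die [[Vorteil]]e nutzt, ohne [[Leistung|Leistungen]] zu erbringen'
result = extract_links('Trittbrettfahrer', sample_text, {})
print(result)
```

A method like this would be passed to the parser in the same way as `my_method`, that is, via `custom_methods=[extract_links]`.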
Output
Example output for the word "Trittbrettfahrer":
{'flexion': {'Akkusativ Plural': 'Trittbrettfahrer',
             'Akkusativ Singular': 'Trittbrettfahrer',
             'Dativ Plural': 'Trittbrettfahrern',
             'Dativ Singular': 'Trittbrettfahrer',
             'Genitiv Plural': 'Trittbrettfahrer',
             'Genitiv Singular': 'Trittbrettfahrers',
             'Genus': 'm',
             'Nominativ Plural': 'Trittbrettfahrer',
             'Nominativ Singular': 'Trittbrettfahrer'},
 'inflected': False,
 'ipa': ['ˈtʁɪtbʁɛtˌfaːʁɐ'],
 'lang': 'Deutsch',
 'lang_code': 'de',
 'lemma': 'Trittbrettfahrer',
 'pos': {'Substantiv': []},
 'syllables': ['Tritt', 'brett', 'fah', 'rer'],
 'title': 'Trittbrettfahrer'}
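A record is a plain nested dictionary, so its fields can be read with ordinary dictionary access. The sketch below pulls a few values out of a trimmed copy of the example record. The `describe` helper is hypothetical, and it uses `.get()` throughout because this description does not state which keys are present in every record.

```python
def describe(record):
    """Build a small summary from a parsed record (hypothetical helper)."""
    flexion = record.get('flexion', {})
    return {
        'lemma': record.get('lemma'),
        'pos': list(record.get('pos', {})),
        'genus': flexion.get('Genus'),
        'genitive_singular': flexion.get('Genitiv Singular'),
        'hyphenation': '·'.join(record.get('syllables', [])),
    }


# Trimmed copy of the example output for "Trittbrettfahrer"
record = {
    'flexion': {'Genitiv Singular': 'Trittbrettfahrers', 'Genus': 'm'},
    'lemma': 'Trittbrettfahrer',
    'pos': {'Substantiv': []},
    'syllables': ['Tritt', 'brett', 'fah', 'rer'],
}

summary = describe(record)
print(summary)
```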
Development
This project uses Poetry.
- Install Poetry.
- Clone this repository.
- Run poetry install inside of the project folder to install dependencies.
- Change wiktionary_de_parser/run.py to your needs.
- Run poetry run python wiktionary_de_parser/run.py to run the parser, or poetry run pytest to run tests.
License
MIT © Gregor Weichbrodt