Extracts data from German Wiktionary dump files. Allows you to add your own extraction methods 🚀
wiktionary-de-parser
This is a Python module to extract data from German Wiktionary XML files (for Python 3.7+). It allows you to add your own extraction methods.
Installation
pip install wiktionary-de-parser
Features
- Extracts flexion tables, genus, IPA, language, lemma, part of speech (basic), syllables, raw Wikitext
- Allows you to add your own extraction methods (pass them as argument)
- Yields per section, not per page (a word can have multiple meanings, and therefore multiple sections on a Wiktionary page)
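Because records are yielded per section, several records can share one page title. A minimal sketch of regrouping them by page follows; `group_by_title` is a hypothetical helper (not part of this module) and it assumes each record carries a `title` key, as in the example output.

```python
from collections import defaultdict


def group_by_title(records):
    """Group section-level records by the page title they came from."""
    grouped = defaultdict(list)
    for record in records:
        # Assumption: 'title' is present on each record; skip any that lack it.
        title = record.get("title")
        if title is not None:
            grouped[title].append(record)
    return dict(grouped)
```

Any iterable of record dictionaries can be passed in, so the helper can be tried on hand-written dictionaries before running it against a full dump.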
Usage
from bz2 import BZ2File
from wiktionary_de_parser import Parser
bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz_file = BZ2File(bzfile_path)
for record in Parser(bz_file):
    if 'lang_code' not in record or record['lang_code'] != 'de':
        continue
    # do stuff with 'record'
Note: In this example we load a compressed Wiktionary dump file that was obtained from here.
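As one illustration of what "do stuff" could look like, the sketch below tallies parts of speech across German records. `count_parts_of_speech` is a hypothetical helper, and it assumes `pos` is a dictionary keyed by part-of-speech name, as in the example output for "Abend".

```python
from collections import Counter


def count_parts_of_speech(records, lang_code="de"):
    """Count part-of-speech names over all records of one language."""
    counts = Counter()
    for record in records:
        # Same filter as in the usage example: ignore other languages.
        if record.get("lang_code") != lang_code:
            continue
        # Assumption: 'pos' looks like {'Substantiv': []}; count its keys.
        counts.update(record.get("pos", {}).keys())
    return counts
```

With a real dump, the `records` argument would be the `Parser(bz_file)` iterator from the usage example.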
Adding new extraction methods
An extraction method takes the following arguments:
- title (string): The title of the current Wiktionary page
- text (string): The Wikitext of the current word entry/section
- current_record (Dict): A dictionary with all values of the current iteration (e.g. current_record['lang_code'])
It must return a Dict with the results, or False if the record was processed unsuccessfully.
# Create a new extraction method
def my_method(title, text, current_record):
    # do stuff
    return {'my_field': my_data} if my_data else False

# Pass a list with all extraction methods to the class constructor:
for record in Parser(bz_file, custom_methods=[my_method]):
    print(record['my_field'])
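A more concrete extraction method might look like the sketch below. It is an illustration, not one of the module's built-in extractors: `extract_synonyms` and the `synonyms` field are hypothetical names, and the regular expressions assume that synonyms appear under a `{{Synonyme}}` marker on lines starting with `:`, with each synonym written as a `[[wiki link]]`. Real entries may deviate from that layout.

```python
import re

# Assumption: the synonym block is '{{Synonyme}}' followed by ':'-prefixed lines.
SYNONYM_BLOCK = re.compile(r"\{\{Synonyme\}\}\n((?::.*\n?)+)")
# Capture the link target, stopping at ']', '|' (display text) or '#' (anchor).
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")


def extract_synonyms(title, text, current_record):
    """Hypothetical extraction method: collect linked synonyms of a section."""
    block = SYNONYM_BLOCK.search(text)
    if block is None:
        return False
    synonyms = WIKI_LINK.findall(block.group(1))
    # Follow the documented contract: a Dict on success, False otherwise.
    return {"synonyms": synonyms} if synonyms else False
```

It would be passed to the parser in the same way as `my_method`, via `custom_methods=[extract_synonyms]`.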
Output
Example output for the word "Abend":
{'flexion': {'Akkusativ Plural': 'Abende',
             'Akkusativ Singular': 'Abend',
             'Dativ Plural': 'Abenden',
             'Dativ Singular': 'Abend',
             'Genitiv Plural': 'Abende',
             'Genitiv Singular': 'Abends',
             'Genus': 'm',
             'Nominativ Plural': 'Abende',
             'Nominativ Singular': 'Abend'},
 'inflected': False,
 'ipa': ['ˈaːbn̩t', 'ˈaːbm̩t'],
 'lang': 'Deutsch',
 'lang_code': 'de',
 'lemma': 'Abend',
 'pos': {'Substantiv': []},
 'rhymes': ['aːbn̩t'],
 'syllables': ['Abend'],
 'title': 'Abend'}
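The nested `flexion` dictionary can be read like any other Python dictionary. The sketch below pulls a few noun properties out of a record shaped like the one above; `summarize_noun` is a hypothetical helper, and it assumes the key names shown in the example (`Genus`, `Genitiv Singular`), which other entries may not contain.

```python
def summarize_noun(record):
    """Return (lemma, genus, genitive singular), or None without flexion data."""
    flexion = record.get("flexion")
    if not flexion:
        return None
    # .get() yields None for any key that a given entry does not provide.
    return (
        record.get("lemma"),
        flexion.get("Genus"),
        flexion.get("Genitiv Singular"),
    )
```

Using `.get()` throughout keeps the helper from raising `KeyError` on entries that lack a flexion table or one of its cells.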
Development
This project uses Poetry.
- Install Poetry.
- Clone this repository.
- Run poetry install inside the project folder to install dependencies.
- Change wiktionary_de_parser/run.py to your needs.
- Run poetry run python wiktionary_de_parser/run.py to run the parser, or poetry run pytest to run tests.
License
MIT © Gregor Weichbrodt