Extracts data from German Wiktionary dump files. Allows you to add your own extraction methods 🚀
wiktionary_de_parser
wiktionary_de_parser is a Python module to extract data from German Wiktionary XML files (for Python 3.7+). It allows you to add your own extraction methods.
Installation
pip3 install wiktionary_de_parser
Features
- comes with preset extraction methods for flexion tables, genus, IPA, language, lemma, part of speech (basic), syllables, and raw Wikitext
- allows you to add your own extraction methods (pass them as an argument)
- data values are normalized and cleaned of obsolete Wikitext markup
- yields per section, not per page (a word can have multiple meanings, which is why some Wiktionary pages have multiple 'sections')
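Because records are yielded per section, the same page title can show up more than once while iterating. The following is a minimal sketch of grouping records by page title, assuming each record is a dict with a 'title' key as in the sample data. `group_by_title` is a hypothetical helper and not part of the module, and the records used here are made-up stand-ins for what `Parser` yields.

```python
from collections import defaultdict


def group_by_title(records):
    """Collect all section records that belong to the same page title."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record['title']].append(record)
    return dict(grouped)


# Hypothetical stand-in records; real ones come from iterating over Parser(bz).
sample_records = [
    {'title': 'Bank', 'lang_code': 'de', 'pos': {'Substantiv': []}},
    {'title': 'Bank', 'lang_code': 'de', 'pos': {'Substantiv': []}},
    {'title': 'Haus', 'lang_code': 'de', 'pos': {'Substantiv': []}},
]
sections_per_page = {
    title: len(items) for title, items in group_by_title(sample_records).items()
}
print(sections_per_page)
```

Note that this keeps every record in memory, which may be impractical for a full dump; for large inputs it is probably better to process records as they are yielded.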
Usage
Import wiktionary_de_parser like this:
from bz2 import BZ2File
from wiktionary_de_parser import Parser

bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz = BZ2File(bzfile_path)

for record in Parser(bz):
    if 'lang_code' not in record or record['lang_code'] != 'de':
        continue
    # do stuff with 'record'
Note: in this example we use BZ2File to read a compressed Wiktionary dump file. Wiktionary dump files can be downloaded from the Wikimedia dumps site.
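The filtering step in the loop above can also be written as a small generator, which makes it easy to chain with other processing such as writing the results to a JSON Lines file. This is only a sketch: `iter_lang` and `write_jsonl` are hypothetical helpers, not part of the module, and the records below are made-up stand-ins so the example runs without a dump file.

```python
import io
import json


def iter_lang(records, lang_code='de'):
    """Yield only records whose 'lang_code' matches; skip records without one."""
    for record in records:
        if record.get('lang_code') == lang_code:
            yield record


def write_jsonl(records, out):
    """Write one JSON object per line and return the number of records written."""
    count = 0
    for record in records:
        out.write(json.dumps(record, ensure_ascii=False) + '\n')
        count += 1
    return count


# Hypothetical stand-in records; with the module installed, pass Parser(bz) instead.
fake_records = [
    {'title': 'Haus', 'lang_code': 'de'},
    {'title': 'house', 'lang_code': 'en'},
    {'title': 'Fragment'},  # no 'lang_code' key at all
]
buffer = io.StringIO()  # stands in for an open text file
written = write_jsonl(iter_lang(fake_records), buffer)
```

`ensure_ascii=False` keeps umlauts and IPA characters readable in the output instead of escaping them.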
Adding new extraction methods
All extraction methods must return a Dict and accept the following arguments:
- title (string): The title of the current Wiktionary page
- text (string): The Wikitext of the current word entry/section
- current_record (Dict): A dictionary with all values of the current iteration (e.g. current_record['lang_code'])
# Create a new extraction method
def my_method(title, text, current_record):
    # do stuff, e.g. derive a value from the section's Wikitext
    my_data = len(text)  # placeholder value for illustration
    return {'my_field': my_data}

# Pass a list with all extraction methods to the class constructor:
for record in Parser(bz, custom_methods=[my_method]):
    print(record['my_field'])
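As a more concrete illustration, here is a sketch of an extraction method that collects translations from `{{Ü|…|…}}` templates, which appear in the translation tables of the Wikitext. `extract_translations` and the `translations` field are hypothetical names chosen for this example. The regular expression is an assumption that only covers the simple two-argument form of the template, so other variants would need extra handling.

```python
import re

# Matches simple translation templates such as {{Ü|en|free rider}}.
TRANSLATION_RE = re.compile(r'\{\{Ü\|([^|{}]+)\|([^|{}]*)\}\}')


def extract_translations(title, text, current_record):
    """Hypothetical extraction method: map language codes to translated words."""
    translations = {}
    for lang_code, word in TRANSLATION_RE.findall(text):
        word = word.strip()
        if word:  # skip empty templates like {{Ü|es|}}
            translations.setdefault(lang_code, []).append(word)
    return {'translations': translations}


# A short Wikitext fragment in the style of a Wiktionary translation table.
wikitext = (
    '*{{en}}: [1] {{Ü|en|free rider}}\n'
    '*{{fi}}: [1] {{Ü|fi|siipeilijä}}, {{Ü|fi|vapaamatkustaja}}\n'
    '*{{es}}: [1] {{Ü|es|}}\n'
)
result = extract_translations('Trittbrettfahrer', wikitext, {})
print(result)
```

The function follows the signature described above, so it can be passed to the constructor in the same way: `Parser(bz, custom_methods=[extract_translations])`.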
Sample data:
{'flexion': {'Akkusativ Plural': 'Trittbrettfahrer',
'Akkusativ Singular': 'Trittbrettfahrer',
'Dativ Plural': 'Trittbrettfahrern',
'Dativ Singular': 'Trittbrettfahrer',
'Genitiv Plural': 'Trittbrettfahrer',
'Genitiv Singular': 'Trittbrettfahrers',
'Genus': 'm',
'Nominativ Plural': 'Trittbrettfahrer',
'Nominativ Singular': 'Trittbrettfahrer'},
'inflected': False,
'ipa': ['ˈtʁɪtbʁɛtˌfaːʁɐ'],
'lang': 'Deutsch',
'lang_code': 'de',
'lemma': 'Trittbrettfahrer',
'pos': {'Substantiv': []},
'syllables': ['Tritt', 'brett', 'fah', 'rer'],
'title': 'Trittbrettfahrer',
'wikitext': '=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===\n'
'\n'
'{{Deutsch Substantiv Übersicht\n'
'|Genus=m\n'
'|Nominativ Singular=Trittbrettfahrer\n'
'|Nominativ Plural=Trittbrettfahrer\n'
'|Genitiv Singular=Trittbrettfahrers\n'
'|Genitiv Plural=Trittbrettfahrer\n'
'|Dativ Singular=Trittbrettfahrer\n'
'|Dativ Plural=Trittbrettfahrern\n'
'|Akkusativ Singular=Trittbrettfahrer\n'
'|Akkusativ Plural=Trittbrettfahrer\n'
'}}\n'
'\n'
'{{Worttrennung}}\n'
':Tritt·brett·fah·rer, {{Pl.}} Tritt·brett·fah·rer\n'
'\n'
'{{Aussprache}}\n'
':{{IPA}} {{Lautschrift|ˈtʁɪtbʁɛtˌfaːʁɐ}}\n'
':{{Hörbeispiele}} {{Audio|}}\n'
'\n'
'{{Bedeutungen}}\n'
':[1] Person, die ohne [[Anstrengung]] an Vorteilen teilhaben '
'will\n'
'\n'
'{{Herkunft}}\n'
':[[Determinativkompositum]] aus den Substantiven '
"''[[Trittbrett]]'' und ''[[Fahrer]]''\n"
'\n'
'{{Weibliche Wortformen}}\n'
':[1] [[Trittbrettfahrerin]]\n'
'\n'
'{{Beispiele}}\n'
':[1] „Bleibt schließlich noch das Problem der '
"''Trittbrettfahrer,'' die sich ohne Versicherung aus "
'Nachlässigkeit in das soziale Netz abgleiten '
'lassen.“<ref>{{Internetquelle|url=http://books.google.se/books?id=VjLq84xNpfMC&pg=PA446&dq=trittbrettfahrer&hl=de&sa=X&ei=8AztU4aVJYq_ygOd1oKIDA&ved=0CEEQ6AEwBjgK#v=onepage&q=trittbrettfahrer&f=false|titel=Öffentliche '
'Finanzen in der Demokratie: Eine Einführung, Charles B. '
'Blankart|zugriff=2014-08-14}}</ref>\n'
'\n'
'{{Wortbildungen}}\n'
':[1] [[Trittbrettfahrer-Problem]]\n'
'\n'
'==== {{Übersetzungen}} ====\n'
'{{Ü-Tabelle|Ü-links=\n'
'*{{en}}: [1] {{Ü|en|free rider}}\n'
'*{{fi}}: [1] {{Ü|fi|siipeilijä}}, {{Ü|fi|vapaamatkustaja}}\n'
'*{{fr}}: [1] {{Ü|fr|profiteur}}\n'
'|Ü-rechts=\n'
'*{{it}}: [1] {{Ü|it|scroccone}} {{m}}\n'
'*{{es}}: [1] {{Ü|es|}}\n'
'}}\n'
'\n'
'{{Referenzen}}\n'
':[1] {{Wikipedia|Trittbrettfahrer}}\n'
':[*] {{Ref-DWDS|Trittbrettfahrer}}\n'
':[*] {{Ref-Canoo|Trittbrettfahrer}}\n'
':[1] {{Ref-UniLeipzig|Trittbrettfahrer}}\n'
':[1] {{Ref-FreeDictionary|Trittbrettfahrer}}\n'
'\n'
'{{Quellen}}'}
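A record like the one above can be reshaped for downstream use. The sketch below builds a small declension table from the 'flexion' dict and joins the 'syllables' list into a hyphenation string. `declension_table` is a hypothetical helper, not part of the module, and the record is an abbreviated copy of the sample data.

```python
def declension_table(record):
    """Return {case: (singular, plural)} built from a record's 'flexion' dict."""
    flexion = record.get('flexion', {})
    table = {}
    for case in ('Nominativ', 'Genitiv', 'Dativ', 'Akkusativ'):
        # Missing forms become None instead of raising a KeyError.
        table[case] = (
            flexion.get(case + ' Singular'),
            flexion.get(case + ' Plural'),
        )
    return table


# Abbreviated copy of the sample record.
record = {
    'lemma': 'Trittbrettfahrer',
    'syllables': ['Tritt', 'brett', 'fah', 'rer'],
    'flexion': {
        'Nominativ Singular': 'Trittbrettfahrer',
        'Nominativ Plural': 'Trittbrettfahrer',
        'Genitiv Singular': 'Trittbrettfahrers',
        'Genitiv Plural': 'Trittbrettfahrer',
        'Dativ Singular': 'Trittbrettfahrer',
        'Dativ Plural': 'Trittbrettfahrern',
        'Akkusativ Singular': 'Trittbrettfahrer',
        'Akkusativ Plural': 'Trittbrettfahrer',
        'Genus': 'm',
    },
}
table = declension_table(record)
hyphenated = '·'.join(record['syllables'])
```

Using `.get()` throughout means the helper also works for records without a flexion table, such as non-noun entries, where every form is simply `None`.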
License
MIT © Gregor Weichbrodt