Extracts data from German Wiktionary dump files. Allows you to add your own extraction methods 🚀
Project description
wiktionary_de_parser
wiktionary_de_parser
is a Python module to extract data from German Wiktionary XML files. It allows you to add your own extraction methods.
Requirements
- Python 3.7 (might work with other 3.+ versions, but not tested)
Features
- comes with preset extraction methods for:
- flexion tables, genus, IPA, language, lemma, part of speech, syllables, raw Wikitext
- allows you to add your own extraction methods (pass them as argument)
- data values are normalized and cleaned from obsolete Wikitext markup
- yields per section, not per page (a word can have multiple meanings, which is why some Wiktionary pages have multiple 'sections')
Usage
- Install via
pip3 install wiktionary_de_parser
. - Import
wiktionary_de_parser
like this:
from bz2file import BZ2File
from wiktionary_de_parser import Parser
bzfile_path = 'C:/Users/Gregor/Downloads/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz = BZ2File(bzfile_path)
for record in Parser(bz):
if 'language' not in record or record['language'] != 'Deutsch':
continue
# do stuff with 'record'
Note: in this example we use BZ2File to read a compressed Wiktionary dump file. The Wiktionary dump file is obtained from here.
Adding new extraction methods
All extraction methods must return a Dict()
and accept the following arguments:
title
(string): The title of the current Wiktionary pagetext
(string): The Wikitext of the current word entry/sectioncurrent_record
(Dict): A dictionary with all values of the current iteration (e. g.current_record['language']
)
# Create a new extraction method
def my_method(title, text, current_record):
# do stuff
return {'my_field': my_data}
# Pass a list with all extraction methods to the class constructor:
for record in Parser(bz, custom_methods=[my_method]):
print(record['my_field'])
Sample data:
{'flexion': {'Akkusativ Plural': 'Trittbrettfahrer',
'Akkusativ Singular': 'Trittbrettfahrer',
'Dativ Plural': 'Trittbrettfahrern',
'Dativ Singular': 'Trittbrettfahrer',
'Genitiv Plural': 'Trittbrettfahrer',
'Genitiv Singular': 'Trittbrettfahrers',
'Genus': 'm',
'Nominativ Plural': 'Trittbrettfahrer',
'Nominativ Singular': 'Trittbrettfahrer'},
'inflected': False,
'ipa': 'ˈtʁɪtbʁɛtˌfaːʁɐ',
'language': 'Deutsch',
'lemma': 'Trittbrettfahrer',
'pos': {'Substantiv': []},
'syllables': ['Tritt', 'brett', 'fah', 'rer'],
'title': 'Trittbrettfahrer',
'wikitext': '=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===\n'
'\n'
'{{Deutsch Substantiv Übersicht\n'
'|Genus=m\n'
'|Nominativ Singular=Trittbrettfahrer\n'
'|Nominativ Plural=Trittbrettfahrer\n'
'|Genitiv Singular=Trittbrettfahrers\n'
'|Genitiv Plural=Trittbrettfahrer\n'
'|Dativ Singular=Trittbrettfahrer\n'
'|Dativ Plural=Trittbrettfahrern\n'
'|Akkusativ Singular=Trittbrettfahrer\n'
'|Akkusativ Plural=Trittbrettfahrer\n'
'}}\n'
'\n'
'{{Worttrennung}}\n'
':Tritt·brett·fah·rer, {{Pl.}} Tritt·brett·fah·rer\n'
'\n'
'{{Aussprache}}\n'
':{{IPA}} {{Lautschrift|ˈtʁɪtbʁɛtˌfaːʁɐ}}\n'
':{{Hörbeispiele}} {{Audio|}}\n'
'\n'
'{{Bedeutungen}}\n'
':[1] Person, die ohne [[Anstrengung]] an Vorteilen teilhaben '
'will\n'
'\n'
'{{Herkunft}}\n'
':[[Determinativkompositum]] aus den Substantiven '
"''[[Trittbrett]]'' und ''[[Fahrer]]''\n"
'\n'
'{{Weibliche Wortformen}}\n'
':[1] [[Trittbrettfahrerin]]\n'
'\n'
'{{Beispiele}}\n'
':[1] „Bleibt schließlich noch das Problem der '
"''Trittbrettfahrer,'' die sich ohne Versicherung aus "
'Nachlässigkeit in das soziale Netz abgleiten '
'lassen.“<ref>{{Internetquelle|url=http://books.google.se/books?id=VjLq84xNpfMC&pg=PA446&dq=trittbrettfahrer&hl=de&sa=X&ei=8AztU4aVJYq_ygOd1oKIDA&ved=0CEEQ6AEwBjgK#v=onepage&q=trittbrettfahrer&f=false|titel=Öffentliche '
'Finanzen in der Demokratie: Eine Einführung, Charles B. '
'Blankart|zugriff=2014-08-14}}</ref>\n'
'\n'
'{{Wortbildungen}}\n'
':[1] [[Trittbrettfahrer-Problem]]\n'
'\n'
'==== {{Übersetzungen}} ====\n'
'{{Ü-Tabelle|Ü-links=\n'
'*{{en}}: [1] {{Ü|en|free rider}}\n'
'*{{fi}}: [1] {{Ü|fi|siipeilijä}}, {{Ü|fi|vapaamatkustaja}}\n'
'*{{fr}}: [1] {{Ü|fr|profiteur}}\n'
'|Ü-rechts=\n'
'*{{it}}: [1] {{Ü|it|scroccone}} {{m}}\n'
'*{{es}}: [1] {{Ü|es|}}\n'
'}}\n'
'\n'
'{{Referenzen}}\n'
':[1] {{Wikipedia|Trittbrettfahrer}}\n'
':[*] {{Ref-DWDS|Trittbrettfahrer}}\n'
':[*] {{Ref-Canoo|Trittbrettfahrer}}\n'
':[1] {{Ref-UniLeipzig|Trittbrettfahrer}}\n'
':[1] {{Ref-FreeDictionary|Trittbrettfahrer}}\n'
'\n'
'{{Quellen}}'}
Vendor packages
License
MIT © Gregor Weichbrodt
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
wiktionary_de_parser-0.7.2.tar.gz
(12.3 kB
view hashes)
Built Distribution
Close
Hashes for wiktionary_de_parser-0.7.2.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | f4bf76c17f7eb659992dc18926805f9b9e356a5034dc0fdc46c5ddc464f3c265 |
|
MD5 | 3a1209cb2585b090f0e8bf9fcefcdf5e |
|
BLAKE2b-256 | fa11cf2817847e23dfd160136656ec105878e99ad42c1316652bc542bd589bad |
Close
Hashes for wiktionary_de_parser-0.7.2-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 3adfeb4246b5ba0a24d28acd50e66511a59e9c4dcd86a5049b78b01850dba161 |
|
MD5 | 3f5064c8fe4da2ae653023267928484d |
|
BLAKE2b-256 | cfe16b66e55c8f9ac8ee15197696a090c11ff7badf94c0d76758a0c91a40d958 |