Extracts data from German Wiktionary dump files. Allows you to add your own extraction methods 🚀
wiktionary_de_parser
wiktionary_de_parser is a Python module to extract data from German Wiktionary XML files. It allows you to add your own extraction methods.
Requirements
- Python 3.7 (other 3.x versions might work, but are untested)
Features
- comes with preset extraction methods for flexion tables, genus, IPA, language, lemma, part of speech (basic), syllables and raw Wikitext
- allows you to add your own extraction methods (pass them as argument)
- data values are normalized and stripped of superfluous Wikitext markup
- yields per section, not per page (a word can have multiple meanings, which is why some Wiktionary pages have multiple 'sections')
Usage
- Install via pip3: pip3 install wiktionary_de_parser
- Import wiktionary_de_parser like this:
from bz2file import BZ2File
from wiktionary_de_parser import Parser
bzfile_path = 'C:/Users/Gregor/Downloads/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz = BZ2File(bzfile_path)
for record in Parser(bz):
    if 'language' not in record or record['language'] != 'Deutsch':
        continue
    # do stuff with 'record'
Note: this example uses BZ2File to read a compressed Wiktionary dump file. Dump files can be downloaded from the Wikimedia dumps site.
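The language check in the loop above can be moved into a small generator so the processing code only sees German entries. This is a minimal sketch, not part of the package: german_records is a hypothetical helper, and the sample list is made-up stand-in data that only mimics the shape of the parser's records.

```python
def german_records(records):
    """Yield only records whose 'language' field is 'Deutsch'."""
    for record in records:
        # .get() covers records that have no 'language' key at all
        if record.get('language') == 'Deutsch':
            yield record


# Stand-in records (hypothetical data, shaped like the parser's output)
sample = [
    {'title': 'Trittbrettfahrer', 'language': 'Deutsch'},
    {'title': 'free rider', 'language': 'Englisch'},
    {'title': 'Hilfe:Übersicht'},  # no 'language' key
]

titles = [record['title'] for record in german_records(sample)]
print(titles)  # ['Trittbrettfahrer']
```

With a real dump, german_records(Parser(bz)) would take the place of the sample list.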
Adding new extraction methods
All extraction methods must return a dict and accept the following arguments:
- title (string): The title of the current Wiktionary page
- text (string): The Wikitext of the current word entry/section
- current_record (dict): A dictionary with all values of the current iteration (e.g. current_record['language'])
# Create a new extraction method
def my_method(title, text, current_record):
    # do stuff
    my_data = ...  # placeholder for the value you extracted
    return {'my_field': my_data}

# Pass a list with all extraction methods to the class constructor:
for record in Parser(bz, custom_methods=[my_method]):
    print(record['my_field'])
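As a more concrete illustration of the signature described above, here is a sketch of an extraction method that collects translations from the Wikitext of a section. The function name extract_translations, the field name 'translations' and the regular expression are assumptions made for this example; they are not part of the package. The pattern only handles the simple two-parameter form {{Ü|lang|word}} and skips empty entries such as {{Ü|es|}}; other template variants are not covered.

```python
import re

# Matches simple translation templates such as {{Ü|en|free rider}}.
# Entries with an empty target (e.g. {{Ü|es|}}) do not match.
TRANSLATION_RE = re.compile(r'\{\{Ü\|([^|}]+)\|([^|}]+)\}\}')


def extract_translations(title, text, current_record):
    """Hypothetical extraction method: map language codes to translated words."""
    translations = {}
    for lang, word in TRANSLATION_RE.findall(text):
        translations.setdefault(lang, []).append(word.strip())
    return {'translations': translations}


# Excerpt in the style of a translation section
sample_text = (
    '*{{en}}: [1] {{Ü|en|free rider}}\n'
    '*{{fi}}: [1] {{Ü|fi|siipeilijä}}, {{Ü|fi|vapaamatkustaja}}\n'
    '*{{es}}: [1] {{Ü|es|}}\n'
)

result = extract_translations('Trittbrettfahrer', sample_text, {})
print(result['translations'])
```

Such a method would be passed to the constructor like any other, for example Parser(bz, custom_methods=[extract_translations]).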
Sample data:
{'flexion': {'Akkusativ Plural': 'Trittbrettfahrer',
'Akkusativ Singular': 'Trittbrettfahrer',
'Dativ Plural': 'Trittbrettfahrern',
'Dativ Singular': 'Trittbrettfahrer',
'Genitiv Plural': 'Trittbrettfahrer',
'Genitiv Singular': 'Trittbrettfahrers',
'Genus': 'm',
'Nominativ Plural': 'Trittbrettfahrer',
'Nominativ Singular': 'Trittbrettfahrer'},
'inflected': False,
'ipa': ['ˈtʁɪtbʁɛtˌfaːʁɐ'],
'language': 'Deutsch',
'lemma': 'Trittbrettfahrer',
'pos': {'Substantiv': []},
'syllables': ['Tritt', 'brett', 'fah', 'rer'],
'title': 'Trittbrettfahrer',
'wikitext': '=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===\n'
'\n'
'{{Deutsch Substantiv Übersicht\n'
'|Genus=m\n'
'|Nominativ Singular=Trittbrettfahrer\n'
'|Nominativ Plural=Trittbrettfahrer\n'
'|Genitiv Singular=Trittbrettfahrers\n'
'|Genitiv Plural=Trittbrettfahrer\n'
'|Dativ Singular=Trittbrettfahrer\n'
'|Dativ Plural=Trittbrettfahrern\n'
'|Akkusativ Singular=Trittbrettfahrer\n'
'|Akkusativ Plural=Trittbrettfahrer\n'
'}}\n'
'\n'
'{{Worttrennung}}\n'
':Tritt·brett·fah·rer, {{Pl.}} Tritt·brett·fah·rer\n'
'\n'
'{{Aussprache}}\n'
':{{IPA}} {{Lautschrift|ˈtʁɪtbʁɛtˌfaːʁɐ}}\n'
':{{Hörbeispiele}} {{Audio|}}\n'
'\n'
'{{Bedeutungen}}\n'
':[1] Person, die ohne [[Anstrengung]] an Vorteilen teilhaben '
'will\n'
'\n'
'{{Herkunft}}\n'
':[[Determinativkompositum]] aus den Substantiven '
"''[[Trittbrett]]'' und ''[[Fahrer]]''\n"
'\n'
'{{Weibliche Wortformen}}\n'
':[1] [[Trittbrettfahrerin]]\n'
'\n'
'{{Beispiele}}\n'
':[1] „Bleibt schließlich noch das Problem der '
"''Trittbrettfahrer,'' die sich ohne Versicherung aus "
'Nachlässigkeit in das soziale Netz abgleiten '
'lassen.“<ref>{{Internetquelle|url=http://books.google.se/books?id=VjLq84xNpfMC&pg=PA446&dq=trittbrettfahrer&hl=de&sa=X&ei=8AztU4aVJYq_ygOd1oKIDA&ved=0CEEQ6AEwBjgK#v=onepage&q=trittbrettfahrer&f=false|titel=Öffentliche '
'Finanzen in der Demokratie: Eine Einführung, Charles B. '
'Blankart|zugriff=2014-08-14}}</ref>\n'
'\n'
'{{Wortbildungen}}\n'
':[1] [[Trittbrettfahrer-Problem]]\n'
'\n'
'==== {{Übersetzungen}} ====\n'
'{{Ü-Tabelle|Ü-links=\n'
'*{{en}}: [1] {{Ü|en|free rider}}\n'
'*{{fi}}: [1] {{Ü|fi|siipeilijä}}, {{Ü|fi|vapaamatkustaja}}\n'
'*{{fr}}: [1] {{Ü|fr|profiteur}}\n'
'|Ü-rechts=\n'
'*{{it}}: [1] {{Ü|it|scroccone}} {{m}}\n'
'*{{es}}: [1] {{Ü|es|}}\n'
'}}\n'
'\n'
'{{Referenzen}}\n'
':[1] {{Wikipedia|Trittbrettfahrer}}\n'
':[*] {{Ref-DWDS|Trittbrettfahrer}}\n'
':[*] {{Ref-Canoo|Trittbrettfahrer}}\n'
':[1] {{Ref-UniLeipzig|Trittbrettfahrer}}\n'
':[1] {{Ref-FreeDictionary|Trittbrettfahrer}}\n'
'\n'
'{{Quellen}}'}
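The flat 'flexion' dictionary in the sample above can be regrouped by case and number for easier lookup. The following is a minimal sketch under the assumption that the relevant keys have the form '<case> <number>' as in the sample; flexion_table is a hypothetical helper, not a function of the package, and keys with any other shape (such as 'Genus') are simply skipped.

```python
def flexion_table(record):
    """Regroup flat keys like 'Genitiv Singular' into {case: {number: form}}."""
    table = {}
    for key, form in record.get('flexion', {}).items():
        parts = key.split()
        if len(parts) != 2:
            continue  # skip entries such as 'Genus'
        case, number = parts
        table.setdefault(case, {})[number] = form
    return table


# Shortened copy of the sample record above
record = {
    'lemma': 'Trittbrettfahrer',
    'flexion': {
        'Genus': 'm',
        'Nominativ Singular': 'Trittbrettfahrer',
        'Genitiv Singular': 'Trittbrettfahrers',
        'Dativ Plural': 'Trittbrettfahrern',
    },
}

table = flexion_table(record)
print(table['Genitiv']['Singular'])  # Trittbrettfahrers
```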
License
MIT © Gregor Weichbrodt