Extracts data from German Wiktionary dump files. Allows you to add your own extraction methods 🚀

wiktionary_de_parser

wiktionary_de_parser is a Python module to extract data from German Wiktionary XML files. It allows you to add your own extraction methods.

Requirements

  • Python 3.7 (might work with other 3.x versions, but this is untested)

Features

  • comes with preset extraction methods for:
    • flexion tables, genus, IPA, language, lemma, part of speech (basic), syllables, raw Wikitext
  • allows you to add your own extraction methods (pass them as argument)
  • data values are normalized and cleaned of obsolete Wikitext markup
  • yields per section, not per page (a word can have multiple meanings, which is why some Wiktionary pages have multiple 'sections')

Usage

  1. Install via pip3 install wiktionary_de_parser.
  2. Import wiktionary_de_parser like this:
from bz2file import BZ2File
from wiktionary_de_parser import Parser

bzfile_path = 'C:/Users/Gregor/Downloads/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz = BZ2File(bzfile_path)

for record in Parser(bz):
    if 'language' not in record or record['language'] != 'Deutsch':
        continue
    # do stuff with 'record'

Note: this example uses BZ2File to read a compressed Wiktionary dump file. Dump files are published on the Wikimedia dumps site.
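
The language check in the loop above can be moved into a small generator so that the loop body only sees German entries. This is a minimal sketch and not part of wiktionary_de_parser: `german_records` is a hypothetical helper, and it only assumes that each record is a dict that may contain a 'language' key, as in the usage example.

```python
from typing import Any, Dict, Iterable, Iterator


def german_records(records: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield only records whose 'language' field is 'Deutsch'."""
    for record in records:
        # .get() also covers records that have no 'language' key at all
        if record.get('language') == 'Deutsch':
            yield record


# Stand-in data for illustration; with the real module you would
# pass Parser(bz) instead of this list.
demo = [
    {'title': 'Haus', 'language': 'Deutsch'},
    {'title': 'house', 'language': 'Englisch'},
    {'title': 'Vorlage'},
]
titles = [record['title'] for record in german_records(demo)]
```

Because the helper accepts any iterable of dicts, it can be tried out on hand-written records before running it against a full dump.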

Adding new extraction methods

Every extraction method must return a dict and accept the following arguments:

  • title (string): The title of the current Wiktionary page
  • text (string): The Wikitext of the current word entry/section
  • current_record (dict): A dictionary with all values extracted so far in the current iteration (e.g. current_record['language'])
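# Create a new extraction method
def my_method(title, text, current_record):
    # do stuff with title, text and current_record, for example:
    my_data = len(text)  # placeholder value, for illustration only
    return {'my_field': my_data}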

# Pass a list with all extraction methods to the class constructor:
for record in Parser(bz, custom_methods=[my_method]):
    print(record['my_field'])
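
As a more concrete illustration of the method signature described above, here is a sketch of an extraction method that collects translations from the `{{Ü|lang|word}}` template used in German Wiktionary translation tables. The function name `extract_translations`, the `translations` field, and the regular expression are assumptions made for this example; they are not preset methods of the module.

```python
import re
from collections import defaultdict

# Matches e.g. {{Ü|en|free rider}}; empty entries such as {{Ü|es|}} are skipped
# because the word part must contain at least one character.
TRANSLATION_RE = re.compile(r'\{\{Ü\|([a-z-]+)\|([^|}]+)\}\}')


def extract_translations(title, text, current_record):
    """Hypothetical extraction method: collect translations per language code."""
    translations = defaultdict(list)
    for lang_code, word in TRANSLATION_RE.findall(text):
        translations[lang_code].append(word.strip())
    return {'translations': dict(translations)}


# Hand-written Wikitext fragment for illustration
sample_text = (
    '*{{en}}: [1] {{Ü|en|free rider}}\n'
    '*{{fi}}: [1] {{Ü|fi|siipeilijä}}, {{Ü|fi|vapaamatkustaja}}\n'
    '*{{es}}: [1] {{Ü|es|}}\n'
)
result = extract_translations('Trittbrettfahrer', sample_text, {})
```

It would be passed to the parser like any other custom method: Parser(bz, custom_methods=[extract_translations]).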

Sample data:

{'flexion': {'Akkusativ Plural': 'Trittbrettfahrer',
             'Akkusativ Singular': 'Trittbrettfahrer',
             'Dativ Plural': 'Trittbrettfahrern',
             'Dativ Singular': 'Trittbrettfahrer',
             'Genitiv Plural': 'Trittbrettfahrer',
             'Genitiv Singular': 'Trittbrettfahrers',
             'Genus': 'm',
             'Nominativ Plural': 'Trittbrettfahrer',
             'Nominativ Singular': 'Trittbrettfahrer'},
 'inflected': False,
 'ipa': ['ˈtʁɪtbʁɛtˌfaːʁɐ'],
 'language': 'Deutsch',
 'lemma': 'Trittbrettfahrer',
 'pos': {'Substantiv': []},
 'syllables': ['Tritt', 'brett', 'fah', 'rer'],
 'title': 'Trittbrettfahrer',
 'wikitext': '=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===\n'
             '\n'
             '{{Deutsch Substantiv Übersicht\n'
             '|Genus=m\n'
             '|Nominativ Singular=Trittbrettfahrer\n'
             '|Nominativ Plural=Trittbrettfahrer\n'
             '|Genitiv Singular=Trittbrettfahrers\n'
             '|Genitiv Plural=Trittbrettfahrer\n'
             '|Dativ Singular=Trittbrettfahrer\n'
             '|Dativ Plural=Trittbrettfahrern\n'
             '|Akkusativ Singular=Trittbrettfahrer\n'
             '|Akkusativ Plural=Trittbrettfahrer\n'
             '}}\n'
             '\n'
             '{{Worttrennung}}\n'
             ':Tritt·brett·fah·rer, {{Pl.}} Tritt·brett·fah·rer\n'
             '\n'
             '{{Aussprache}}\n'
             ':{{IPA}} {{Lautschrift|ˈtʁɪtbʁɛtˌfaːʁɐ}}\n'
             ':{{Hörbeispiele}} {{Audio|}}\n'
             '\n'
             '{{Bedeutungen}}\n'
             ':[1] Person, die ohne [[Anstrengung]] an Vorteilen teilhaben '
             'will\n'
             '\n'
             '{{Herkunft}}\n'
             ':[[Determinativkompositum]] aus den Substantiven '
             "''[[Trittbrett]]'' und ''[[Fahrer]]''\n"
             '\n'
             '{{Weibliche Wortformen}}\n'
             ':[1] [[Trittbrettfahrerin]]\n'
             '\n'
             '{{Beispiele}}\n'
             ':[1] „Bleibt schließlich noch das Problem der '
             "''Trittbrettfahrer,'' die sich ohne Versicherung aus "
             'Nachlässigkeit in das soziale Netz abgleiten '
             'lassen.“<ref>{{Internetquelle|url=http://books.google.se/books?id=VjLq84xNpfMC&pg=PA446&dq=trittbrettfahrer&hl=de&sa=X&ei=8AztU4aVJYq_ygOd1oKIDA&ved=0CEEQ6AEwBjgK#v=onepage&q=trittbrettfahrer&f=false|titel=Öffentliche '
             'Finanzen in der Demokratie: Eine Einführung, Charles B. '
             'Blankart|zugriff=2014-08-14}}</ref>\n'
             '\n'
             '{{Wortbildungen}}\n'
             ':[1] [[Trittbrettfahrer-Problem]]\n'
             '\n'
             '==== {{Übersetzungen}} ====\n'
             '{{Ü-Tabelle|Ü-links=\n'
             '*{{en}}: [1] {{Ü|en|free rider}}\n'
             '*{{fi}}: [1] {{Ü|fi|siipeilijä}}, {{Ü|fi|vapaamatkustaja}}\n'
             '*{{fr}}: [1] {{Ü|fr|profiteur}}\n'
             '|Ü-rechts=\n'
             '*{{it}}: [1] {{Ü|it|scroccone}} {{m}}\n'
             '*{{es}}: [1] {{Ü|es|}}\n'
             '}}\n'
             '\n'
             '{{Referenzen}}\n'
             ':[1] {{Wikipedia|Trittbrettfahrer}}\n'
             ':[*] {{Ref-DWDS|Trittbrettfahrer}}\n'
             ':[*] {{Ref-Canoo|Trittbrettfahrer}}\n'
             ':[1] {{Ref-UniLeipzig|Trittbrettfahrer}}\n'
             ':[1] {{Ref-FreeDictionary|Trittbrettfahrer}}\n'
             '\n'
             '{{Quellen}}'}
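
To show how the fields of such a record might be consumed, the following sketch builds a one-line summary from a record shaped like the sample data. `summarize` is a hypothetical helper, not part of the module, and it assumes that any field other than the title may be missing from a given record.

```python
def summarize(record):
    """Return a one-line summary of a record; missing fields are left out."""
    parts = [record.get('lemma', record.get('title', '?'))]
    genus = record.get('flexion', {}).get('Genus')
    if genus:
        parts.append('({})'.format(genus))
    if record.get('ipa'):
        # 'ipa' is a list; use the first transcription
        parts.append('[{}]'.format(record['ipa'][0]))
    if record.get('syllables'):
        parts.append('·'.join(record['syllables']))
    return ' '.join(parts)


# Shortened copy of the sample record, for illustration
record = {
    'flexion': {'Genus': 'm', 'Genitiv Singular': 'Trittbrettfahrers'},
    'ipa': ['ˈtʁɪtbʁɛtˌfaːʁɐ'],
    'lemma': 'Trittbrettfahrer',
    'syllables': ['Tritt', 'brett', 'fah', 'rer'],
}
summary = summarize(record)
```

Using .get() throughout keeps the helper from raising KeyError on entries that lack a flexion table or pronunciation.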

Vendor packages

License

MIT © Gregor Weichbrodt

Download files

Files for wiktionary-de-parser, version 0.7.7
  • wiktionary_de_parser-0.7.7-py3-none-any.whl (13.8 kB), file type: Wheel, Python version: py3
  • wiktionary_de_parser-0.7.7.tar.gz (14.3 kB), file type: Source, Python version: None