German language support for TextBlob.

German language support for TextBlob by Steven Loria.

This Python package is being developed as a TextBlob Language Extension. See Extension Guidelines for details.

Features

  • All directly accessible textblob_de classes (e.g. Sentence() or Word()) are initialized with default models for German

  • Properties or methods that do not yet work for German raise a NotImplementedError

  • German sentence boundary detection and tokenization (NLTKPunktTokenizer)

  • Consistent use of specified tokenizer for all tools (NLTKPunktTokenizer or PatternTokenizer)

  • Part-of-speech tagging (PatternTagger) with keyword include_punc=True (defaults to False)

  • Parsing (PatternParser) with all pattern keywords, plus pprint=True (defaults to False)

  • Noun Phrase Extraction (PatternParserNPExtractor)

  • Lemmatization (PatternParserLemmatizer)

  • Polarity detection (PatternAnalyzer) - Still EXPERIMENTAL, does not yet have information on subjectivity

  • NEW: Full pattern.text.de API support on Python3

  • Supports Python 2 and 3

  • See working features overview for details
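Because unsupported properties raise NotImplementedError rather than returning a placeholder, calling code may want to guard access to them. The following is a minimal sketch of such a guard, not part of textblob_de: `get_or_default` and the `FakeBlob` stand-in class are hypothetical names introduced only for illustration, so that the snippet runs without the package installed.

```python
def get_or_default(obj, attr, default=None):
    """Return obj.<attr>, or `default` if reading it raises NotImplementedError."""
    try:
        return getattr(obj, attr)
    except NotImplementedError:
        return default


class FakeBlob(object):
    """Hypothetical stand-in for a blob object; NOT a textblob_de class."""

    @property
    def tokens(self):
        return ["Das", "Auto"]

    @property
    def unsupported(self):
        # Mimics a property that is not yet available for German.
        raise NotImplementedError


blob = FakeBlob()
print(get_or_default(blob, "tokens"))
print(get_or_default(blob, "unsupported", default="n/a"))
```

With a real blob, the same helper could wrap any property listed as not yet working in the features overview.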

Installing/Upgrading

$ pip install -U textblob-de
$ python -m textblob.download_corpora

Or install the latest development release (this apparently does not always work on Windows; see issues #1744/5 for details):

$ pip install -U git+https://github.com/markuskiller/textblob-de.git@dev
$ python -m textblob.download_corpora
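After installing, a quick way to confirm that both packages are visible to the interpreter is to look them up without importing them. This is a small sketch using only the standard library (Python 3.4 or later); the helper name `is_installed` is an assumption made for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Report on the two top-level packages the installation commands should provide.
for name in ("textblob", "textblob_de"):
    status = "found" if is_installed(name) else "missing"
    print("{}: {}".format(name, status))
```

If either package is reported as missing, re-running the pip command in the same environment as the interpreter is the usual first step.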

Usage

>>> from textblob_de import TextBlobDE as TextBlob
>>> text = '''Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen 43. Geburtstag.
Ich muss unbedingt daran denken, Mehl, usw. für einen Kuchen einzukaufen. Aber leider
habe ich nur noch EUR 18.50 in meiner Brieftasche.'''
>>> blob = TextBlob(text)
>>> blob.sentences
[Sentence("Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen 43. Geburtstag."),
 Sentence("Ich muss unbedingt daran denken, Mehl, usw. für einen Kuchen einzukaufen."),
 Sentence("Aber leider habe ich nur noch EUR 18.50 in meiner Brieftasche.")]
>>> blob.tokens
WordList(['Heute', 'ist', 'der', '3.', 'Mai', ...])
>>> blob.tags
[('Heute', 'RB'), ('ist', 'VB'), ('der', 'DT'), ('3.', 'LS'), ('Mai', 'NN'),
('2014', 'CD'), ...]
# Default: Only noun_phrases that consist of two or more meaningful parts are displayed.
# Not perfect, but a start (relies heavily on parser accuracy)
>>> blob.noun_phrases
WordList(['Mai 2014', 'Dr. Meier', 'seinen 43. Geburtstag', 'Kuchen einzukaufen',
'meiner Brieftasche'])
>>> blob = TextBlob("Das Auto ist sehr schön.")
>>> blob.parse()
'Das/DT/B-NP/O Auto/NN/I-NP/O ist/VB/B-VP/O sehr/RB/B-ADJP/O schön/JJ/I-ADJP/O'
>>> from textblob_de import PatternParser
>>> blob = TextBlob(u"Das ist ein schönes Auto.", parser=PatternParser(pprint=True, lemmata=True))
>>> blob.parse()
#          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMM
#
#       Das   DT     -       -      -      -      das
#       ist   VB     VP      -      -      -      sein
#       ein   DT     NP      -      -      -      ein
#   schönes   JJ     NP ^    -      -      -      schö
#      Auto   NN     NP ^    -      -      -      auto
#         .   .      -       -      -      -      .
>>> from textblob_de import PatternTagger
>>> blob = TextBlob("Das Auto ist sehr schön.", pos_tagger=PatternTagger(include_punc=True))
>>> blob.tags
[('Das', 'DT'), ('Auto', 'NN'), ('ist', 'VB'), ('sehr', 'RB'), ('schön', 'JJ'), ('.', '.')]
>>> blob = TextBlob("Das Auto ist sehr schön.")
>>> blob.sentiment
(1.0, 0.0)
>>> blob = TextBlob("Das ist ein hässliches Auto.")
>>> blob.sentiment
(-1.0, 0.0)
>>> blob.words.lemmatize()
WordList(['das', 'sein', 'ein', 'hässlich', 'Auto'])
>>> from textblob_de.lemmatizers import PatternParserLemmatizer
>>> _lemmatizer = PatternParserLemmatizer()
>>> _lemmatizer.lemmatize("Das ist ein hässliches Auto.")
[('das', 'DT'), ('sein', 'VB'), ('ein', 'DT'), ('hässlich', 'JJ'), ('Auto', 'NN')]
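The string returned by blob.parse() in the usage example above can be post-processed with plain Python. The sketch below assumes that each token carries exactly four slash-separated fields (word, part-of-speech tag, chunk tag, PNP tag) and that chunk tags use B-/I- prefixes, as in the sample output; the names `Token`, `split_parse` and `chunks` are introduced here for illustration and are not part of textblob_de.

```python
from collections import namedtuple

# Field names are an assumption based on the sample output shown above.
Token = namedtuple("Token", ["word", "pos", "chunk", "pnp"])


def split_parse(parsed):
    """Split a slash-delimited parse string into Token tuples."""
    return [Token(*item.split("/")) for item in parsed.split()]


def chunks(tokens):
    """Group words into (chunk_type, [words]) pairs using the B-/I- prefixes."""
    groups = []
    for tok in tokens:
        prefix, _, kind = tok.chunk.partition("-")
        if prefix == "I" and groups and groups[-1][0] == kind:
            groups[-1][1].append(tok.word)   # continue the current chunk
        elif prefix in ("B", "I"):
            groups.append((kind, [tok.word]))  # start a new chunk
        # tokens tagged "O" belong to no chunk and are skipped
    return groups


parsed = "Das/DT/B-NP/O Auto/NN/I-NP/O ist/VB/B-VP/O sehr/RB/B-ADJP/O schön/JJ/I-ADJP/O"
tokens = split_parse(parsed)
print(tokens[1])
print(chunks(tokens))
```

Tokens with a different number of fields (for example when lemmata=True is passed to the parser) would need a wider tuple, so this is a starting point rather than a general parser.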

Access to pattern API in Python3

>>> from textblob_de.packages import pattern_de as pd
>>> print(pd.attributive("neugierig", gender=pd.FEMALE, role=pd.INDIRECT, article="die"))
neugierigen

Requirements

  • Python >= 2.6 or >= 3.3

TODO

  • TextBlob Extension: textblob-rftagger (wrapper class for RFTagger)

  • TextBlob Extension: textblob-cmd (command-line wrapper for TextBlob, basically TextBlob for files)

  • TextBlob Extension: textblob-stanfordparser (wrapper class for StanfordParser via NLTK)

  • TextBlob Extension: textblob-berkeleyparser (wrapper class for BerkeleyParser)

  • TextBlob Extension: textblob-sent-align (sentence alignment for parallel TextBlobs)

  • TextBlob Extension: textblob-converters (various input and output conversions)

  • Additional PoS tagging options, e.g. NLTK tagging (NLTKTagger)

  • Improve noun phrase extraction (e.g. based on RFTagger output)

  • Improve sentiment analysis (find suitable subjectivity scores)

  • Improve functionality of Sentence() and Word() objects

  • Adapt more tests from textblob main package (esp. for TextBlobDE() in test_blob.py)

License

MIT licensed. See the bundled LICENSE file for more details.

Changelog

0.2.4 (04/08/2014)

  • Major internal refactoring (but no backwards-incompatible API changes) with the aim of restoring complete compatibility with the original pattern>=2.6 library on Python2

  • Separation of textblob and pattern code

  • On Python2, the vendorized version of pattern.text.de is used only if the original is not installed (same as nltk)

  • Made pattern.de.pprint function and all parser keywords accessible to customise parser output

  • Access to the complete pattern.text.de API on Python2 and Python3 via from textblob_de.packages import pattern_de as pd

  • tox passed on all major platforms (Win/Linux/OSX)

0.2.3 (26/07/2014)

  • Lemmatizer: PatternParserLemmatizer() extracts lemmata from Parser output

  • Improved polarity analysis through look-up of lemmatised word forms

0.2.2 (22/07/2014)

  • Option: Include punctuation in tags/pos_tags properties (b = TextBlobDE(text, tagger=PatternTagger(include_punc=True)))

  • Added BlobberDE() class initialized with German models

  • TextBlobDE(), Sentence(), WordList() and Word() classes are now all initialized with German models

  • Restored complete API compatibility with textblob.tokenizers module of textblob main package

0.2.1 (20/07/2014)

  • Noun Phrase Extraction: PatternParserNPExtractor() extracts NPs from Parser output

  • Refactored the way TextBlobDE() passes on arguments and keyword arguments to individual tools

  • Backwards-incompatible: Deprecate parser_show_lemmata=True keyword in TextBlob(). Use parser=PatternParser(lemmata=True) instead.

0.2.0 (18/07/2014)

  • vastly improved tokenization (NLTKPunktTokenizer and PatternTokenizer with tests)

  • consistent use of specified tokenizer for all tools

  • TextBlobDE with initialized default models for German

  • Parsing (PatternParser) plus test_parsers.py

  • EXPERIMENTAL implementation of Polarity detection (PatternAnalyzer)

  • first attempt at extracting German Polarity clues into de-sentiment.xml

  • tox tests passing for py26, py27, py33 and py34

0.1.3 (09/07/2014)

  • First release on PyPI

0.1.0 - 0.1.2 (09/07/2014)

  • First release on github

  • A number of experimental releases for testing purposes

  • Adapted version badges, tests & travis-ci config

  • Code adapted from sample extension textblob-fr

  • Language specific linguistic resources copied from pattern-de

