Skip to main content

German language support for TextBlob.

Project description

Latest version Travis-CI Number of PyPI downloads

German language support for TextBlob by Steven Loria.

This python package is being developed as a TextBlob Language Extension. See Extension Guidelines for details.

Features

  • All directly accessible textblob_de classes (e.g. Sentence() or Word()) are initialized with default models for German

  • Properties or methods that do not yet work for German raise a NotImplementedError

  • German sentence boundary detection and tokenization (NLTKPunktTokenizer)

  • Consistent use of specified tokenizer for all tools (NLTKPunktTokenizer or PatternTokenizer)

  • Part-of-speech tagging (PatternTagger) with keyword include_punc=True (defaults to False)

  • Parsing (PatternParser) with all pattern keywords, plus pprint=True (defaults to False)

  • Noun Phrase Extraction (PatternParserNPExtractor)

  • Lemmatization (PatternParserLemmatizer)

  • Polarity detection (PatternAnalyzer) - Still EXPERIMENTAL, does not yet have information on subjectivity

  • NEW: Full pattern.text.de API support on Python3

  • Supports Python 2 and 3

  • See working features overview for details

Installing/Upgrading

$ pip install -U textblob-de
$ python -m textblob.download_corpora

Or the latest development release (apparently this does not always work on Windows see issues #1744/5 for details):

$ pip install -U git+https://github.com/markuskiller/textblob-de.git@dev
$ python -m textblob.download_corpora

Usage

>>> from textblob_de import TextBlobDE as TextBlob
>>> text = '''Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen 43. Geburtstag.
Ich muss unbedingt daran denken, Mehl, usw. für einen Kuchen einzukaufen. Aber leider
habe ich nur noch EUR 18.50 in meiner Brieftasche.'''
>>> blob = TextBlob(text)
>>> blob.sentences
[Sentence("Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen 43. Geburtstag."),
 Sentence("Ich muss unbedingt daran denken, Mehl, usw. für einen Kuchen einzukaufen."),
 Sentence("Aber leider habe ich nur noch EUR 18.50 in meiner Brieftasche.")]
>>> blob.tokens
WordList(['Heute', 'ist', 'der', '3.', 'Mai', ...]
>>> blob.tags
[('Heute', 'RB'), ('ist', 'VB'), ('der', 'DT'), ('3.', 'LS'), ('Mai', 'NN'),
('2014', 'CD'), ...]
# Default: Only noun_phrases that consist of two or more meaningful parts are displayed.
# Not perfect, but a start (relies heavily on parser accuracy)
>>> blob.noun_phrases
WordList(['Mai 2014', 'Dr. Meier', 'seinen 43. Geburtstag', 'Kuchen einzukaufen',
'meiner Brieftasche'])
>>> blob = TextBlob("Das Auto ist sehr schön.")
>>> blob.parse()
'Das/DT/B-NP/O Auto/NN/I-NP/O ist/VB/B-VP/O sehr/RB/B-ADJP/O schön/JJ/I-ADJP/O'
>>> from textblob_de import PatternParser
>>> blob = TextBlobDE(u"Das ist ein schönes Auto.", parser=PatternParser(pprint=True, lemmata=True))
>>> blob.parse()
      WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

       Das   DT     -       -      -      -      das
       ist   VB     VP      -      -      -      sein
       ein   DT     NP      -      -      -      ein
   schönes   JJ     NP ^    -      -      -      schön
      Auto   NN     NP ^    -      -      -      auto
         .   .      -       -      -      -      .
>>> from textblob_de import PatternTagger
>>> blob = TextBlob(text, pos_tagger=PatternTagger(include_punc=True))
[('Das', 'DT'), ('Auto', 'NN'), ('ist', 'VB'), ('sehr', 'RB'), ('schön', 'JJ'), ('.', '.')]
>>> blob = TextBlob("Das Auto ist sehr schön.")
>>> blob.sentiment
(1.0, 0.0)
>>> blob = TextBlob("Das ist ein hässliches Auto.")
>>> blob.sentiment
(-1.0, 0.0)
>>> blob.words.lemmatize()
WordList(['das', 'sein', 'ein', 'hässlich', 'Auto'])
>>> from textblob_de.lemmatizers import PatternParserLemmatizer
>>> _lemmatizer = PatternParserLemmatizer()
>>> _lemmatizer.lemmatize("Das ist ein hässliches Auto.")
[('das', 'DT'), ('sein', 'VB'), ('ein', 'DT'), ('hässlich', 'JJ'), ('Auto', 'NN')]

Access to pattern API in Python3

>>> from textblob_de.packages import pattern_de as pd
>>> print(pd.attributive("neugierig", gender=pd.FEMALE, role=pd.INDIRECT, article="die"))
neugierigen

Requirements

  • Python >= 2.6 or >= 3.3

TODO

  • TextBlob Extension: textblob-rftagger (wrapper class for RFTagger)

  • TextBlob Extension: textblob-cmd (command-line wrapper for TextBlob, basically TextBlob for files

  • TextBlob Extension: textblob-stanfordparser (wrapper class for StanfordParser via NLTK)

  • TextBlob Extension: textblob-berkeleyparser (wrapper class for BerkeleyParser)

  • TextBlob Extension: textblob-sent-align (sentence alignment for parallel TextBlobs)

  • TextBlob Extension: textblob-converters (various input and output conversions)

  • Additional PoS tagging options, e.g. NLTK tagging (NLTKTagger)

  • Improve noun phrase extraction (e.g. based on RFTagger output)

  • Improve sentiment analysis (find suitable subjectivity scores)

  • Improve functionality of Sentence() and Word() objects

  • Adapt more tests from textblob main package (esp. for TextBlobDE() in test_blob.py)

License

MIT licensed. See the bundled LICENSE file for more details.

Changelog

0.2.5 (04/08/2014)

  • Major internal refactoring (but no backwards-incompatible API changes) with the aim of restoring complete compatibility to original pattern>=2.6 library on Python2

  • Separation of textblob and pattern code

  • On Python2 the vendorized version of pattern.text.de is only used, if original is not installed (same as nltk)

  • Made pattern.de.pprint function and all parser keywords accessible to customise parser output

  • Access to complete pattern.text.de API on Python2 and Python3 from textblob_de.packages import pattern_de as pd

  • tox passed on all major platforms (Win/Linux/OSX)

0.2.3 (26/07/2014)

  • Lemmatizer: PatternParserLemmatizer() extracts lemmata from Parser output

  • Improved polarity analysis through look-up of lemmatised word forms

0.2.2 (22/07/2014)

  • Option: Include punctuation in tags/pos_tags properties (b = TextBlobDE(text, tagger=PatternTagger(include_punc=True)))

  • Added BlobberDE() class initialized with German models

  • TextBlobDE(), Sentence(), WordList() and Word() classes are now all initialized with German models

  • Restored complete API compatibility with textblob.tokenizers module of textblob main package

0.2.1 (20/07/2014)

  • Noun Phrase Extraction: PatternParserNPExtractor() extracts NPs from Parser output

  • Refactored the way TextBlobDE() passes on arguments and keyword arguments to individual tools

  • Backwards-incompatible: Deprecate parser_show_lemmata=True keyword in TextBlob(). Use parser=PatternParser(lemmata=True) instead.

0.2.0 (18/07/2014)

  • vastly improved tokenization (NLTKPunktTokenizer and PatternTokenizer with tests)

  • consistent use of specified tokenizer for all tools

  • TextBlobDE with initialized default models for German

  • Parsing (PatternParser) plus test_parsers.py

  • EXPERIMENTAL implementation of Polarity detection (PatternAnalyzer)

  • first attempt at extracting German Polarity clues into de-sentiment.xml

  • tox tests passing for py26, py27, py33 and py34

0.1.3 (09/07/2014)

  • First release on PyPI

0.1.0 - 0.1.2 (09/07/2014)

  • First release on github

  • A number of experimental releases for testing purposes

  • Adapted version badges, tests & travis-ci config

  • Code adapted from sample extension textblob-fr

  • Language specific linguistic resources copied from pattern-de

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

textblob-de-0.2.5.tar.gz (27.1 kB view details)

Uploaded Source

Built Distribution

textblob_de-0.2.5-py2.py3-none-any.whl (1.0 MB view details)

Uploaded Python 2 Python 3

File details

Details for the file textblob-de-0.2.5.tar.gz.

File metadata

  • Download URL: textblob-de-0.2.5.tar.gz
  • Upload date:
  • Size: 27.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for textblob-de-0.2.5.tar.gz
Algorithm Hash digest
SHA256 c6fdc88d565fc1e114c3b89492bb24ba9c32c96b63e9d5822f81bf2ce319ef66
MD5 36d5dafe5c3140dfa7c831cf76c43f58
BLAKE2b-256 88ed77f5863d0aba34353e6fc8f660f1f3723ca62ae0e7405853c2990e5e7eb7

See more details on using hashes here.

File details

Details for the file textblob_de-0.2.5-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for textblob_de-0.2.5-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 6eb5f000c6b282ffbe9029744824c4cf962fdb7959adfa31b3518ae2134d0a7d
MD5 b2c6930626e2f0c4c759f6936424094f
BLAKE2b-256 3d3500710dacf341940975c0eca2b3b143b6106bb5fe61829785b21ee6ee05b2

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page