German language support for TextBlob.

German language support for TextBlob by Steven Loria.

This Python package is being developed as a TextBlob Language Extension. See Extension Guidelines for details.

Features

  • All directly accessible textblob_de classes (e.g. Sentence() or Word()) are initialized with default models for German

  • Properties or methods that do not yet work for German raise a NotImplementedError

  • German sentence boundary detection and tokenization (NLTKPunktTokenizer)

  • Consistent use of specified tokenizer for all tools (NLTKPunktTokenizer or PatternTokenizer)

  • Part-of-speech tagging (PatternTagger) with keyword include_punc=True (defaults to False)

  • Parsing (PatternParser) with all pattern keywords, plus pprint=True (defaults to False)

  • Noun Phrase Extraction (PatternParserNPExtractor)

  • Lemmatization (PatternParserLemmatizer)

  • Polarity detection (PatternAnalyzer) - Still EXPERIMENTAL, does not yet have information on subjectivity

  • Full pattern.text.de API support on Python3

  • Supports Python 2 and 3

  • See working features overview for details
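Because unsupported properties and methods raise NotImplementedError instead of returning a placeholder, calling code may want to guard optional features. The following is a minimal sketch and not part of textblob_de: `feature_or_default` is a hypothetical helper, and `FakeBlob` (with its `unsupported` and `some_method` members) is a stand-in object used only so the snippet runs without the package installed.

```python
def feature_or_default(obj, name, default=None):
    """Return obj.<name>, or `default` if it raises NotImplementedError.

    Hypothetical helper for illustration; not part of textblob_de.
    """
    try:
        value = getattr(obj, name)
        # Call methods; plain property values are returned unchanged.
        return value() if callable(value) else value
    except NotImplementedError:
        return default


class FakeBlob(object):
    """Stand-in for a blob object, so the sketch runs without textblob_de."""

    @property
    def tokens(self):
        return ["Das", "Auto"]

    @property
    def unsupported(self):
        # Mimics a property that is not yet available for German.
        raise NotImplementedError("not available for German yet")

    def some_method(self):
        # Mimics a method that is not yet available for German.
        raise NotImplementedError


blob = FakeBlob()
print(feature_or_default(blob, "tokens"))
print(feature_or_default(blob, "unsupported", default=[]))
print(feature_or_default(blob, "some_method", default="n/a"))
```

With a real blob, the same helper could wrap any property whose German support is uncertain. Only NotImplementedError is swallowed, so other errors still surface.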

Installing/Upgrading

$ pip install -U textblob-de
$ python -m textblob.download_corpora

Or install the latest development release (this does not always work on Windows; see issues #1744/5 for details):

$ pip install -U git+https://github.com/markuskiller/textblob-de.git@dev
$ python -m textblob.download_corpora
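To confirm which releases ended up installed after running the commands above, a small check like the following may help. It is a sketch that assumes Python 3.8 or newer (for `importlib.metadata`), which is newer than the interpreters this package lists as requirements. `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version  # Python 3.8+


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names as used with pip.
for name in ("textblob-de", "textblob", "nltk"):
    print(name, installed_version(name) or "not installed")
```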

Usage

>>> from textblob_de import TextBlobDE as TextBlob
>>> text = '''Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen 43. Geburtstag.
Ich muss unbedingt daran denken, Mehl, usw. für einen Kuchen einzukaufen. Aber leider
habe ich nur noch EUR 3.50 in meiner Brieftasche.'''
>>> blob = TextBlob(text)
>>> blob.sentences
[Sentence("Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen 43. Geburtstag."),
 Sentence("Ich muss unbedingt daran denken, Mehl, usw. für einen Kuchen einzukaufen."),
 Sentence("Aber leider habe ich nur noch EUR 3.50 in meiner Brieftasche.")]
>>> blob.tokens
WordList(['Heute', 'ist', 'der', '3.', 'Mai', ...])
>>> blob.tags
[('Heute', 'RB'), ('ist', 'VB'), ('der', 'DT'), ('3.', 'LS'), ('Mai', 'NN'),
('2014', 'CD'), ...]
# Default: Only noun_phrases that consist of two or more meaningful parts are displayed.
# Not perfect, but a start (relies heavily on parser accuracy)
>>> blob.noun_phrases
WordList(['Mai 2014', 'Dr. Meier', 'seinen 43. Geburtstag', 'Kuchen einzukaufen',
'meiner Brieftasche'])
>>> blob = TextBlob("Das Auto ist sehr schön.")
>>> blob.parse()
'Das/DT/B-NP/O Auto/NN/I-NP/O ist/VB/B-VP/O sehr/RB/B-ADJP/O schön/JJ/I-ADJP/O'
>>> from textblob_de import PatternParser
>>> blob = TextBlob("Das ist ein schönes Auto.", parser=PatternParser(pprint=True, lemmata=True))
>>> blob.parse()
      WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

       Das   DT     -       -      -      -      das
       ist   VB     VP      -      -      -      sein
       ein   DT     NP      -      -      -      ein
   schönes   JJ     NP ^    -      -      -      schön
      Auto   NN     NP ^    -      -      -      auto
         .   .      -       -      -      -      .
>>> from textblob_de import PatternTagger
>>> blob = TextBlob("Das Auto ist sehr schön.", pos_tagger=PatternTagger(include_punc=True))
>>> blob.tags
[('Das', 'DT'), ('Auto', 'NN'), ('ist', 'VB'), ('sehr', 'RB'), ('schön', 'JJ'), ('.', '.')]
>>> blob = TextBlob("Das Auto ist sehr schön.")
>>> blob.sentiment
Sentiment(polarity=1.0, subjectivity=0.0)
>>> blob = TextBlob("Das ist ein hässliches Auto.")
>>> blob.sentiment
Sentiment(polarity=-1.0, subjectivity=0.0)
>>> blob.words.lemmatize()
WordList(['das', 'sein', 'ein', 'hässlich', 'Auto'])
>>> from textblob_de.lemmatizers import PatternParserLemmatizer
>>> _lemmatizer = PatternParserLemmatizer()
>>> _lemmatizer.lemmatize("Das ist ein hässliches Auto.")
[('das', 'DT'), ('sein', 'VB'), ('ein', 'DT'), ('hässlich', 'JJ'), ('Auto', 'NN')]
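The plain string returned by blob.parse() above packs several annotations into each slash-separated token. The following sketch shows one way to split such a string into named fields. The field names (word, tag, chunk, pnp) are an assumption based on the four-part tokens in the example output, and `split_tagged` is a hypothetical helper, not part of textblob_de. Output produced with extra parser keywords such as lemmata=True has more fields and would need a different split.

```python
from collections import namedtuple

# Field names are an assumption based on the four-part tokens shown above.
Token = namedtuple("Token", ["word", "tag", "chunk", "pnp"])


def split_tagged(tagged):
    """Split a 'word/TAG/CHUNK/PNP' string into a list of Token tuples."""
    tokens = []
    for item in tagged.split():
        # Splitting from the right keeps any extra '/' inside the word part.
        word, tag, chunk, pnp = item.rsplit("/", 3)
        tokens.append(Token(word, tag, chunk, pnp))
    return tokens


# The string is copied from the blob.parse() example above.
tagged = ("Das/DT/B-NP/O Auto/NN/I-NP/O ist/VB/B-VP/O "
          "sehr/RB/B-ADJP/O schön/JJ/I-ADJP/O")
tokens = split_tagged(tagged)
for tok in tokens:
    print(tok.word, tok.tag, tok.chunk)

# Words that belong to a noun phrase chunk (B-NP or I-NP).
np_words = [tok.word for tok in tokens if tok.chunk.endswith("-NP")]
print(np_words)
```

Working on the string output keeps the sketch independent of the installed parser. For real use, the tags and noun_phrases properties shown earlier are the more direct route.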

Access to pattern API in Python3

>>> from textblob_de.packages import pattern_de as pd
>>> print(pd.attributive("neugierig", gender=pd.FEMALE, role=pd.INDIRECT, article="die"))
neugierigen

Documentation and API Reference

Requirements

  • Python >= 2.6 or >= 3.3

TODO

  • Planned Extensions

  • Additional PoS tagging options, e.g. NLTK tagging (NLTKTagger)

  • Improve noun phrase extraction (e.g. based on RFTagger output)

  • Improve sentiment analysis (find suitable subjectivity scores)

  • Improve functionality of Sentence() and Word() objects

  • Adapt more tests from the main TextBlob library (esp. for TextBlobDE() in test_blob.py)

License

MIT licensed. See the bundled LICENSE file for more details.

Thanks

Coded with Wing IDE 5.0 (free open source developer license)

Changelog

0.4.1 (03/10/2014)

  • Removed dependency on nltk’s deprecated PunktWordTokenizer and replaced it with TreebankWordTokenizer; see nltk/nltk#746 (comment) for details

0.4.0 (17/09/2014)

  • Fixed Issue #7 (restore textblob>=0.9.0 compatibility)

  • Depend on nltk3. Vendorized nltk was removed in textblob>=0.9.0

  • Fixed ImportError on Python2 (unicodecsv)

0.3.1 (29/08/2014)

  • Improved PatternParserNPExtractor (fewer false positives in verb filter)

  • Made sure that all keyword arguments with default None are checked with is not None

  • Fixed shortcut to _pattern.de in vendorized library

  • Added Makefile to facilitate development process

  • Added docs and API reference

0.3.0 (14/08/2014)

  • Fixed Issue #5 (text + space + period)

0.2.9 (14/08/2014)

  • Fixed tokenization in PatternParser (if initialized manually, punctuation was not always separated from words)

  • Improved handling of empty strings (Issue #3) and of strings containing single punctuation marks (Issue #4) in PatternTagger and PatternParser

  • Added tests for empty strings and for strings containing single punctuation marks

0.2.8 (14/08/2014)

0.2.7 (13/08/2014)

  • Fixed Issue #1 lemmatization of strings containing a forward slash (/)

  • Enhancement Issue #2 use the same rtype as textblob for sentiment detection.

  • Fixed tokenization in PatternParserLemmatizer

0.2.6 (04/08/2014)

  • Fixed MANIFEST.in for package data in sdist

0.2.5 (04/08/2014)

  • sdist is non-functional as important files are missing due to a misconfiguration in MANIFEST.in - does not affect wheels

  • Major internal refactoring (but no backwards-incompatible API changes) with the aim of restoring complete compatibility to original pattern>=2.6 library on Python2

  • Separation of textblob and pattern code

  • On Python2 the vendorized version of pattern.text.de is only used if the original is not installed (same as nltk)

  • Made pattern.de.pprint function and all parser keywords accessible to customise parser output

  • Access to complete pattern.text.de API on Python2 and Python3 from textblob_de.packages import pattern_de as pd

  • tox passed on all major platforms (Win/Linux/OSX)

0.2.3 (26/07/2014)

  • Lemmatizer: PatternParserLemmatizer() extracts lemmata from Parser output

  • Improved polarity analysis through look-up of lemmatised word forms

0.2.2 (22/07/2014)

  • Option: Include punctuation in tags/pos_tags properties (b = TextBlobDE(text, tagger=PatternTagger(include_punc=True)))

  • Added BlobberDE() class initialized with German models

  • TextBlobDE(), Sentence(), WordList() and Word() classes are now all initialized with German models

  • Restored complete API compatibility with textblob.tokenizers module of the main TextBlob library

0.2.1 (20/07/2014)

  • Noun Phrase Extraction: PatternParserNPExtractor() extracts NPs from Parser output

  • Refactored the way TextBlobDE() passes on arguments and keyword arguments to individual tools

  • Backwards-incompatible: Deprecate parser_show_lemmata=True keyword in TextBlob(). Use parser=PatternParser(lemmata=True) instead.

0.2.0 (18/07/2014)

  • vastly improved tokenization (NLTKPunktTokenizer and PatternTokenizer with tests)

  • consistent use of specified tokenizer for all tools

  • TextBlobDE with initialized default models for German

  • Parsing (PatternParser) plus test_parsers.py

  • EXPERIMENTAL implementation of Polarity detection (PatternAnalyzer)

  • first attempt at extracting German Polarity clues into de-sentiment.xml

  • tox tests passing for py26, py27, py33 and py34

0.1.3 (09/07/2014)

  • First release on PyPI

0.1.0 - 0.1.2 (09/07/2014)

  • First release on GitHub

  • A number of experimental releases for testing purposes

  • Adapted version badges, tests & travis-ci config

  • Code adapted from sample extension textblob-fr

  • Language specific linguistic resources copied from pattern-de
