Simple, Pythonic text processing. Sentiment analysis, POS tagging, noun phrase parsing, and more.

TextBlob

Simplified text processing for Python 2 and 3.

Requirements

  • Python >= 2.6 or >= 3.3

Installation

If you don’t have pip (you should), run this first: curl https://raw.github.com/pypa/pip/master/contrib/get-pip.py | python

Option 1

Choose this option if you:

  • Want a quick install.

  • Don’t have nltk currently installed, or don’t mind if your current installation is overridden by the latest version on the master branch. NOTE: You can avoid this by installing textblob in a virtualenv.

pip install -U textblob
curl https://raw.github.com/sloria/TextBlob/master/download_corpora.py | python

This will install textblob and download the necessary NLTK corpora.

Option 2

Choose this option if you:

  • Don’t want your local nltk installation to be overridden.

  • Want to keep your nltk on the bleeding edge of development.

pip install -U git+https://github.com/nltk/nltk
pip install -U git+https://github.com/sloria/TextBlob.git@no-bundle
curl https://raw.github.com/sloria/TextBlob/master/download_corpora.py | python

This will install the latest NLTK from the master branch, install textblob from the no-bundle branch, and download the necessary corpora.

Usage

Simple.

Create a TextBlob

from text.blob import TextBlob

wikitext = '''
Python is a widely used general-purpose, high-level programming language.
Its design philosophy emphasizes code readability, and its syntax allows
programmers to express concepts in fewer lines of code than would be
possible in languages such as C.
'''

wiki = TextBlob(wikitext)

Part-of-speech tags and noun phrases…

...are just properties.

wiki.pos_tags       # [(Word('Python'), 'NNP'), (Word('is'), 'VBZ'),
                    #  (Word('a'), 'DT'), (Word('widely'), 'RB')...]

wiki.noun_phrases   # WordList(['python', 'design philosophy', 'code readability'])

Note: The first access to noun_phrases may take a few seconds because the noun phrase chunker needs to be trained. Subsequent calls to noun_phrases will be quick, since all TextBlobs share the same instance of the noun phrase chunker.

Sentiment analysis

The sentiment property returns a tuple of the form (polarity, subjectivity) where polarity ranges from -1.0 to 1.0 and subjectivity ranges from 0.0 to 1.0.

testimonial = TextBlob("Textblob is amazingly simple to use. What great fun!")
testimonial.sentiment        # (0.4583333333333333, 0.4357142857142857)
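
Because the result is a plain tuple, it can be unpacked directly. The sketch below works on the values from the example above rather than calling TextBlob, so it runs without the library. The helper name describe_sentiment and its thresholds (0.1 and 0.5) are illustrative assumptions, not part of TextBlob.

```python
def describe_sentiment(sentiment, neutral_band=0.1):
    """Turn a (polarity, subjectivity) tuple into a short label.

    Hypothetical helper: the thresholds are arbitrary choices made
    for illustration only.
    """
    polarity, subjectivity = sentiment
    if polarity > neutral_band:
        tone = "positive"
    elif polarity < -neutral_band:
        tone = "negative"
    else:
        tone = "neutral"
    stance = "subjective" if subjectivity >= 0.5 else "objective"
    return "{0}, {1}".format(tone, stance)


# Values copied from the testimonial example; in real use this would be
# testimonial.sentiment.
example = (0.4583333333333333, 0.4357142857142857)
label = describe_sentiment(example)
print(label)
```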

Tokenization

zen = TextBlob("Beautiful is better than ugly. "
                "Explicit is better than implicit. "
                "Simple is better than complex.")

zen.words            # WordList(['Beautiful', 'is', 'better'...])

zen.sentences        # [Sentence('Beautiful is better than ugly.'),
                      #  Sentence('Explicit is better than implicit.'),
                      #  ...]

for sentence in zen.sentences:
    print(sentence.sentiment)

Words and inflection

Each word in TextBlob.words or Sentence.words is a Word object (a subclass of unicode on Python 2, str on Python 3) with useful methods, e.g. for word inflection.

sentence = TextBlob('Use 4 spaces per indentation level.')
sentence.words
# OUT: WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])
sentence.words[2].singularize()
# OUT: 'space'
sentence.words[-1].pluralize()
# OUT: 'levels'

Get word and noun phrase frequencies

wiki.word_counts['its']   # 2 (not case-sensitive by default)
wiki.words.count('its')   # Same thing
wiki.words.count('its', case_sensitive=True)  # 1

wiki.noun_phrases.count('code readability')  # 1
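
To make the case-sensitivity behaviour concrete, here is a rough standard-library sketch of the same counts. It is not TextBlob's implementation: count_word is a hypothetical helper, and the regex tokenizer is an assumption that happens to be good enough for this text.

```python
import re

wikitext = '''
Python is a widely used general-purpose, high-level programming language.
Its design philosophy emphasizes code readability, and its syntax allows
programmers to express concepts in fewer lines of code than would be
possible in languages such as C.
'''


def count_word(text, word, case_sensitive=False):
    """Count occurrences of a word, ignoring case unless told otherwise."""
    # Naive tokenizer: runs of word characters (an assumption for this sketch).
    tokens = re.findall(r"\w+", text)
    if not case_sensitive:
        tokens = [token.lower() for token in tokens]
        word = word.lower()
    return tokens.count(word)


insensitive = count_word(wikitext, 'its')                     # "Its" and "its"
sensitive = count_word(wikitext, 'its', case_sensitive=True)  # only "its"
print(insensitive, sensitive)
```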

TextBlobs are like Python strings!

zen[0:19]            # TextBlob("Beautiful is better")
zen.upper()          # TextBlob("BEAUTIFUL IS BETTER THAN UGLY...")
zen.find("Simple")   # 65

apple_blob = TextBlob('apples')
banana_blob = TextBlob('bananas')
apple_blob < banana_blob           # True
apple_blob + ' and ' + banana_blob # TextBlob('apples and bananas')
"{0} and {1}".format(apple_blob, banana_blob)  # 'apples and bananas'

Get start and end indices of sentences

Use sentence.start and sentence.end. This can be useful for sentence highlighting, for example.

for sentence in zen.sentences:
    print(sentence)  # Beautiful is better than ugly.
    print("---- Starts at index {0}, Ends at index {1}"
          .format(sentence.start, sentence.end))  # 0, 30
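
As a sketch of the highlighting use case, the indices can be used to slice the original string. The example below hard-codes the first sentence's indices (0 and 30) from the output above so it runs without TextBlob. The highlight function and the "**" marker are hypothetical choices for illustration.

```python
def highlight(text, start, end, marker="**"):
    """Wrap text[start:end] in a marker, leaving the rest untouched."""
    return text[:start] + marker + text[start:end] + marker + text[end:]


raw = ("Beautiful is better than ugly. "
       "Explicit is better than implicit. "
       "Simple is better than complex.")

# With TextBlob, start and end would come from sentence.start and
# sentence.end; here they are the values shown for the first sentence.
highlighted = highlight(raw, 0, 30)
print(highlighted)
```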

Get a JSON-serialized version of the blob

zen.json   # '[{"sentiment": [0.2166666666666667, 0.8333333333333334],
           #   "stripped": "beautiful is better than ugly",
           #   "noun_phrases": ["beautiful"],
           #   "raw": "Beautiful is better than ugly. ",
           #   "end_index": 30, "start_index": 0},
           #  ...]'
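
Since the value is a JSON string, it can be read back with the standard json module. The sketch below uses a hand-written string that mirrors the shape shown above for the first sentence only. With TextBlob you would pass zen.json instead; the variable names are illustrative.

```python
import json

# Hand-copied from the example output above (first sentence only).
serialized = (
    '[{"sentiment": [0.2166666666666667, 0.8333333333333334], '
    '"stripped": "beautiful is better than ugly", '
    '"noun_phrases": ["beautiful"], '
    '"raw": "Beautiful is better than ugly. ", '
    '"end_index": 30, "start_index": 0}]'
)

records = json.loads(serialized)  # a list with one dict per sentence
for record in records:
    polarity, subjectivity = record["sentiment"]
    print(record["start_index"], record["end_index"],
          polarity, record["raw"].strip())
```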

Advanced usage

Noun Phrase Chunkers

TextBlob currently has two noun phrase chunker implementations: text.np_extractors.FastNPExtractor (the default, based on Shlomi Babluki’s implementation from this blog post) and text.np_extractors.ConllExtractor, which uses the CoNLL 2000 corpus to train a tagger.

You can change the chunker implementation (or even use your own) by explicitly passing an instance of a noun phrase extractor to a TextBlob’s constructor.

from text.blob import TextBlob
from text.np_extractors import ConllExtractor

extractor = ConllExtractor()
blob = TextBlob("Extract my noun phrases.", np_extractor=extractor)
blob.noun_phrases  # This will use the Conll2000 noun phrase extractor
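
If you want to try your own extractor, a minimal sketch might look like the following. It assumes that TextBlob only needs an object with an extract(text) method returning a list of phrase strings; that interface is an assumption here, so check text.np_extractors in your installed version before relying on it. The class name and the capitalized-word heuristic are hypothetical and deliberately naive.

```python
import re


class CapitalizedRunExtractor(object):
    """Hypothetical extractor: treats runs of capitalized words as phrases.

    Assumes the only required interface is extract(text) -> list of
    strings; verify this against text.np_extractors.
    """

    def extract(self, text):
        # One or more capitalized words separated by whitespace.
        pattern = r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"
        return [match.lower() for match in re.findall(pattern, text)]


extractor = CapitalizedRunExtractor()
phrases = extractor.extract(
    "Guido van Rossum created Python at Centrum Wiskunde.")
print(phrases)

# Under the assumption above, the instance would be passed the same way as
# ConllExtractor: TextBlob("...", np_extractor=extractor)
```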

POS Taggers

TextBlob currently has two POS tagger implementations, located in text.taggers. The default is PatternTagger, which uses the same implementation as the excellent pattern library.

The second implementation is NLTKTagger, which uses NLTK’s TreeBank tagger. It requires numpy and only works on Python 2.

Similar to the noun phrase chunkers, you can explicitly specify which POS tagger to use by passing a tagger instance to the constructor.

from text.blob import TextBlob
from text.taggers import NLTKTagger

nltk_tagger = NLTKTagger()
blob = TextBlob("Tag! You're It!", pos_tagger=nltk_tagger)
blob.pos_tags

Testing

Run

python run_tests.py

to run all tests.

License

TextBlob is licensed under the MIT license. See the bundled LICENSE file for more details.

Changelog

0.3.9 (2013-07-31)

  • Updated nltk.

  • ConllExtractor is now Python 3-compatible.

  • Improved sentiment analysis.

  • Blobs are equal (with ==) to their string counterparts.

  • Added instructions to install textblob without nltk bundled.

  • Dropping official 3.1 and 3.2 support.

0.3.8 (2013-07-30)

  • Importing TextBlob is now much faster. This is because the noun phrase parsers are trained only on the first call to noun_phrases (instead of training them every time you import TextBlob).

  • Add text.taggers module, which allows users to change which POS tagger implementation to use. Currently supports PatternTagger and NLTKTagger (NLTKTagger only works with Python 2).

  • NPExtractor and Tagger objects can be passed to TextBlob’s constructor.

  • Fix bug with POS-tagger not tagging one-letter words.

  • Rename text/np_extractor.py -> text/np_extractors.py

  • Add run_tests.py script.

0.3.7 (2013-07-28)

  • Every word in a Blob or Sentence is a Word instance which has methods for inflection, e.g. word.pluralize() and word.singularize().

  • Updated the np_extractor module. It now has a new implementation, ConllExtractor, that uses the Conll2000 chunking corpus. Only works on Py2.
