Simple, Pythonic text processing. Sentiment analysis, POS tagging, noun phrase parsing, and more.
TextBlob
Simplified text processing for Python 2 and 3.
Requirements
Python >= 2.6 or >= 3.1
Installation
There are two options for installing textblob:
Option 1 includes a bundled version of NLTK (the latest from the GitHub master branch). Though this option is quicker, it will override your local NLTK installation if you have one. If this concerns you, prefer Option 2, or use textblob in a virtualenv.
Option 2 does not include NLTK, so you will have to install the latest version manually.
Instructions for both options are below.
If you don’t have pip (you should), run this first: curl https://raw.github.com/pypa/pip/master/contrib/get-pip.py | python
Option 1: With bundled NLTK
pip install textblob
curl https://raw.github.com/sloria/TextBlob/master/download_corpora.py | python
This will install textblob and download the necessary NLTK corpora.
Option 2: Install textblob and NLTK separately
pip install git+https://github.com/nltk/nltk
pip install git+https://github.com/sloria/TextBlob.git@no-bundle
curl https://raw.github.com/sloria/TextBlob/master/download_corpora.py | python
This will install the latest NLTK from the master branch, as well as the latest version of textblob from the no-bundle branch.
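After either option, a quick import check can confirm that the installation worked. The sketch below uses only the standard library. The helper name is_importable is made up for this example, and the module names text and nltk are taken from the import statements and install instructions in this README.

```python
def is_importable(module_name):
    """Return True if `module_name` can be imported in this environment."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


# `text` provides text.blob; `nltk` is the tagging/corpus dependency.
for name in ("text", "nltk"):
    status = "ok" if is_importable(name) else "missing"
    print("{0}: {1}".format(name, status))
```

If either line reports "missing", rerun the matching pip command above before downloading the corpora.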
Usage
Simple.
Create a TextBlob
from text.blob import TextBlob
wikitext = '''
Python is a widely used general-purpose, high-level programming language.
Its design philosophy emphasizes code readability, and its syntax allows
programmers to express concepts in fewer lines of code than would be
possible in languages such as C.
'''
wiki = TextBlob(wikitext)
Sentiment analysis
The sentiment property returns a tuple of the form (polarity, subjectivity) where polarity ranges from -1.0 to 1.0 and subjectivity ranges from 0.0 to 1.0.
testimonial = TextBlob("Textblob is amazingly simple to use. What great fun!")
testimonial.sentiment # (0.4583333333333333, 0.4357142857142857)
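Because sentiment is a plain (polarity, subjectivity) tuple, it can be unpacked and turned into a label. The sketch below is only an illustration: the function name describe_sentiment and the cut-off values 0.1 and 0.5 are assumptions chosen for this example, not part of textblob.

```python
def describe_sentiment(sentiment, neutral_band=0.1, subjective_cutoff=0.5):
    """Turn a (polarity, subjectivity) tuple into a short label.

    polarity is expected in [-1.0, 1.0], subjectivity in [0.0, 1.0].
    The thresholds are arbitrary illustration values.
    """
    polarity, subjectivity = sentiment
    if polarity > neutral_band:
        tone = "positive"
    elif polarity < -neutral_band:
        tone = "negative"
    else:
        tone = "neutral"
    style = "subjective" if subjectivity >= subjective_cutoff else "objective"
    return "{0}, {1}".format(tone, style)


# The tuple shown above for the testimonial blob:
print(describe_sentiment((0.4583333333333333, 0.4357142857142857)))
```

In real use you would pass testimonial.sentiment directly instead of a literal tuple.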
Tokenization
zen = TextBlob("Beautiful is better than ugly. "
"Explicit is better than implicit. "
"Simple is better than complex.")
zen.words # WordList(['Beautiful', 'is', 'better'...])
zen.sentences # [Sentence('Beautiful is better than ugly.'),
# Sentence('Explicit is better than implicit.'),
# ...]
for sentence in zen.sentences:
    print(sentence.sentiment)
Words and inflection
Each word in TextBlob.words or Sentence.words is a Word object (a subclass of unicode) with useful methods, e.g. for word inflection.
sentence = TextBlob('Use 4 spaces per indentation level.')
sentence.words
# OUT: WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])
sentence.words[2].singularize()
# OUT: 'space'
sentence.words[-1].pluralize()
# OUT: 'levels'
Get word and noun phrase frequencies
wiki.word_counts['its'] # 2 (not case-sensitive by default)
wiki.words.count('its') # Same thing
wiki.words.count('its', case_sensitive=True) # 1
wiki.noun_phrases.count('code readability') # 1
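To make the difference between the case-insensitive and case-sensitive counts concrete, here is a rough standard-library sketch of the same idea on the same text. It is not how textblob tokenizes internally; the regular expression and the helper name count_word are assumptions for illustration.

```python
import re

wikitext = '''
Python is a widely used general-purpose, high-level programming language.
Its design philosophy emphasizes code readability, and its syntax allows
programmers to express concepts in fewer lines of code than would be
possible in languages such as C.
'''


def count_word(text, word, case_sensitive=False):
    """Count occurrences of `word` among the alphanumeric tokens of `text`."""
    tokens = re.findall(r"\w+", text)
    if not case_sensitive:
        tokens = [token.lower() for token in tokens]
        word = word.lower()
    return tokens.count(word)


print(count_word(wikitext, 'its'))                       # 2 ("Its" and "its")
print(count_word(wikitext, 'its', case_sensitive=True))  # 1 (only "its")
```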
TextBlobs are like Python strings!
zen[0:19] # TextBlob("Beautiful is better")
zen.upper() # TextBlob("BEAUTIFUL IS BETTER THAN UGLY...")
zen.find("Simple") # 65
apple_blob = TextBlob('apples')
banana_blob = TextBlob('bananas')
apple_blob < banana_blob # True
apple_blob + ' and ' + banana_blob # TextBlob('apples and bananas')
"{0} and {1}".format(apple_blob, banana_blob) # 'apples and bananas'
Get start and end indices of sentences
Use sentence.start and sentence.end. This can be useful for sentence highlighting, for example.
for sentence in zen.sentences:
    print(sentence) # Beautiful is better than ugly
    print("---- Starts at index {0}, Ends at index {1}"
          .format(sentence.start, sentence.end)) # 0, 30
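As a sketch of the highlighting use case, the indices can be used to slice the original string and wrap one sentence in markers. The helper name highlight and the bracket markers are assumptions for this example, and it assumes the end index is exclusive, like a Python slice, which matches the 0, 30 pair shown above.

```python
def highlight(text, start, end, left="[", right="]"):
    """Return `text` with the span text[start:end] wrapped in markers."""
    return text[:start] + left + text[start:end] + right + text[end:]


zen_text = ("Beautiful is better than ugly. "
            "Explicit is better than implicit. "
            "Simple is better than complex.")

# 0 and 30 are the start and end indices reported for the first sentence.
print(highlight(zen_text, 0, 30))
```

With a real blob, the same call could take sentence.start and sentence.end for whichever sentence should be marked.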
Get a JSON-serialized version of the blob
zen.json # '[{"sentiment": [0.2166666666666667, 0.8333333333333334],
# "stripped": "beautiful is better than ugly",
# "noun_phrases": ["beautiful"], "raw": "Beautiful is better than ugly. ",
# "end_index": 30, "start_index": 0}
# ...]'
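Since the json property is a string, it can be decoded with the standard json module. The sketch below uses a literal sample abbreviated to the single sentence shown above rather than a live blob; the variable names are made up for illustration.

```python
import json

# Abbreviated sample in the shape shown above (first sentence only).
# With a real blob this would be: sentences = json.loads(zen.json)
sample = (
    '[{"sentiment": [0.2166666666666667, 0.8333333333333334], '
    '"stripped": "beautiful is better than ugly", '
    '"noun_phrases": ["beautiful"], '
    '"raw": "Beautiful is better than ugly. ", '
    '"end_index": 30, "start_index": 0}]'
)

sentences = json.loads(sample)
for entry in sentences:
    polarity, subjectivity = entry["sentiment"]
    print("{0!r}: chars {1}-{2}, polarity {3:.2f}".format(
        entry["stripped"], entry["start_index"], entry["end_index"], polarity))
```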
Advanced usage
Noun Phrase Chunkers
TextBlob currently has two noun phrase chunker implementations: text.np_extractors.FastNPExtractor (the default, based on Shlomi Babluki’s implementation from this blog post) and text.np_extractors.ConllExtractor, which uses the CoNLL 2000 corpus to train a tagger.
You can change the chunker implementation (or even use your own) by explicitly passing an instance of a noun phrase extractor to the TextBlob constructor.
from text.blob import TextBlob
from text.np_extractors import ConllExtractor
extractor = ConllExtractor()
blob = TextBlob("Extract my noun phrases.", np_extractor=extractor)
blob.noun_phrases # This will use the Conll2000 noun phrase extractor
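For the "use your own" case, a custom extractor might look like the sketch below. It assumes the only thing TextBlob needs from an extractor is an extract(text) method that returns a list of phrase strings. That interface, the class name CapitalizedRunExtractor, and the deliberately naive capitalized-word rule are all assumptions for illustration, so check text.np_extractors for the actual base class before relying on it.

```python
import re


class CapitalizedRunExtractor(object):
    """Toy extractor: treats runs of capitalized words as noun phrases.

    Hypothetical example; assumes an extractor only needs `extract(text)`.
    """

    _pattern = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")

    def extract(self, text):
        # Return each run of capitalized words, lowercased.
        return [match.lower() for match in self._pattern.findall(text)]


extractor = CapitalizedRunExtractor()
print(extractor.extract("We met Guido Van Rossum in New York."))

# It would then be passed in the same way as ConllExtractor:
# blob = TextBlob("Extract my noun phrases.", np_extractor=extractor)
```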
POS Taggers
TextBlob currently has two POS tagger implementations, located in text.taggers. The default is the PatternTagger which uses the same implementation as the excellent pattern library.
The second implementation is NLTKTagger which uses NLTK’s TreeBank tagger. It requires numpy and only works on Python 2.
Similar to the noun phrase chunkers, you can explicitly specify which POS tagger to use by passing a tagger instance to the constructor.
from text.blob import TextBlob
from text.taggers import NLTKTagger
nltk_tagger = NLTKTagger()
blob = TextBlob("Tag! You're It!", pos_tagger=nltk_tagger)
blob.pos_tags
Testing
Run
python run_tests.py
to run all tests.
License
TextBlob is licensed under the MIT license. See the bundled LICENSE file for more details.
Changelog
0.3.9 (unreleased)
Updated nltk.
ConllExtractor is now Python 3-compatible.
Improved sentiment analysis.
Blobs are equal (with ==) to their string counterparts.
Added instructions to install textblob without nltk bundled.
0.3.8 (2013-07-30)
Importing TextBlob is now much faster. This is because the noun phrase parsers are trained only on the first call to noun_phrases (instead of training them every time you import TextBlob).
Add text.taggers module, which allows the user to choose which POS tagger implementation to use. Currently supports PatternTagger and NLTKTagger (NLTKTagger only works with Python 2).
NPExtractor and Tagger objects can be passed to TextBlob’s constructor.
Fix bug with POS-tagger not tagging one-letter words.
Rename text/np_extractor.py -> text/np_extractors.py
Add run_tests.py script.
0.3.7 (2013-07-28)
Every word in a Blob or Sentence is a Word instance which has methods for inflection, e.g. word.pluralize() and word.singularize().
Updated the np_extractor module. It now has a new implementation, ConllExtractor, that uses the Conll2000 chunking corpus. Only works on Py2.