Simple, Pythonic text processing. Sentiment analysis, POS tagging, noun phrase parsing, and more.
TextBlob
Simplified text processing for Python 2 and 3.
Requirements
Python >= 2.6 or >= 3.1
Installation
TextBlob’s only external dependency is PyYAML. A vendorized version of NLTK is bundled internally.
If you have pip:
pip install textblob
Or (if you must):
easy_install textblob
IMPORTANT: TextBlob depends on some NLTK corpora to work. The easiest way to get these is to run this command:
curl https://raw.github.com/sloria/TextBlob/master/download_corpora.py | python
You can also download the script here. Then run:
python download_corpora.py
Usage
Simple.
Create a TextBlob
from text.blob import TextBlob
wikitext = '''
Python is a widely used general-purpose, high-level programming language.
Its design philosophy emphasizes code readability, and its syntax allows
programmers to express concepts in fewer lines of code than would be
possible in languages such as C.
'''
wiki = TextBlob(wikitext)
Sentiment analysis
The sentiment property returns a tuple of the form (polarity, subjectivity) where polarity ranges from -1.0 to 1.0 and subjectivity ranges from 0.0 to 1.0.
wiki.sentiment # (0.20, 0.58)
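As a rough illustration of how a lexicon-based scorer of this kind works (a toy sketch only, not TextBlob's actual algorithm, with a made-up lexicon), scores for known words are averaged over the text:

```python
# Toy lexicon-based sentiment sketch -- NOT TextBlob's real implementation.
# Each known word maps to a (polarity, subjectivity) pair; the text's score
# is the average over the words that appear in the lexicon.
LEXICON = {
    "widely": (0.25, 0.25),
    "readability": (0.4, 0.7),
}

def toy_sentiment(text):
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not hits:
        return (0.0, 0.0)
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return (polarity, subjectivity)

polarity, subjectivity = toy_sentiment("Python is widely used")
```

Because the property returns a plain tuple in this release, it can be unpacked directly as above.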
Tokenization
zen = TextBlob("Beautiful is better than ugly. "
               "Explicit is better than implicit. "
               "Simple is better than complex.")
zen.words      # WordList(['Beautiful', 'is', 'better'...])
zen.sentences  # [Sentence('Beautiful is better than ugly.'),
               #  Sentence('Explicit is better than implicit.'),
               #  ...]
Words and inflection
Each word in TextBlob.words or Sentence.words is a Word object (a subclass of unicode) with useful methods, e.g. for word inflection.
sentence = TextBlob('Use 4 spaces per indentation level.')
sentence.words
# OUT: WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])
sentence.words[2].singularize()
# OUT: 'space'
sentence.words[-1].pluralize()
# OUT: 'levels'
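To give a feel for how suffix-rule inflection works, here is a toy sketch with a few English rules (hypothetical helper names, not TextBlob's pattern-based rules, which handle many more cases):

```python
def toy_pluralize(word):
    # A few English suffix rules; real inflection needs many more.
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and word[-2:-1] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"

def toy_singularize(word):
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

toy_pluralize("level")     # 'levels'
toy_singularize("spaces")  # 'space'
```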
Get word and noun phrase frequencies
wiki.word_counts['its'] # 2 (not case-sensitive by default)
wiki.words.count('its') # Same thing
wiki.words.count('its', case_sensitive=True) # 1
wiki.noun_phrases.count('code readability') # 1
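To illustrate what case-insensitive counting means here without installing TextBlob, a minimal stdlib sketch (naive tokenizer, hypothetical helper name):

```python
from collections import Counter
import re

def word_counts(text, case_sensitive=False):
    # Naive tokenizer: words are runs of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not case_sensitive:
        words = [w.lower() for w in words]
    return Counter(words)

text = "Its design philosophy emphasizes code readability, and its syntax is clear."
word_counts(text)["its"]                       # 2 ('Its' and 'its')
word_counts(text, case_sensitive=True)["its"]  # 1 (only 'its')
```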
TextBlobs are like Python strings!
zen[0:19] # TextBlob("Beautiful is better")
zen.upper() # TextBlob("BEAUTIFUL IS BETTER THAN UGLY...")
zen.find("Simple") # 65
apple_blob = TextBlob('apples')
banana_blob = TextBlob('bananas')
apple_blob < banana_blob # True
apple_blob + ' and ' + banana_blob # TextBlob('apples and bananas')
"{0} and {1}".format(apple_blob, banana_blob) # 'apples and bananas'
Get start and end indices of sentences
Use sentence.start and sentence.end. This can be useful for sentence highlighting, for example.
for sentence in zen.sentences:
    print(sentence)  # Beautiful is better than ugly.
    print("---- Starts at index {}, Ends at index {}"
          .format(sentence.start, sentence.end))  # 0, 30
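Since start and end are character offsets into the original text, you can recover a sentence by slicing. A plain-string sketch of the same idea, with the offsets written by hand:

```python
text = ("Beautiful is better than ugly. "
        "Explicit is better than implicit. "
        "Simple is better than complex.")

# Offsets as reported for the first sentence above.
start, end = 0, 30
text[start:end]  # 'Beautiful is better than ugly.'
```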
Get a JSON-serialized version of the blob
zen.json
# '[{"sentiment": [0.2166666666666667, 0.8333333333333334],
#    "stripped": "beautiful is better than ugly",
#    "noun_phrases": ["beautiful"],
#    "raw": "Beautiful is better than ugly. ",
#    "end_index": 30, "start_index": 0},
#   ...]'
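Because the json property is an ordinary JSON string, it can be loaded back with the stdlib json module. A sketch using a hand-written string in the same shape (field names copied from the output above):

```python
import json

# A JSON string shaped like the blob's json output (one sentence shown).
serialized = (
    '[{"sentiment": [0.2166666666666667, 0.8333333333333334], '
    '"stripped": "beautiful is better than ugly", '
    '"noun_phrases": ["beautiful"], '
    '"raw": "Beautiful is better than ugly. ", '
    '"end_index": 30, "start_index": 0}]'
)

sentences = json.loads(serialized)
sentences[0]["noun_phrases"]  # ['beautiful']
sentences[0]["end_index"]     # 30
```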
Overriding the noun phrase extractor
TextBlob currently has two noun phrase chunker implementations: text.np_extractor.FastNPExtractor (the default, based on Shlomi Babluki's implementation from this blog post) and text.np_extractor.ConllExtractor (currently Python 2 only).
You can change the chunker implementation (or even use your own) by overriding TextBlob.np_extractor:
from text.np_extractor import ConllExtractor
extractor = ConllExtractor()
blob = TextBlob("Python is a widely used general-purpose, high-level programming language.")
blob.np_extractor = extractor
blob.noun_phrases # This will use the Conll2000 noun phrase extractor
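A custom extractor is presumably any object exposing an extract(text) method that returns a list of phrases; this duck-typed interface is an assumption based on the built-in extractors, so check the text.np_extractor source before relying on it. A standalone toy sketch:

```python
import re

class NaiveNPExtractor(object):
    """Toy extractor: treats runs of capitalized words as 'noun phrases'.
    Hypothetical example only -- not one of TextBlob's extractors."""

    def extract(self, text):
        # Grab runs of two or more capitalized words.
        return re.findall(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+", text)

extractor = NaiveNPExtractor()
extractor.extract("Guido van Rossum created Python Software Foundation tools.")
# ['Python Software Foundation']
```

Assigning such an object to blob.np_extractor, as shown above with ConllExtractor, should make noun_phrases use it, assuming the interface holds.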
Testing
Run
nosetests
to run all tests.
License
TextBlob is licensed under the MIT license. See the bundled LICENSE file for more details.
Changelog for textblob
0.3.7 (2013-07-29)
Every word in a Blob or Sentence is a Word instance which has methods for inflection, e.g. word.pluralize() and word.singularize().
Updated the np_extractor module. It now has a new implementation, ConllExtractor, which uses the CoNLL-2000 chunking corpus. Only works on Python 2.
Source Distribution
Built Distributions
Hashes for textblob-0.3.7-py27-none-any.whl

Algorithm   | Hash digest
------------|------------
SHA256      | 8e1099f9eecb2c926d8826869b34d0613d5acabd7f4de26ba3e654d82d8043d1
MD5         | e0ce1f702bb2ec27dee2c49bc9055371
BLAKE2b-256 | 7f4d8e5f83498b96fb573ae86ac7963723575cdcbe3c56e946798727c5aef0c7
Hashes for textblob-0.3.7-py2.py3-none-any.whl

Algorithm   | Hash digest
------------|------------
SHA256      | 1f8512cedcc0ace23ae8d3c3ab1b649658aa5262b1a4bb1dfb0fca8a757334c5
MD5         | 5f1cab2f68adea4f2c1596f8dbc850cb
BLAKE2b-256 | 5c1db56cfda41269046e7f4e7bccae87eb86f1aa835663a58ff5c0707e7adb88