
guise

Semantic fingerprinting made easy.

To install: pip install guise

Overview

guise is a Python package designed to simplify the process of creating word clouds from text data. It provides tools to transform text into word clouds, allowing for easy visualization of the most prominent words in a given text. The package includes functionalities for text tokenization, word weighting, and customization of word cloud generation.

Key features:

  • Create word clouds from text files or strings.
  • Customize word tokenization and weighting.
  • Control the appearance and content of word clouds.
  • Use built-in stopwords and TF-IDF weighting.
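Here's the quickest possible taste (a minimal sketch; word_cloud and its to_image() output are explained in detail further down this page):

from guise import word_cloud

wc = word_cloud('the quick brown fox jumps over the lazy dog, and the fox again')
wc.to_image()  # a PIL.Image; displays inline if it's the last statement in a notebook cell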

The word_cloud_store function

word_cloud_store is a function that creates a word cloud store from a text store. (By "store" we mean a Mapping, the fancy way of saying dict-like.) The text store is a mapping from keys to text strings; the word cloud store is a mapping from the same keys to word clouds.
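For instance, a minimal sketch with a plain dict as the text store (the keys and texts here are made up):

from guise import word_cloud_store

texts = {
    'fruit': 'apple banana apple cherry banana apple',
    'veggies': 'carrot pea carrot bean pea carrot',
}
clouds = word_cloud_store(texts)
list(clouds)     # same keys as the text store: ['fruit', 'veggies']
clouds['fruit']  # the word cloud made from the 'fruit' text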

Let's have some fun with that...

Clouding the files of guise

from dol import TextFiles, filt_iter  # don't worry about installing it, guise comes with dol
import guise
import os

guise_rootdir = os.path.dirname(guise.__file__)
guise_files = filt_iter(TextFiles(guise_rootdir), filt=lambda x: x.endswith('.py'))
list(guise_files)
['util.py',
 'tools.py',
 '__init__.py',
 'word_scoring.py',
 'nlp.py',
 'word_clouds.py',
 'examples/__init__.py',
 'examples/energy_startups/__init__.py']

Note that guise_files is a Mapping (that is, "dict-like") of text. We often call that a (text) "store". The keys are the file names (as seen above with list(guise_files)) and the values are the contents of the files. Here are the first 30 characters of 'word_clouds.py':

guise_files['word_clouds.py'][:30]
'"""Word cloud generation with '

Now, if you give a text store (that is, any Mapping (e.g. dict) of text) to word_cloud_store, it'll give you back a store with the same keys...

from guise import word_cloud_store

guise_clouds = word_cloud_store(guise_files)
list(guise_clouds)
['util.py',
 'tools.py',
 '__init__.py',
 'word_scoring.py',
 'nlp.py',
 'word_clouds.py',
 'examples/__init__.py',
 'examples/energy_startups/__init__.py']
guise_clouds['word_clouds.py']

[word cloud image]

guise_clouds['nlp.py']

[word cloud image]

Controlling the words that get into the clouds

Of course, like all good tools, it's simple to use, but you can bend it in many ways to do many things.

For example, say you didn't like the clouds above. There are too many code bits, underscored words, and code-specific names (e.g. "self"). They don't really represent what's in the module. You'd rather the word clouds contain English words. Perhaps it's better not to use the whole code, but only the docstrings therein...

# Make a function that will extract the docstrings from a Python file
def just_the_docs(python_code, sep='\n'):
    """Extract the docstrings from a Python file.
    All the docstrings: the module's, the classes', the functions'.
    Then concatenate them all into one big string.
    """
    import ast
    module = ast.parse(python_code)
    docstrings = [ast.get_docstring(module)]  # start with the module's own docstring
    for node in ast.walk(module):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            docstrings.append(ast.get_docstring(node))
    aggregate = sep.join(filter(None, docstrings))
    return aggregate.lower()  # also, lower-case everything (to count "The" and "the" as the same word)

# Use this little dol tool...
from dol import wrap_kvs
# ... to make a new key-value store that, instead of giving you the contents of the files,
# gives you the docstrings therein. 
guise_doc_files = wrap_kvs(guise_files, value_decoder=just_the_docs)

# Now, instead of using the default str_to_words, we'll only extract sequences of letters as words
# (unlike the default, which also includes numbers and underscores).

guise_doc_clouds = word_cloud_store(guise_doc_files, str_to_words='[a-zA-Z]+')
# Note: this is equivalent to using str_to_words=re.compile('[a-zA-Z]+').findall
#   or str_to_words=lambda x: re.compile('[a-zA-Z]+').findall(x)

And now gaze at the clouds...

guise_doc_clouds['word_clouds.py']

[word cloud image]

guise_doc_clouds['nlp.py']

[word cloud image]

Controlling the weight of the words in the cloud

You probably noticed that a significant amount of visual space is taken by common words.

Words like "the", "of", "to" etc. don't really convey much information. Further, maybe the docs still contain some words you'd like to ignore. The NLP concept of "stopwords" takes care of this.

Finally, to really nail the "this word cloud gives us a snapshot of what the underlying text is about" effect, we probably want to weigh the words according to how frequently they're used in the text, but also lower the weight according to how frequently they're used in normal language, so as to highlight what is particular about the text. This is just the idea of "stop words" made "smooth" (no black/white decision, but a scale).
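To make the idea concrete, here's a rough sketch of that kind of weighting. This is only an illustration of the principle, not the formula guise actually uses, and the general_freq numbers are made up:

from collections import Counter

def smooth_stopword_weights(words, general_freq):
    """Weigh words up by their count in the text, and down by how common
    they are in general language (a 'smooth' version of stopword removal)."""
    counts = Counter(words)
    return {w: c / (1 + general_freq.get(w, 0)) for w, c in counts.items()}

general_freq = {'the': 1000, 'of': 800, 'cloud': 2}  # made-up general-English counts
smooth_stopword_weights(['the', 'the', 'cloud', 'cloud', 'rain'], general_freq)
# {'the': 0.002, 'cloud': 0.667, 'rain': 1.0}  (approximately)

guise packages this idea (along with plain stopword removal) into tools you can pass straight to word_cloud_store: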

from guise import STOPWORDS, TFIDFCalculator

# Make a list of stopwords (words to ignore) that includes the default ones plus a few more
my_stopwords = (
    list(STOPWORDS)  # the usual stopwords in English
    + ['dict', 'list', 'pandas', 'series']   # edit this to add or remove more
)

better_guise_doc_clouds = word_cloud_store(
    guise_doc_files, 
    str_to_words='[a-zA-Z]+',
    # remove the stopwords from the word cloud completely
    stopwords=my_stopwords,
    # Use "normal" English word frequencies to weigh the words in such a way that 
    # common words are less important
    words_to_weights=TFIDFCalculator('english')  
)
better_guise_doc_clouds['word_clouds.py']

[word cloud image]

better_guise_doc_clouds['nlp.py']

[word cloud image]

The word_cloud function

The word_cloud_store function uses the word_cloud function to do its job.

And the word_cloud function just adds some useful parameters to the already excellent wordcloud package (see the wordcloud documentation).

The main raison d'être of word_cloud over the already existing tools of the wordcloud package is that I wanted to make it easier to build parametrized text-sources-to-word-clouds pipelines. Namely, to better control the extraction of text, the corresponding words (a.k.a. "tokenization"), and the weighting of these words.

from guise import word_cloud
list_of_words = ['Rose', 'rose', 'to', 'put', 'rose', 'roes', 'on', 'her', 'rows', 'of', 'roses']
a_word_cloud_obj = word_cloud(list_of_words)

a_word_cloud_obj is a WordCloud object, with which you can do all kinds of things. See the wordcloud documentation for more information. The two main methods we might be interested in here are to_file(...) and to_image(). The latter gives you a PIL.Image object, which, among other things, displays in a notebook if it's the last statement in the cell.
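For example, to save the cloud to disk (the file name here is just for illustration):

a_word_cloud_obj.to_file('rose_cloud.png')  # writes a PNG file

And to_image(), which displays inline in a notebook: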

a_word_cloud_obj.to_image()

[word cloud image]

Transforming the output

The wc_decoder argument lets you transform the output on the fly. So if you wanted to get a PIL image object directly, you could do this:

word_cloud(list_of_words, wc_decoder=lambda x: x.to_image())

[word cloud image]

Always getting the same output

Notice that the output was different, though the input was the same. That's because there's some randomness built into the process. Randomness is often a good thing for creative processes, but if you need consistent outputs, you can fix the random_state.

word_cloud(list_of_words, wc_decoder=lambda x: x.to_image(), random_state=42)

[word cloud image]

word_cloud(list_of_words, wc_decoder=lambda x: x.to_image(), random_state=42)

[word cloud image]

Fixing your parameters

Once you know what you want your parameters to be, you can fix them by calling the function without the first words input; you'll get back a function (a "partial" function) with those parameters fixed.

For example, let's make a word_cloud_image that has the wc_decoder and random_state fixed:

word_cloud_image = word_cloud(wc_decoder=lambda x: x.to_image(), random_state=42)

word_cloud_image(list_of_words)

[word cloud image]

Specifying word weights explicitly

Note that a word is more prominent the more times it is repeated. You can control this "word weight" directly by specifying a {word: weight, ...} dict input.

word_cloud_image({'little': 0.2, 'BIG': 20})

[word cloud image]

# Q: Why does BIG have a weight 100 times little's, yet doesn't appear 100 times bigger?
# A: Because of wordcloud's internal defaults, meant for general aesthetics.
#    You can control this with extra arguments. For example:
word_cloud({'little': 0.2, 'BIG': 20}, min_font_size=1, max_font_size=1000).to_image()

[word cloud image]

Input simple text

If you input just a string, word_cloud will both extract the words and (count-)weight them for you:

word_cloud_image('That that is is that that is not is not is that it it is.')

[word cloud image]

Tokenization (from text to words)

quote_1 = "Success is not-final, failure is not-fatal: it is the courage-to-continue that counts."  # – Winston Churchill
word_cloud_image(quote_1)

[word cloud image]

Note how "not-final" was split into two words, "not" and "final", and that no punctuation was included. word_cloud has it's own default way of breaking the text into words (this is called "tokenization"). But you can control this by specifying your own str_to_words function.

Say you want to just split over spaces to get your "words" (really "tokens", because you'll get punctuation, compound words, etc.):

word_cloud_image(quote_1, str_to_words=lambda x: x.split(' '))

[word cloud image]

If you specify a string as your str_to_words, it will be interpreted as a regular expression that defines what a word should look like. Here we define a word to be any uninterrupted sequence of letters, numbers, or underscores (that's the \w part) or dashes.

word_cloud_image(quote_1, str_to_words=r'[\w-]+')

[word cloud image]

Control the weighting of the words

Of course, you can always take your text, tokenize it into words yourself, put those words through a weighting process, and then give the result to word_cloud (a sketch of that manual route is shown below). But if you want that process to happen automatically when given some text, you can specify str_to_words, as we've seen above, for the tokenization, and also specify the word weighting function through the words_to_weights argument.
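For reference, that manual route might look something like this (a sketch, using a simple regex for tokenization and plain counts as the weights):

import re
from collections import Counter

text = 'the cat sat on the mat and the cat napped'
words = re.findall(r'[a-zA-Z]+', text.lower())  # tokenize
weights = dict(Counter(words))                  # weight (here: plain word counts)
word_cloud_image(weights)                       # build the cloud from {word: weight}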

james_had = "James, while John had had 'had,' had had 'had had'; 'had had' had had a better effect on the teacher."

# Let's weight the words by the parity of their length:
word_cloud_image(james_had, words_to_weights=lambda x: 1 + (len(x) % 2))

[word cloud image]

That words_to_weights was a silly one, just for fun. Usually, you'll put a bit more work in to get a word weighting function that is appropriate for your narrative. The guise package offers some tools for making these in the guise.nlp module.

For example, we have the TFIDFCalculator, with which you can weight words "up" as their word count increases (the TF part) but also weight them "down" as those words are more common in the particular context you're looking at (the IDF part). The tool even offers a convenient way to compare to "general English language usage", using statistics over 333K+ words of a large English corpus.

from guise import TFIDFCalculator

words_to_weights = TFIDFCalculator('english')
word_cloud_image(james_had, words_to_weights=words_to_weights)

[word cloud image]
