guise
Semantic fingerprinting made easy.
To install: pip install guise
Overview
guise is a Python package designed to simplify the process of creating word clouds from text data. It provides tools to transform text into word clouds, allowing for easy visualization of the most prominent words in a given text. The package includes functionalities for text tokenization, word weighting, and customization of word cloud generation.
Key features:
- Create word clouds from text files or strings.
- Customize word tokenization and weighting.
- Control the appearance and content of word clouds.
- Use built-in stopwords and TF-IDF weighting.
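For a quick taste, here's a minimal sketch using the word_cloud function described further below (the sentence is just an illustration):

from guise import word_cloud

# turn a plain string into a word cloud and render it as a PIL image
word_cloud('The quick brown fox jumps over the lazy dog').to_image()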
The word_cloud_store function
word_cloud_store is a function that creates a word cloud store from a text store.
(By "store" we mean a Mapping; that's the fancy way of saying dict-like.)
The text store is a mapping from keys to text strings.
The word cloud store is a mapping from keys to word clouds.
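For instance, a minimal sketch with a plain dict (the keys and texts below are made up for illustration):

from guise import word_cloud_store

texts = {
    'fruit': 'apple banana apple cherry apple',
    'veggies': 'carrot potato carrot leek',
}
clouds = word_cloud_store(texts)
list(clouds)  # same keys as the input: ['fruit', 'veggies']
clouds['fruit']  # the word cloud computed from the 'fruit' text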
Let's have some fun with that...
Clouding the files of guise
from dol import TextFiles, filt_iter # don't worry about installing it, guise comes with dol
import guise
import os
guise_rootdir = os.path.dirname(guise.__file__)
guise_files = filt_iter(TextFiles(guise_rootdir), filt=lambda x: x.endswith('.py'))
list(guise_files)
['util.py',
'tools.py',
'__init__.py',
'word_scoring.py',
'nlp.py',
'word_clouds.py',
'examples/__init__.py',
'examples/energy_startups/__init__.py']
See that guise_files is a Mapping (that means "dict-like") of text.
We often call that a (text) "store".
The keys are the file names (as seen above with list(guise_files)) and the values are
the contents of the file. See the first 30 characters of 'word_clouds.py'.
guise_files['word_clouds.py'][:30]
'"""Word cloud generation with '
Now, if you give a text store (that is, any Mapping (e.g. dict) of text) to word_cloud_store,
it'll give you back a store with the same keys...
from guise import word_cloud_store
guise_clouds = word_cloud_store(guise_files)
list(guise_clouds)
['util.py',
'tools.py',
'__init__.py',
'word_scoring.py',
'nlp.py',
'word_clouds.py',
'examples/__init__.py',
'examples/energy_startups/__init__.py']
guise_clouds['word_clouds.py']
guise_clouds['nlp.py']
Controlling the words that get into the clouds
Of course, like all good tools, it's simple to use, but you can bend it in many ways, to do many things.
For example, say you didn't like the clouds above. There are too many code bits, underscored words, and code-specific names (e.g. "self"). They don't really represent what's in the module. You'd rather the word clouds had English words in them. Perhaps it's better not to use the code itself, but only the docstrings therein...
# Make a function that will extract the docstrings from a Python file
def just_the_docs(python_code, sep='\n'):
    """Extract the docstrings from a Python file.
    All the docstrings: The module's, the classes', the functions'.
    Then concatenate them all into one big string.
    """
    import ast

    module = ast.parse(python_code)
    docstrings = [ast.get_docstring(module)]  # the module's own docstring
    for node in ast.walk(module):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            docstrings.append(ast.get_docstring(node))
    aggregate = sep.join(filter(None, docstrings))
    # also, lower-case everything (so "The" and "the" count as the same word)
    return aggregate.lower()
# Use this little dol tool...
from dol import wrap_kvs
# ... to make a new key-value store that, instead of giving you the contents of the files,
# gives you the docstrings therein.
guise_doc_files = wrap_kvs(guise_files, value_decoder=just_the_docs)
# Now, instead of using the default str_to_words, we'll extract only sequences of letters as words
# (the default would also include numbers and underscores).
guise_doc_clouds = word_cloud_store(guise_doc_files, str_to_words='[a-zA-Z]+')
# Note: this is equivalent to using str_to_words=re.compile('[a-zA-Z]+').findall
# or str_to_words=lambda x: re.compile('[a-zA-Z]+').findall(x)
And now gaze at the clouds...
guise_doc_clouds['word_clouds.py']
guise_doc_clouds['nlp.py']
Controlling the weight of the words in the cloud
You probably noticed that a significant amount of visual space is taken by common words.
Words like "the", "of", "to" etc. don't really convey much information. Further, maybe the docs still contain some words you'd like to ignore. The NLP concept of "stopwords" takes care of this.
Finally, to really nail the "this word cloud gives us a snapshot of what the underlying text is about" effect, we probably want to weigh the words according to how frequently they're used in the text, yes, but also lower their weight according to how frequently they're used in normal language, to really highlight what is particular about the text. This is just the idea of "stop words" made "smooth" (not a black/white decision, but a scale).
from guise import STOPWORDS, TFIDFCalculator
# Make a list of stopwords (words to ignore) that includes the default ones plus a few more
my_stopwords = (
    list(STOPWORDS)  # the usual stopwords in English
    + ['dict', 'list', 'pandas', 'series']  # edit this to add or remove more
)
better_guise_doc_clouds = word_cloud_store(
    guise_doc_files,
    str_to_words='[a-zA-Z]+',
    # remove the stopwords from the word cloud completely
    stopwords=my_stopwords,
    # Use "normal" English word frequencies to weigh the words in such a way that
    # common words are less important
    words_to_weights=TFIDFCalculator('english'),
)
better_guise_doc_clouds['word_clouds.py']
better_guise_doc_clouds['nlp.py']
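At this point the pipeline has several pieces: file access, docstring extraction, tokenization, stopwords, and TF-IDF weighting. Here's a minimal sketch (not part of guise itself, just a composition of the tools used above) that bundles them into one reusable function you can point at any other folder of .py files:

def folder_to_doc_clouds(rootdir, stopwords=my_stopwords):
    # the text store: the .py files under rootdir ...
    py_files = filt_iter(TextFiles(rootdir), filt=lambda x: x.endswith('.py'))
    # ... decoded to just their docstrings ...
    doc_files = wrap_kvs(py_files, value_decoder=just_the_docs)
    # ... and turned into word clouds, with the same settings as above
    return word_cloud_store(
        doc_files,
        str_to_words='[a-zA-Z]+',
        stopwords=stopwords,
        words_to_weights=TFIDFCalculator('english'),
    )

# For example, applied to guise's own source folder again:
# folder_to_doc_clouds(guise_rootdir)['nlp.py']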
The word_cloud function
The word_cloud_store function uses the word_cloud function to do its job.
The word_cloud function just adds some useful parameters to the already excellent
wordcloud package (see the wordcloud documentation).
The main raison d'être of word_cloud over the existing tools of the wordcloud package is
that I wanted to make it easier to build parametrized text-sources-to-word-clouds pipelines.
Namely, to better control the extraction of text, its tokenization into words, and the weighting of those words.
from guise import word_cloud
list_of_words = ['Rose', 'rose', 'to', 'put', 'rose', 'roes', 'on', 'her', 'rows', 'of', 'roses']
a_word_cloud_obj = word_cloud(list_of_words)
a_word_cloud_obj is a WordCloud object, with which you can do all kinds of things.
See word_cloud's documentation for more information.
The two main methods we might be interested in here are to_file(...) and to_image().
The latter gives you a PIL.Image object, which, notably, displays in a notebook if it's the last statement in the cell.
a_word_cloud_obj.to_image()
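The to_file(...) method, mentioned above, writes the rendered image to disk instead. A quick sketch (the filename is just an illustration):

a_word_cloud_obj.to_file('rose_cloud.png')  # writes the cloud to a PNG at that path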
Transforming the output
The wc_decoder argument lets you transform the output on the fly.
So if you wanted to get a PIL image object directly, you could do this:
word_cloud(list_of_words, wc_decoder=lambda x: x.to_image())
Always getting the same output
Notice that the output was different, though the input was the same.
That's because there's some randomness built into the process.
Randomness is often a good thing for creative processes, but if you need consistent outputs, you can fix the random_state.
word_cloud(list_of_words, wc_decoder=lambda x: x.to_image(), random_state=42)
word_cloud(list_of_words, wc_decoder=lambda x: x.to_image(), random_state=42)
Fixing your parameters
Once you know what you want your parameters to be, you can fix them by calling the function without the first words input; you'll get back a function (called a "partial" function) with those parameters fixed.
For example, let's make a word_cloud_image that has the wc_decoder and random_state fixed:
word_cloud_image = word_cloud(wc_decoder=lambda x: x.to_image(), random_state=42)
word_cloud_image(list_of_words)
Specifying word weights explicitly
Note that a word is more prominent the more times it is repeated.
You can control this "word weight" directly if you specify a {word: weight, ...} dict input.
word_cloud_image({'little': 0.2, 'BIG': 20})
# Q: Why does BIG have a weight 100 times little, yet doesn't appear 100 times bigger?
# A: Because of wordcloud internal defaults meant for general aesthetics. You can control this with extra arguments. For example:
word_cloud({'little': 0.2, 'BIG': 20}, min_font_size=1, max_font_size=1000).to_image()
Input simple text
If you input just a string, word_cloud will both extract the words and (count-)weight them for you:
word_cloud_image('That that is is that that is not is not is that it it is.')
Tokenization (from text to words)
quote_1 = "Success is not-final, failure is not-fatal: it is the courage-to-continue that counts." # – Winston Churchill
word_cloud_image(quote_1)
Note how "not-final" was split into two words, "not" and "final", and that no punctuation was included.
word_cloud has its own default way of breaking the text into words (this is called "tokenization").
But you can control this by specifying your own str_to_words function.
Say you want to just split over spaces to get your "words" (really "tokens", because you'll get punctuation, composed words, etc.):
word_cloud_image(quote_1, str_to_words=lambda x: x.split(' '))
If you specify a string as your str_to_words, it will be considered as a regular expression that defines what a matching word should be.
Here we define a word to be any uninterrupted sequence of letters, numbers or underscores (that's the \w part) or dashes.
word_cloud_image(quote_1, str_to_words=r'[\w-]+')
Control the weighting of the words
Of course, you can always take your text, tokenize it into words, put those words through a weighting process yourself, and then give the result to word_cloud. But if you want that process to happen automatically when given some text, you can specify str_to_words, as we've seen
above, for the tokenization, and also specify the word weighting function through the words_to_weights argument.
james_had = "James, while John had had 'had,' had had 'had had'; 'had had' had had a better effect on the teacher."
# Let's weight the words by the parity of their length:
word_cloud_image(james_had, words_to_weights=lambda x: 1 + (len(x) % 2))
That words_to_weights was a silly one, just for fun.
Usually, you'll put a bit more work in to get a word weighting function that is appropriate
for your narrative.
The guise package offers some tools for making these in the guise.nlp module.
For example, we have the TFIDFCalculator, with which you can weigh words "up" as
their count in the text increases (the TF part), but also weigh them "down" the more
common they are in the particular context you're looking at (the IDF part).
The tool even offers a convenient way to just compare against "general English language usage",
using statistics over 333K+ words of a large English corpus.
from guise import TFIDFCalculator
words_to_weights = TFIDFCalculator('english')
word_cloud_image(james_had, words_to_weights=words_to_weights)
File details
Details for the file guise-0.0.8.tar.gz.
File metadata
- Download URL: guise-0.0.8.tar.gz
- Upload date:
- Size: 24.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fb6bcdd59c03b08469b4d6822716a5ee28ab5a332c4b847940df5c5a0102c986 |
| MD5 | d8e80674679b8972c8e3388b5ed38516 |
| BLAKE2b-256 | 592f7f3aa63f49986573f16fc9ee3457c1281ff77445103cc409e7e99a183b49 |
File details
Details for the file guise-0.0.8-py3-none-any.whl.
File metadata
- Download URL: guise-0.0.8-py3-none-any.whl
- Upload date:
- Size: 22.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 74ecd80b8c79407fa6c83a63eb58cbe1a9fb065a9729bd4ff478daf836bfd3e4 |
| MD5 | 5069be142175a35de770688b9a8f2101 |
| BLAKE2b-256 | 72f80372f585d61cd7cf1c8ebcddd4d1f133c3d9c0c2d6d407b2014d22e09c14 |