Widget for exploring and sampling words from text data through w2v

Project description

w2widget

Widget for exploring and sampling words from text data through word2vec models in order to construct topic dictionaries.

Package content

The w2widget package contains two modules:

  • doc2vec.py
  • widget.py

Examples

In widget_example.ipynb you can try out the widget using pretrained data from the Reuters dataset.

If you want to see an example of the data workflow that generates the necessary input, check out workflow_example.ipynb.

Doc2Vec

This module helps with calculating and handling document vectors (doc2vec). Each document's vector is calculated as a weighted average of the document's word vectors, with the weights based on inverse word frequencies.

from w2widget.doc2vec import calculate_inverse_frequency, Doc2Vec

# Calculate word weights from inverse frequency
word_weights = calculate_inverse_frequency(document_tokens)

# Initiate the model
dv_model = Doc2Vec(wv_model, word_weights)

# Add documents and calculate the document vectors
dv_model.add_doc2vec(document_tokens)

# Reduce the dimensions
dv_model.reduce_dimensions()

# Store the embeddings
two_dim_doc_embedding = dv_model.TSNE_embedding_array
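
The weighted-average idea described above can be sketched in plain Python. This is only an illustration and not the package's implementation: the helper names, the log-based weighting formula, and the toy word vectors are all assumptions made for the example.

```python
import math
from collections import Counter


def inverse_frequency_weights(documents):
    """Assumed weighting: log(total tokens / word count), so rarer words weigh more."""
    counts = Counter(token for doc in documents for token in doc)
    total = sum(counts.values())
    return {word: math.log(total / count) for word, count in counts.items()}


def weighted_document_vector(tokens, word_vectors, weights):
    """Weighted average of the vectors of all tokens that have a vector and a weight."""
    known = [t for t in tokens if t in word_vectors and t in weights]
    weight_sum = sum(weights[t] for t in known)
    if not known or weight_sum == 0:
        return None  # nothing to average
    dim = len(word_vectors[known[0]])
    return [
        sum(weights[t] * word_vectors[t][i] for t in known) / weight_sum
        for i in range(dim)
    ]


# Toy data (hypothetical): two tokenized documents and 2-dimensional word vectors.
documents = [["oil", "price", "rises"], ["oil", "exports", "fall"]]
word_vectors = {
    "oil": [1.0, 0.0],
    "price": [0.0, 1.0],
    "rises": [1.0, 1.0],
    "exports": [0.5, 0.5],
    "fall": [0.0, 0.0],
}

weights = inverse_frequency_weights(documents)
doc_vectors = [
    weighted_document_vector(doc, word_vectors, weights) for doc in documents
]
```

Because "oil" occurs in both documents, it gets a lower weight than the other words and pulls each document vector less strongly towards its own word vector.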

Widget

This widget module displays the results from:

  • a gensim word2vec model,
  • its 2-dimensional embedding (i.e. TSNE),
  • the custom doc2vec model,
  • its 2-dimensional embedding (i.e. TSNE),
  • a list of tokenized documents with whitespace, and
  • optionally, a list of initial search words.

from w2widget.widget import Widget

wv_widget = Widget(
    wv_model,
    two_dim_word_embedding,
    tokens_with_ws,
    dv_model=None,
    two_dim_doc_embedding=None,
    initial_search_words=[],
)

wv_widget.display_widget()
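
One possible way to build a list of tokenized documents with whitespace is sketched below. It assumes that each token should keep its trailing whitespace so that joining the tokens reproduces the original text. The exact format the widget expects is not specified here, and the helper name and sample texts are hypothetical; workflow_example.ipynb shows the actual workflow.

```python
import re


def tokenize_with_whitespace(text):
    """Split text into tokens that each keep their trailing whitespace.

    Leading whitespace at the very start of the text is dropped.
    """
    return re.findall(r"\S+\s*", text)


# Hypothetical raw documents
raw_documents = ["Oil prices rose  sharply.", "Exports fell."]

tokens_with_ws = [tokenize_with_whitespace(doc) for doc in raw_documents]
```

With this scheme, "".join(tokens) gives back the original document text whenever the text does not start with whitespace.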

You can save the topics to a json file from the widget, or access them from the dictionary stored in wv_widget.topics.
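
As a minimal sketch of handling the topics outside the widget, the dictionary can be written to and read from a JSON file with the standard library. The structure used here (topic name mapped to a list of words) is an assumption for illustration; in practice the dictionary would come from wv_widget.topics.

```python
import json
import os
import tempfile

# Hypothetical stand-in for wv_widget.topics
topics = {
    "energy": ["oil", "gas", "crude"],
    "trade": ["exports", "tariff"],
}

# Write the topics to a JSON file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "topics.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(topics, f, indent=2, ensure_ascii=False)

# Read them back, e.g. in a later session
with open(path, encoding="utf-8") as f:
    reloaded = json.load(f)
```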

Download files


Source Distribution

w2widget-0.0.3.tar.gz (13.2 kB)

Uploaded Source

Built Distribution

w2widget-0.0.3-py3-none-any.whl (13.2 kB)

Uploaded Python 3
