Top2Vec learns jointly embedded topic, document and word vectors.

Top2Vec

Top2Vec is an algorithm for topic modeling. It automatically detects the topics present in text and generates jointly embedded topic, document and word vectors. Once you train the Top2Vec model you can:

  • Get number of detected topics.
  • Get topics.
  • Search topics by keywords.
  • Search documents by topic.
  • Find similar words.
  • Find similar documents.

Benefits

  1. Automatically finds number of topics.
  2. No stop words required.
  3. No need for stemming/lemmatizing.
  4. Works on short text.
  5. Creates jointly embedded topic, document, and word vectors.
  6. Has search functions built in.

How does it work?

The algorithm assumes that many semantically similar documents are indicative of an underlying topic. The first step is to create a joint embedding of document and word vectors. Once documents and words are embedded in a vector space, the goal of the algorithm is to find dense clusters of documents, then identify which words attracted those documents together. Each dense area is a topic, and the words that attracted the documents to the dense area are the topic words.

The Algorithm:

1. Create jointly embedded document and word vectors using Doc2Vec.

Documents will be placed close to other similar documents and close to the most distinguishing words.

Joint Document and Word Embedding

2. Create lower dimensional embedding of document vectors using UMAP.

Document vectors in high dimensional space are very sparse, so dimension reduction helps in finding dense areas. Each point is a document vector.

UMAP dimension reduced Documents

3. Find dense areas of documents using HDBSCAN.

The colored areas are the dense areas of documents. Red points are outliers that do not belong to a specific cluster.

HDBSCAN Document Clusters
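
To make "dense areas" and "outliers" concrete, here is a minimal pure-Python sketch of density-based clustering. It uses a simplified DBSCAN-style procedure swapped in for illustration; it is not HDBSCAN and not what Top2Vec runs. The function name density_clusters, the parameters eps and min_points, and the toy points are all assumptions made for this example.

```python
import math

def density_clusters(points, eps, min_points):
    """Simplified DBSCAN-style clustering (illustrative stand-in, not HDBSCAN).

    Returns one label per point: 0, 1, 2, ... for clusters, -1 for outliers.
    """
    def neighbors(i):
        # Indices of all points within distance eps of point i (including i).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = -1  # provisionally an outlier; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a dense point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_points:
                queue.extend(nbrs)  # j is dense too, so keep expanding


    return labels

# Toy 2-D "document vectors": two tight groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0)]
labels = density_clusters(points, eps=0.5, min_points=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The label -1 plays the role of the red outlier points: the isolated point belongs to no dense area.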

4. For each dense area, calculate the centroid of its document vectors in the original dimension. This centroid is the topic vector.

The red points are outlier documents and do not get used for calculating the topic vector. The purple points are the document vectors that belong to a dense area, from which the topic vector is calculated.

Topic Vector
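
The centroid step can be sketched in a few lines of pure Python. This is an illustrative sketch, not Top2Vec's own code: the function name topic_vector, the convention that outliers carry the label -1, and the toy vectors are assumptions. The cluster labels would come from clustering the reduced vectors, while the mean is taken over the original, unreduced document vectors.

```python
def topic_vector(doc_vectors, labels, cluster_id):
    """Centroid (per-dimension mean) of the document vectors in one cluster."""
    # Keep only the documents assigned to this cluster; outliers (-1) are skipped.
    members = [v for v, lab in zip(doc_vectors, labels) if lab == cluster_id]
    if not members:
        raise ValueError(f"no documents in cluster {cluster_id}")
    dims = len(members[0])
    return [sum(v[d] for v in members) / len(members) for d in range(dims)]

# Two documents in cluster 0 and one outlier that must not affect the result.
doc_vectors = [[1.0, 0.0, 2.0], [3.0, 0.0, 4.0], [100.0, 100.0, 100.0]]
labels = [0, 0, -1]
centroid = topic_vector(doc_vectors, labels, 0)
print(centroid)  # [2.0, 0.0, 3.0]
```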

5. Find the n closest word vectors to the resulting topic vector.

The closest word vectors in order of proximity become the topic words.

Topic Words
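
Ranking words by proximity to a topic vector can be sketched as below. The text above only says "closest", so the use of cosine similarity as the proximity measure is an assumption, as are the function names and the toy two-dimensional word vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def topic_words(topic_vec, word_vectors, n):
    """Return the n words whose vectors are most similar to topic_vec."""
    ranked = sorted(
        word_vectors,
        key=lambda w: cosine_similarity(topic_vec, word_vectors[w]),
        reverse=True,  # most similar first
    )
    return ranked[:n]

# Hypothetical word vectors living in the same space as the topic vector.
word_vectors = {
    "planet": [0.9, 0.1],
    "orbit": [0.8, 0.3],
    "recipe": [0.1, 0.9],
    "oven": [0.0, 1.0],
}
top_words = topic_words([1.0, 0.0], word_vectors, n=2)
print(top_words)  # ['planet', 'orbit']
```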

Installation

The easy way to install Top2Vec is:

pip install top2vec

Usage

from top2vec import Top2Vec

model = Top2Vec(documents)

Parameters:

  • documents: Input corpus, should be a list of strings.

  • speed: Controls the trade-off between training time and vector quality. The 'fast-learn' option trains fastest but generates the lowest quality vectors. The 'learn' option learns better quality vectors but takes longer to train. The 'deep-learn' option learns the best quality vectors but takes significant time to train.

  • workers: The number of worker threads used to train the model. More threads generally lead to faster training.
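
As a sketch of how these parameters fit together, the hypothetical helper below checks its arguments against the descriptions above before passing them to Top2Vec. The helper name train_top2vec, the VALID_SPEEDS tuple, and the default values chosen here are assumptions for illustration and are not part of the library.

```python
VALID_SPEEDS = ("fast-learn", "learn", "deep-learn")

def train_top2vec(documents, speed="learn", workers=4):
    """Validate the arguments, then train a Top2Vec model (hypothetical helper)."""
    if not isinstance(documents, list) or not all(isinstance(d, str) for d in documents):
        raise TypeError("documents must be a list of strings")
    if speed not in VALID_SPEEDS:
        raise ValueError(f"speed must be one of {VALID_SPEEDS}, got {speed!r}")
    if not isinstance(workers, int) or workers < 1:
        raise ValueError("workers must be a positive integer")
    # Imported here so the checks above still run where top2vec is not installed.
    from top2vec import Top2Vec
    return Top2Vec(documents, speed=speed, workers=workers)
```

With a corpus loaded into a list of strings named documents, a quick first pass might be train_top2vec(documents, speed="fast-learn", workers=8), followed by a slower 'learn' or 'deep-learn' run once the results look promising.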
