
Corpus-Show makes it easier and faster to visualize a corpus through sentence embeddings.


Corpus-Show


Corpus-Show helps you understand the distribution of your corpus data through values generated by Sentence Transformers. (It is not a sophisticated package, but it makes visualization convenient.)

  • Corpus-Show performs sentence embedding via Sentence Transformers, a Python framework for state-of-the-art sentence, text and image embeddings. [Paper] [Document] [Huggingface model]
  • You can visualize the embedded sentences of each document generated by Sentence Transformers.
  • Corpus-Show can also cluster the array of sentence embeddings with Scikit-Learn KMeans.
  • The sentence transformer model is downloaded through the Hugging Face interface. The default model is paraphrase-xlm-r-multilingual-v1, which supports multiple languages, but you can pass your own custom model as the sentence transformer model through the same interface. Models can also be fine-tuned via SBERT. For more models, please see this page.
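To make the clustering step concrete, here is a minimal sketch of what k-means does to a set of embedding vectors. It is not Corpus-Show's implementation: a small pure-Python Lloyd's iteration stands in for Scikit-Learn KMeans, the toy 2-D vectors stand in for real sentence embeddings, and the function name `kmeans` and its evenly spaced initialization are assumptions made for illustration (Scikit-Learn uses k-means++ initialization by default).

```python
import math


def kmeans(vectors, k, iterations=20):
    """Illustrative Lloyd's algorithm on a list of equal-length float vectors."""
    # Simple deterministic initialization: pick k evenly spaced input vectors.
    centroids = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid (Euclidean).
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Made-up 2-D vectors standing in for sentence embeddings.
toy_embeddings = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0], [4.9, 5.0],
]
labels = kmeans(toy_embeddings, k=2)
print(labels)
```

Real sentence embeddings have hundreds of dimensions, which is why the package projects them to 2D or 3D before plotting.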

Installation

pip install corpusshow
  • This package may not work properly in an M1/M2 Mac environment. If you are using an M1/M2 Mac, please use the git repository as a submodule instead, since the package has minimal encapsulation. [issue#1]

Tutorial

We provide tutorial notebooks for all the features we offer. We plan to add docstrings and documentation starting with the official release (major version 1 or higher).

  1. Main-tutorials: https://github.com/DSDanielPark/corpus-show/blob/main/tutorials/corpusshow_tutorial.ipynb
  2. Sub-tutorial-folder: https://github.com/DSDanielPark/corpus-show/blob/main/tutorials

Main Feature

It creates simple but useful plots from a simple dataframe and column names as input, such as the following BBC sample dataset in ./data/bbc_news_dataset.csv.

     news                                      topic
0    Oil rebounds from weather effect (...)    business
1    Indonesia 'declines debt freeze' (...)    business
...  ...                                       ...
601  EU software patent law faces axe (...)    tech
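If you want to try the package on your own data, a CSV with the same two-column layout can be produced with the standard library. The sketch below is only an illustration: the output file name `my_corpus.csv` is hypothetical, the rows are copied from the sample table (with "(...)" marking text elided in the source), and it assumes the package only needs a text column plus an optional label column, as in the BBC sample.

```python
import csv
import os
import tempfile

# Rows copied from the sample table; "(...)" marks text elided in the source.
rows = [
    {"news": "Oil rebounds from weather effect (...)", "topic": "business"},
    {"news": "Indonesia 'declines debt freeze' (...)", "topic": "business"},
    {"news": "EU software patent law faces axe (...)", "topic": "tech"},
]

# Hypothetical output location; any writable path works.
csv_path = os.path.join(tempfile.gettempdir(), "my_corpus.csv")

with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["news", "topic"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the column layout.
with open(csv_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded[0]["topic"])
```

The text column name is what you would then pass as the target column, and the label column is what you would color the plot by.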

1. CorpusCluster

Contains one static method. You can create plots as follows:

from corpusshow import CorpusCluster

# Class arguments
csv_file_path = '../data/bbc_news_dataset.csv'
sentence_transformer_model_name = 'paraphrase-xlm-r-multilingual-v1'
target_col = 'news'
num_cluster = 4

# Get class object
cc = CorpusCluster(csv_file_path, sentence_transformer_model_name, target_col, num_cluster)

# 1. quick_corpus_show method: 
# Show figures without k-means clustering
cc.quick_corpus_show('topic', 'tsne2d', False, 'fig1.png')
cc.quick_corpus_show('topic', 'tsne3d', False, 'fig2.png')
cc.quick_corpus_show('topic', 'pca2d', False, 'fig3.png')
cc.quick_corpus_show('topic', 'pca3d', False, 'fig4.png')

# 2. quick_cluster_show method:
# Show figures with k-means clustering
df_returned = cc.quick_cluster_show('tsne2d', False, 'fig5.png')
df_returned = cc.quick_cluster_show('tsne3d', False, 'fig6.png')
df_returned = cc.quick_cluster_show('pca2d', False, 'fig7.png')
df_returned = cc.quick_cluster_show('pca3d', False, 'fig8.png')

  • If you want to change the design of the plot, adjust matplotlib's rcParams or build your own plot from the returned dataframe.
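As a rough illustration of the rcParams route, the sketch below sets a few global matplotlib defaults before any plotting call. The specific keys and values are arbitrary examples, and it assumes Corpus-Show does not override these settings internally, which the source does not confirm.

```python
import matplotlib

# Non-interactive backend, so figures can be saved without a display.
matplotlib.use("Agg")

# Example style overrides; the values are arbitrary illustrations.
matplotlib.rcParams.update({
    "figure.figsize": (8, 6),
    "font.size": 12,
    "savefig.dpi": 150,
})
```

Because rcParams are global, they apply to every figure created afterwards in the same Python session.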

References

[1] Scikit-Learn https://scikit-learn.org
[2] Matplotlib https://matplotlib.org/
[3] Huggingface Sentence Transformer https://huggingface.co/sentence-transformers
[4] SBERT https://www.sbert.net/


Use Case

[1] Korean-news-topic-classification-using-KO-BERT: all plots were created through Corpus-Show and Quick-Show.

Contacts

Maintainer: Daniel Park, South Korea. E-mail: parkminwoo1991@gmail.com
