Corpus-Show makes it easier and faster to visualize a corpus through sentence embeddings.
Corpus-Show
Corpus-Show helps you understand how the data in a corpus is distributed, using values generated by Sentence Transformers.
- Corpus-Show performs sentence embedding via Sentence Transformers, a Python framework for state-of-the-art sentence, text and image embeddings. [Paper] [Document] [Huggingface model]
- You can visualize the embedded sentences of each document generated from SentenceTransformers.
- Corpus-Show can also cluster the array of sentence embeddings with Scikit-Learn KMeans.
- The sentence transformer model is downloaded through the Hugging Face interface. The default model is paraphrase-xlm-r-multilingual-v1, which supports multiple languages, but you can pass your own custom sentence transformer model through the same interface. Models are also easy to fine-tune via SBERT. For more models, please see this page.
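The embed-then-cluster flow described above can be sketched roughly as follows. This is a minimal illustration built directly on Sentence Transformers and Scikit-Learn, not Corpus-Show's own implementation. The helper names `embed_sentences` and `cluster_embeddings` are hypothetical, and random vectors stand in for real embeddings so the sketch runs without downloading a model.

```python
import numpy as np
from sklearn.cluster import KMeans


def embed_sentences(sentences, model_name="paraphrase-xlm-r-multilingual-v1"):
    """Encode sentences into an array of shape (n_sentences, embedding_dim)."""
    # Imported lazily so the clustering helper works without the model download.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return np.asarray(model.encode(sentences))


def cluster_embeddings(embeddings, num_cluster=4, seed=0):
    """Assign each embedding row to one of `num_cluster` k-means clusters."""
    kmeans = KMeans(n_clusters=num_cluster, n_init=10, random_state=seed)
    return kmeans.fit_predict(embeddings)


# Stand-in embeddings (assumption): two well-separated groups of 8-dim vectors.
rng = np.random.default_rng(0)
fake_embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 8)),
    rng.normal(loc=5.0, scale=0.1, size=(5, 8)),
])
labels = cluster_embeddings(fake_embeddings, num_cluster=2)
```

With real data you would replace `fake_embeddings` with `embed_sentences(list_of_texts)`; the first call downloads the model from Hugging Face.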
Installation
pip install corpusshow
- This package may not work properly on Apple M1/M2 Macs. If you are using an M1/M2 Mac, please use the git repository as a submodule instead; the code has minimal encapsulation, so this is straightforward. [issue#1]
Tutorial
We provide tutorial notebooks for all the features we offer. We plan to provide additional docstrings and documentation starting with the official release (major version 1 or higher).
- Main-tutorials: https://github.com/DSDanielPark/corpus-show/blob/main/tutorials/corpusshow_tutorial.ipynb
- Sub-tutorial-folder: https://github.com/DSDanielPark/corpus-show/blob/main/tutorials
Main Feature
It creates a simple but useful plot, as shown below, from a dataframe and column names as input. The example uses the BBC sample dataset in ./data/bbc_news_dataset.csv.
| | news | topic |
|---|---|---|
| 0 | Oil rebounds from weather effect (...) | business |
| 1 | Indonesia 'declines debt freeze' (...) | business |
| ... | ... | ... |
| 601 | EU software patent law faces axe (...) | tech |
1. CorpusCluster
Contains 1 static method. You can create great pictures with:
from corpusshow import CorpusCluster
# Class arguments
csv_file_path = '../data/bbc_news_dataset.csv'
sentence_transformer_model_name = 'paraphrase-xlm-r-multilingual-v1'
target_col = 'news'
num_cluster = 4
# Get class object
cc = CorpusCluster(csv_file_path, sentence_transformer_model_name, target_col, num_cluster)
# 1. quick_corpus_show method:
# Show figures without k-means clustering
cc.quick_corpus_show('topic', 'tsne2d', False, 'fig1.png')
cc.quick_corpus_show('topic', 'tsne3d', False, 'fig2.png')
cc.quick_corpus_show('topic', 'pca2d', False, 'fig3.png')
cc.quick_corpus_show('topic', 'pca3d', False, 'fig4.png')
# 2. quick_cluster_show method:
# Show figures with k-means clustering
df_returned = cc.quick_cluster_show('tsne2d', False, 'fig5.png')
df_returned = cc.quick_cluster_show('tsne3d', False, 'fig6.png')
df_returned = cc.quick_cluster_show('pca2d', False, 'fig7.png')
df_returned = cc.quick_cluster_show('pca3d', False, 'fig8.png')
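The method strings used above, such as 'pca2d' and 'tsne3d', presumably select a dimensionality reduction technique and a target dimension. The sketch below shows one way such a projection could be done with Scikit-Learn. It is an assumption about the general technique, not Corpus-Show's actual code, and `project_embeddings` is a hypothetical helper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def project_embeddings(embeddings, method="pca2d", seed=0):
    """Reduce embeddings to 2 or 3 dimensions based on a method string."""
    n_components = 3 if method.endswith("3d") else 2
    if method.startswith("pca"):
        reducer = PCA(n_components=n_components, random_state=seed)
    elif method.startswith("tsne"):
        # t-SNE requires perplexity to be smaller than the number of samples.
        perplexity = min(30.0, len(embeddings) - 1)
        reducer = TSNE(n_components=n_components, perplexity=perplexity,
                       random_state=seed)
    else:
        raise ValueError(f"unknown method: {method}")
    return reducer.fit_transform(embeddings)


# Stand-in embeddings (assumption): 20 random 8-dimensional vectors.
rng = np.random.default_rng(0)
points = rng.normal(size=(20, 8))
coords_2d = project_embeddings(points, "pca2d")
coords_3d = project_embeddings(points, "pca3d")
```

PCA is linear and fast, while t-SNE is slower but tends to separate local neighborhoods more visibly, which may be why both options are offered.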
- If you want to change the design of the plot, adjust matplotlib's rcParams or build your own plot from the returned dataframe.
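As a rough illustration of the styling route, the sketch below overrides a few matplotlib rcParams and then draws a scatter plot. The data here is made up for the example. The column layout of the dataframe returned by quick_cluster_show is not documented in this README, so plain lists are used instead of guessing at its column names.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Global style overrides; they apply to figures created afterwards.
plt.rcParams.update({
    "figure.figsize": (8, 6),
    "font.size": 12,
    "axes.titlesize": 14,
})

# Hypothetical stand-in data; in practice, take coordinates and cluster
# labels from the returned dataframe.
x = [0.1, 0.4, 0.9, 1.3]
y = [1.0, 0.7, 0.2, 0.5]
cluster = [0, 0, 1, 1]

fig, ax = plt.subplots()
ax.scatter(x, y, c=cluster, cmap="viridis")
ax.set_title("Custom-styled scatter")
# fig.savefig("custom_fig.png") would write the styled figure to disk.
```

Setting rcParams before calling the Corpus-Show methods should affect any figure they create through matplotlib, since rcParams are global defaults.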
References
[1] Scikit-Learn https://scikit-learn.org
[2] Matplotlib https://matplotlib.org/
[3] Huggingface Sentence Transformer https://huggingface.co/sentence-transformers
[4] SBERT https://www.sbert.net/
Use Case
[1] Korean-news-topic-classification-using-KO-BERT: all plots were created through Corpus-Show and Quick-Show.
Contacts
Maintainer: Daniel Park, South Korea (e-mail: parkminwoo1991@gmail.com)