Corpus-Show makes it easier and faster to visualize a corpus through its sentence embeddings.
Corpus-Show
Corpus-Show helps you understand how the data in a corpus is distributed, using the sentence embeddings generated by Sentence Transformers.
- Corpus-Show computes sentence embeddings with Sentence Transformers, a Python framework for state-of-the-art sentence, text and image embeddings.
- You can visualize the sentence embeddings that Sentence Transformers generates for each document.
- Corpus-Show can also cluster the array of sentence embeddings with Scikit-Learn KMeans.
- The Sentence Transformer model is downloaded through the Hugging Face interface. The default model is paraphrase-xlm-r-multilingual-v1, which supports multiple languages, but you can pass any custom Sentence Transformer model available through the Hugging Face interface. Models are also easy to fine-tune via SBERT. For more models, see the Hugging Face Sentence Transformers page listed in the References.
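The embed, reduce and cluster steps described in the list above can be sketched outside of Corpus-Show. This is a minimal sketch under stated assumptions, not Corpus-Show's internal implementation: `default_encoder`, `embed_reduce_cluster` and the `encoder` parameter are hypothetical names, PCA stands in for whichever reduction you prefer, and clustering on the full embeddings (rather than on the 2-D projection) is a choice made here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def default_encoder(sentences, model_name="paraphrase-xlm-r-multilingual-v1"):
    """Encode sentences with Sentence Transformers (downloads the model on first use)."""
    # Imported lazily so the rest of the sketch works without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return np.asarray(model.encode(list(sentences)))


def embed_reduce_cluster(sentences, n_clusters=4, encoder=default_encoder, seed=0):
    """Hypothetical helper: return (2-D coordinates, k-means labels)."""
    embeddings = np.asarray(encoder(sentences), dtype=float)
    # Project the high-dimensional embeddings to 2-D for plotting.
    coords = PCA(n_components=2, random_state=seed).fit_transform(embeddings)
    # Cluster the full embeddings (an assumption of this sketch).
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(embeddings)
    return coords, labels
```

With `sentence-transformers` installed, `embed_reduce_cluster(list_of_sentences)` returns coordinates you can scatter-plot and one cluster label per sentence. Any callable that maps a list of sentences to a 2-D array can be passed as `encoder`.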
Installation
pip install corpusshow
Tutorial
We provide tutorial notebooks for all the features we offer. We plan to add docstrings and documentation starting with the official release (major version 1 or higher).
- Main-tutorials: https://github.com/DSDanielPark/corpus-show/blob/main/tutorials/corpusshow_tutorial.ipynb
- Sub-tutorial-folder: https://github.com/DSDanielPark/corpus-show/blob/main/tutorials
Main Feature
It creates simple but useful plots from a dataframe and its column names, such as the following BBC sample dataset in ./data/bbc_news_dataset.csv:
| | news | topic |
|---|---|---|
| 0 | Oil rebounds from weather effect (...) | business |
| 1 | Indonesia 'declines debt freeze' (...) | business |
| ... | ... | ... |
| 601 | EU software patent law faces axe (...) | tech |
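A CSV with a text column and a label column has the same shape as the sample above. As a minimal sketch, such a file could be written with the standard library. The helper `write_corpus_csv` and the file name `my_corpus.csv` are hypothetical, and the rows are the abbreviated entries from the table, not the full articles.

```python
import csv
import tempfile
from pathlib import Path


def write_corpus_csv(path, rows, text_col="news", label_col="topic"):
    """Hypothetical helper: write (text, label) pairs as a two-column CSV."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([text_col, label_col])  # header row with the column names
        writer.writerows(rows)
    return path


# Abbreviated rows copied from the sample table; real rows hold the full text.
rows = [
    ("Oil rebounds from weather effect (...)", "business"),
    ("Indonesia 'declines debt freeze' (...)", "business"),
    ("EU software patent law faces axe (...)", "tech"),
]
csv_path = write_corpus_csv(Path(tempfile.gettempdir()) / "my_corpus.csv", rows)
```

The resulting path and the column names `news` and `topic` are the kind of values the class arguments in the next section expect.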
1. CorpusCluster
Contains 1 static method. You can create great pictures with:
from corpusshow import CorpusCluster
# Class arguments
csv_file_path = '../data/bbc_news_dataset.csv'
sentence_transformer_model_name = 'paraphrase-xlm-r-multilingual-v1'
target_col = 'news'
num_cluster = 4
# Get class object
cc = CorpusCluster(csv_file_path, sentence_transformer_model_name, target_col, num_cluster)
# 1. quick_corpus_show method:
# Show figures without k-means clustering
cc.quick_corpus_show('topic', 'tsne2d', False, 'fig1.png')
cc.quick_corpus_show('topic', 'tsne3d', False, 'fig2.png')
cc.quick_corpus_show('topic', 'pca2d', False, 'fig3.png')
cc.quick_corpus_show('topic', 'pca3d', False, 'fig4.png')
# 2. quick_cluster_show method:
# Show figures with k-means clustering
df_returned = cc.quick_cluster_show('tsne2d', False, 'fig5.png')
df_returned = cc.quick_cluster_show('tsne3d', False, 'fig6.png')
df_returned = cc.quick_cluster_show('pca2d', False, 'fig7.png')
df_returned = cc.quick_cluster_show('pca3d', False, 'fig8.png')
- To change the design of a plot, adjust matplotlib's rcParams or build your own plot from the returned dataframe.
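Global matplotlib settings can be changed through `rcParams` before plotting. The sketch below assumes that Corpus-Show draws with matplotlib and does not override these settings itself, which is not verified here. The specific values and the output file name are arbitrary.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Settings changed here apply to every figure created afterwards.
plt.rcParams.update({
    "figure.figsize": (8, 6),
    "font.size": 12,
    "axes.grid": True,
})

out_path = os.path.join(tempfile.gettempdir(), "styled_example.png")

# rc_context limits a change to the plotting code inside the block.
with plt.rc_context({"lines.markersize": 4}):
    fig, ax = plt.subplots()
    ax.scatter([0, 1, 2], [2, 0, 1])
    fig.savefig(out_path)
    plt.close(fig)
```

Calling `plt.rcdefaults()` restores matplotlib's built-in defaults if you want to undo the global changes.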
References
[1] Scikit-Learn https://scikit-learn.org
[2] Matplotlib https://matplotlib.org/
[3] Huggingface Sentence Transformer https://huggingface.co/sentence-transformers
[4] SBERT https://www.sbert.net/
Use Case
[1] Korean-news-topic-classification-using-KO-BERT: all plots were created through Corpus-Show and Quick-Show.
Contacts
Maintainer: Daniel Park, South Korea. E-mail: parkminwoo1991@gmail.com