
STriP Net: Semantic Similarity of Scientific Papers (S3P) Network

Project description

💡 STriP Net: Semantic Similarity of Scientific Papers (S3P) Network


Do you read a lot of scientific papers? Have you ever wondered what the overarching themes in the papers you've read are, and how all the papers are semantically connected to one another? Look no further!

Leverage the power of NLP Topic Modeling, Semantic Similarity, and Network analysis to study the themes and semantic relations within a corpus of research papers.

✅ Generate STriP Network on your own collection of research papers with just three lines of code!

✅ Interactive plots to quickly identify research themes and most important papers

✅ This repo was hacked together over the weekend of New Year 2022. This is only the initial release, with lots of work planned.

💪 Please leave a ⭐ to let me know that STriP Net has been helpful to you, so that I can dedicate more of my time to working on it.

⚡ Install

Install with conda

Installing stripnet with conda is perhaps the most hassle-free option.

conda install -c conda-forge stripnet

Install with pip

If you want to install stripnet using pip, it is highly recommended to install it in a conda environment.

  • Create a conda environment (here we name the environment stripnet) and activate it.
conda create -n stripnet python=3.8 jupyterlab -y
conda activate stripnet
  • Pip install this library
pip install stripnet

🔥🚀 Generate the STriP network analysis on default settings

  • STriP can essentially run on any pandas dataframe column containing text.
  • However, the pretrained model is hardcoded (for now), so you'll see the best results when running it on a column that combines the title and abstract of each paper, separated by the [SEP] keyword. Please see below.
# Load some data
import pandas as pd
data = pd.read_csv('data.csv')

# Keep only title and abstract columns
data = data[['title', 'abstract']]

# Concat the title and abstract columns separated with [SEP] keyword
data['text'] = data['title'] + '[SEP]' + data['abstract']
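Note that if any row has a missing title or abstract, the string concatenation above produces NaN for that row, which can trip up the downstream pipeline. A defensive variant of the same step, using only plain pandas (toy data, no stripnet-specific assumptions):

```python
import pandas as pd

# Toy data with a missing abstract and a missing title
data = pd.DataFrame({
    "title": ["Paper A", "Paper B", None],
    "abstract": ["Abstract of A", None, "Abstract of C"],
})

# Drop rows where either field is missing, then build the combined text column
data = data.dropna(subset=["title", "abstract"]).copy()
data["text"] = data["title"] + "[SEP]" + data["abstract"]

print(data["text"].tolist())  # only fully populated rows survive
```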
# Instantiate the StripNet
from stripnet import StripNet
stripnet = StripNet()

# Run the StripNet pipeline
stripnet.fit_transform(data['text'])
  • If everything ran well, your browser should open a new window with a network graph similar to the one below. The graph is fully interactive! Have fun playing around by hovering over the nodes and moving them around!
  • If you are not satisfied with the topics you get, just restart the kernel and rerun it. The topic modeling framework has some level of randomness, so the topics will change slightly with every run.
  • You can also tweak the parameters of the various models; please look out for the full documentation for the details!
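For intuition about what the pipeline does: each text is embedded as a vector, pairwise semantic similarities are computed, and pairs above a threshold become network edges. A toy illustration of that thresholding idea in plain Python — illustrative only, not stripnet's actual internals (the function and vectors here are made up):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" for three papers (real ones come from a sentence transformer)
embeddings = {
    "paper_a": [1.0, 0.0, 0.2],
    "paper_b": [0.9, 0.1, 0.3],
    "paper_c": [0.0, 1.0, 0.0],
}

# Connect papers whose similarity clears the threshold
threshold = 0.8
papers = list(embeddings)
edges = [
    (p, q)
    for i, p in enumerate(papers)
    for q in papers[i + 1:]
    if cosine(embeddings[p], embeddings[q]) >= threshold
]
print(edges)  # paper_a and paper_b are similar enough to share an edge
```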

STriP Network

🏅 Find the most important paper

  • After you fit the model using the above steps, you can plot the most important papers with one line of code
  • The plot is fully interactive too! Hovering over any bar shows the relevant information of the paper.
stripnet.most_important_docs()

Most Important Text

🛠️ Common Issues

  1. If your StripNet graph is just one big ball of moving fireflies, try these steps
    • Check the value of threshold currently used by stripnet
    current_threshold = stripnet.threshold
    print(current_threshold)
    
    • Increase the value of threshold in steps of 0.05 and try again until you see a good-looking network. Remember, the max value of threshold is 1! If your threshold is already 0.95, try increasing it in steps of 0.01 instead.
    stripnet.fit_transform(data['text'], threshold=current_threshold+0.05)
    
  2. If your dataset is small (<500 rows) and the number of topics generated seems too low
    • Try lowering the value of min_topic_size from its default of 10 until you get topics that look reasonable to you
    stripnet.fit_transform(data['text'], min_topic_size=5)
    
  3. After the above two steps, if your graph looks messy, try removing isolated nodes (those nodes that don't have any connections)
    stripnet.fit_transform(data['text'], remove_isolated_nodes=True)
    
  4. In practice, you might have to tweak all three at the same time!
    stripnet.fit_transform(data['text'], threshold=current_threshold+0.05, min_topic_size=5, remove_isolated_nodes=True)
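The threshold-stepping rule from issue 1 (increase by 0.05 until 0.95, then by 0.01 up to the maximum of 1) can be written as a small helper that generates the candidate values to try. This is a hypothetical convenience function for illustration, not part of stripnet:

```python
def threshold_schedule(current, maximum=1.0):
    """Yield increasing threshold candidates: +0.05 steps below 0.95, then +0.01 steps."""
    t = current
    while t < maximum:
        step = 0.05 if t < 0.95 else 0.01
        # Round to avoid floating-point drift and cap at the maximum
        t = round(min(t + step, maximum), 2)
        yield t

print(list(threshold_schedule(0.85)))  # [0.9, 0.95, 0.96, 0.97, 0.98, 0.99, 1.0]
```

Each candidate can then be passed to `stripnet.fit_transform(data['text'], threshold=t)` until the network looks reasonable.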
    

I'm testing the network on a variety of datasets to pick better default values. Do let me know if any specific values worked best for you!

🎓 Citation

To cite STriP Net in your work, please use the following bibtex reference:

@software{marie_stephen_leo_2022_5823822,
  author       = {Marie Stephen Leo},
  title        = {STriP Net: Semantic Similarity of Scientific Papers (S3P) Network},
  month        = jan,
  year         = 2022,
  publisher    = {Zenodo},
  version      = {v0.0.5.zenodo},
  doi          = {10.5281/zenodo.5823822},
  url          = {https://doi.org/10.5281/zenodo.5823822}
}

🤩 Acknowledgements

STriP Net stands on the shoulders of giants and builds on several prior works, the most notable being:

  1. Sentence Transformers [Paper] [Code]
  2. AllenAI Specter pretrained model [Paper] [Code]
  3. BERTopic [Code]
  4. Networkx [Code]
  5. Pyvis [Code]

🙏 Buy me a coffee

If this work helped you in any way, please consider the following ways to give me feedback so I can spend more time on this project:

  1. ⭐ this repository
  2. ❤️ the Huggingface space
  3. 👏 the Medium post (Coming End Jan 2022!)
  4. Buy me a Coffee!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

stripnet-0.0.7.tar.gz (12.4 kB)

Uploaded Source

Built Distributions

stripnet-0.0.7-py3.7.egg (16.0 kB)

Uploaded Source

stripnet-0.0.7-py3-none-any.whl (13.3 kB)

Uploaded Python 3

File details

Details for the file stripnet-0.0.7.tar.gz.

File metadata

  • Download URL: stripnet-0.0.7.tar.gz
  • Upload date:
  • Size: 12.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/33.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.7.11

File hashes

Hashes for stripnet-0.0.7.tar.gz
  • SHA256: 40abb57cc734361d8e3c9178b844b7549f5ea829ea02b2353c7bc82025994bf2
  • MD5: 2452010777333363e966a052edb4e798
  • BLAKE2b-256: 45bce7452c4533e685a736dfd2e74239f57a7e8f579b7c17b28794d7dcb148e0

See more details on using hashes here.

File details

Details for the file stripnet-0.0.7-py3.7.egg.

File metadata

  • Download URL: stripnet-0.0.7-py3.7.egg
  • Upload date:
  • Size: 16.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/33.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.7.11

File hashes

Hashes for stripnet-0.0.7-py3.7.egg
  • SHA256: 9a2c69b6ec53204d2ea686f839454250a34cddd99d9260475215ac8a784b182b
  • MD5: 0712e304b670efab00aafa19bc31c98c
  • BLAKE2b-256: 98b7a3900f7f303e62edd4a142696e8c8606e8b632524c570b0f8586c7503ac5


File details

Details for the file stripnet-0.0.7-py3-none-any.whl.

File metadata

  • Download URL: stripnet-0.0.7-py3-none-any.whl
  • Upload date:
  • Size: 13.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/33.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.7.11

File hashes

Hashes for stripnet-0.0.7-py3-none-any.whl
  • SHA256: 26c2aebc965662215f808c51c367d9a7f638117e83fbb0a9db84c0686f53ec14
  • MD5: 67198c4f4329b20d53764d16304e3ee0
  • BLAKE2b-256: 0715e2abc095980074fcc98f15979d3ad7c5ace2b9f62fd6ff965ff600588dca

