
CLI to cluster scientific papers

Project description

ClusterPub

ClusterPub is a tool that helps researchers with bibliographic reviews. It clusters the search results returned by paper repositories such as IEEE Xplore and PubMed, making it easier to find papers related to their areas of interest.

Installation 🛠

To install and run ClusterPub, you need Python 3.11 or above.

Run ClusterPub 🚀

To execute ClusterPub, run the following command:

Cluster publications present in a bibliographic file

cluster-pub {source_file} {result_file}

Note: The result_file name must include the desired extension.

The allowed extensions for the source file are:

  • NBIB
  • RIS
  • BibTeX

The allowed extensions for the result file are:

  • EPS
  • JPEG
  • PDF
  • PGF
  • PNG
  • PS
  • Raw (Binary)
  • RGBA
  • SVG
  • SVGZ
  • TIF
  • TIFF
  • WebP
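
The extension rules above can be checked before the command is run. The following is a minimal sketch, not part of ClusterPub: the helper name `build_cluster_pub_command` is hypothetical, and the file suffixes (for example `.bib` for BibTeX and `.raw` for Raw) are assumptions about how the listed formats map to file names.

```python
from pathlib import Path

# Assumed file suffixes for the formats listed above; the exact spellings
# accepted by cluster-pub are not confirmed by this README.
SOURCE_SUFFIXES = {".nbib", ".ris", ".bib"}
RESULT_SUFFIXES = {
    ".eps", ".jpeg", ".pdf", ".pgf", ".png", ".ps", ".raw",
    ".rgba", ".svg", ".svgz", ".tif", ".tiff", ".webp",
}


def build_cluster_pub_command(source_file: str, result_file: str) -> list[str]:
    """Return the argument list for `cluster-pub {source_file} {result_file}`.

    Hypothetical helper: raises ValueError when a suffix is not in the
    assumed sets above.
    """
    source_suffix = Path(source_file).suffix.lower()
    result_suffix = Path(result_file).suffix.lower()
    if source_suffix not in SOURCE_SUFFIXES:
        raise ValueError(f"unsupported source extension: {source_suffix!r}")
    if result_suffix not in RESULT_SUFFIXES:
        raise ValueError(f"unsupported result extension: {result_suffix!r}")
    return ["cluster-pub", source_file, result_file]
```

The returned list can be passed to `subprocess.run(command, check=True)` once ClusterPub is installed.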

For help on the available parameters and options, run the following command:

cluster-pub --help

The project directory contains a folder called sample_files with files that can be used for testing.

Extract Clustering Metrics 📈

To calculate clustering metrics such as the Silhouette Score, Davies-Bouldin Score, and Calinski-Harabasz Score, run the following commands:

Note: The number_of_clusters argument is not the number of clusters you want to obtain; it is the number of clusters/categories that may exist in the analysed dataset.

Calculate Davies-Bouldin Score

cluster-pub-metrics davies-bouldin-score {source_file} {number_of_clusters}
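
For readers unfamiliar with the metric, here is a minimal sketch of the standard Davies-Bouldin definition using only the Python standard library. It is an illustration of what the score measures, not ClusterPub's implementation; the function names and the toy data are assumptions. Lower values indicate more compact, better separated clusters.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of coordinate tuples.

    Needs at least two clusters with distinct centroids.
    """
    centroids = [centroid(pts) for pts in clusters]
    # Scatter: average Euclidean distance of a cluster's points to its centroid.
    scatter = [
        sum(math.dist(p, c) for p in pts) / len(pts)
        for pts, c in zip(clusters, centroids)
    ]
    worst_ratios = []
    for i in range(len(clusters)):
        # Similarity of cluster i to every other cluster; keep the worst case.
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(len(clusters))
            if j != i
        ]
        worst_ratios.append(max(ratios))
    return sum(worst_ratios) / len(worst_ratios)


# Toy data (illustrative only): two tight clusters, 10 units apart.
toy_clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
toy_score = davies_bouldin(toy_clusters)  # (1 + 1) / 10 for both clusters
```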

Calculate Calinski-Harabasz Score

cluster-pub-metrics calinski-harabasz-score {source_file} {number_of_clusters}
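
As an illustration of what this score measures, the sketch below computes the standard Calinski-Harabasz ratio (between-cluster dispersion over within-cluster dispersion) with the Python standard library only. It is not ClusterPub's implementation; the function names and the toy data are assumptions. Higher values indicate better defined clusters.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def calinski_harabasz(clusters):
    """clusters: list of clusters, each a list of coordinate tuples.

    Needs at least two clusters, more points than clusters, and a
    non-zero within-cluster dispersion.
    """
    all_points = [p for pts in clusters for p in pts]
    n, k = len(all_points), len(clusters)
    overall = centroid(all_points)
    between = 0.0  # dispersion of cluster centroids around the overall centroid
    within = 0.0   # dispersion of points around their own cluster centroid
    for pts in clusters:
        c = centroid(pts)
        between += len(pts) * squared_distance(c, overall)
        within += sum(squared_distance(p, c) for p in pts)
    return (between / (k - 1)) / (within / (n - k))


# Toy data (illustrative only): between = 100, within = 4, n = 4, k = 2.
toy_clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
toy_score = calinski_harabasz(toy_clusters)
```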

Calculate Silhouette Score

cluster-pub-metrics silhouette-score {source_file} {number_of_clusters} --distance-metric={distance_metric}
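
The Silhouette Score depends on the distance metric, which is why this command takes an extra option. The sketch below shows the standard definition with a pluggable distance function, using cosine distance as the default. It is an illustration only, not ClusterPub's implementation: the function names are assumptions, and the values accepted by `--distance-metric` are not listed here.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def silhouette_score(points, labels, distance=cosine_distance):
    """Mean silhouette over all points; needs at least two clusters."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, other_label) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(other_label, []).append(distance(p, q))
        own = by_cluster.pop(label, [])
        if not own:
            # A point alone in its cluster gets a silhouette of 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance inside p's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy data (illustrative only): two groups of vectors pointing in orthogonal directions.
toy_points = [(1, 0), (2, 0), (0, 1), (0, 3)]
toy_labels = [0, 0, 1, 1]
toy_score = silhouette_score(toy_points, toy_labels)
```

Passing `distance=math.dist` switches the same function to Euclidean distance.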

To obtain help for the score commands listed above, run the following command:

cluster-pub-metrics {score_command} --help

Background Information 🔍

The default hyperparameters and algorithms used in this project are:

  • Word Embeddings Technique: Hash2Vec
  • Dimensionality Reduction Technique: SVD
  • Number of singular values used in SVD: 8
  • Clustering Algorithm: Hierarchical Clustering
  • Distance Metric: Cosine Similarity
  • Linkage Method: Weighted
  • Supported Languages: English
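
To make the clustering defaults more concrete, here is a minimal sketch of hierarchical (agglomerative) clustering with weighted linkage over cosine distances, written with the Python standard library only. It covers the clustering step alone; Hash2Vec and SVD are not shown. It is not ClusterPub's code: the function names and the toy vectors are assumptions, and a real pipeline would more likely rely on a numerical library.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def weighted_linkage(vectors, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns the clusters as sorted lists of vector indices.
    """
    clusters = {i: [i] for i in range(len(vectors))}
    # Pairwise distances keyed by (smaller id, larger id).
    dist = {
        (i, j): cosine_distance(vectors[i], vectors[j])
        for i in clusters
        for j in clusters
        if i < j
    }
    next_id = len(vectors)
    while len(clusters) > n_clusters:
        a, b = min(dist, key=dist.get)  # closest pair of clusters
        merged = clusters.pop(a) + clusters.pop(b)
        # Weighted linkage: the distance from the merged cluster to any other
        # cluster is the plain average of the distances from its two halves.
        new_dist = {}
        for k in clusters:
            d_ak = dist[(min(a, k), max(a, k))]
            d_bk = dist[(min(b, k), max(b, k))]
            new_dist[(k, next_id)] = (d_ak + d_bk) / 2
        # Drop every distance that involved the two merged clusters.
        dist = {pair: d for pair, d in dist.items() if a not in pair and b not in pair}
        dist.update(new_dist)
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Toy vectors (illustrative only): two pairs pointing in similar directions.
toy_vectors = [(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.9)]
toy_result = weighted_linkage(toy_vectors, 2)
```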
