A library for classifying texts into one or more of the 17 SDGs using a pretrained transformer-based model or keyword extraction method, with an option to combine both approaches.

SDGDetector

SDGDetector is an open-source Python library that classifies texts according to the Sustainable Development Goals (SDGs). It can use a pretrained, fine-tuned model to classify the given texts, use keyword extraction to associate them with the SDGs, or combine the two methods.

  1. The first method takes as input a list of texts and a fine-tuned XLNet or RoBERTa model, and returns the probability that each text is associated with each SDG. The training/testing F1-score of the models is 0.90; XLNet performs better than RoBERTa.
  2. The second method takes as input a list of texts, finds the most relevant keywords in them, and computes their cosine similarity with the keywords of each SDG. The SDG keywords are based on the methodology explained here, and the keywords for SDG 17 are taken from the list which can be found here.
  3. The third option of this library is a combination of the two methods above. [Formula: image not available]

The methodology used for this library is available in the SustaiNLP GitLab repository.

Features

  • SDG_classifier_using_model: Classifies the given text with the SDGs by using the fine-tuned XLNet or RoBERTa models.
  • SDG_classifier_using_keywords_extraction: Classifies the given text with the SDGs by using keyword extraction and sentence embeddings generated by one of the following models: all-mpnet-base-v2, distilbert-base-nli-mean-tokens, all-MiniLM-L6-v2. Representative keywords are identified with the Maximal Marginal Relevance (MMR) algorithm and compared to the SDG keywords using cosine similarity.
  • SDG_classifier: Classifies the given text with the SDGs by combining the aforementioned methods.
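The keyword-based feature above relies on two ideas: MMR selection and cosine similarity. The following is a minimal sketch of both, assuming the embeddings are already available as plain vectors. The function names `cosine_similarity_matrix` and `mmr_select` are hypothetical and are not part of the SDGDetector API; the library's own implementation may differ.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def mmr_select(doc_embedding, candidate_embeddings, top_n=5, diversity=0.3):
    """Return indices of candidates chosen by Maximal Marginal Relevance.

    Hypothetical helper: each step picks the candidate that is most similar
    to the document while being least similar to the candidates already chosen.
    """
    candidates = np.asarray(candidate_embeddings, dtype=float)
    doc_sim = cosine_similarity_matrix(candidates, [doc_embedding]).ravel()
    cand_sim = cosine_similarity_matrix(candidates, candidates)

    # Start with the candidate closest to the document.
    selected = [int(np.argmax(doc_sim))]
    remaining = [i for i in range(len(candidates)) if i != selected[0]]

    while remaining and len(selected) < top_n:
        # Redundancy: highest similarity to any already selected candidate.
        redundancy = cand_sim[np.ix_(remaining, selected)].max(axis=1)
        scores = (1 - diversity) * doc_sim[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

A higher `diversity` value favours keywords that differ from those already selected, while a lower value favours keywords closest to the text. The selected keyword embeddings could then be compared with SDG keyword embeddings through `cosine_similarity_matrix`.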

⚙️ Installation

To install the latest stable version from PyPI, run:

pip install SDGDetector

Alternatively, if you prefer to install the latest development version, you can install it directly from GitLab:

pip install git+https://gitlab.com/netmode/sdg-detector.git

Or, you can manually clone the repository and run:

git clone https://gitlab.com/netmode/sdg-detector.git
cd sdg-detector
pip install .

📦 Prerequisites

Before installing, make sure you are running Python >= 3.11 and have the following Python packages installed:

'numpy>=1.26.4',  
'nltk>=3.9.1',  
'sentence-transformers>=3.4.1', 
'sentencepiece>=0.2.0', 
'keras_preprocessing>=1.1.2'
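To verify these prerequisites before installing, a small check along the following lines may help. It is a sketch that uses only the standard library; the helper names `parse` and `check_prerequisites` are hypothetical, and the version comparison is deliberately simple (numeric parts only), so it is not a full version-specifier parser.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the prerequisites list above.
REQUIRED = {
    "numpy": "1.26.4",
    "nltk": "3.9.1",
    "sentence-transformers": "3.4.1",
    "sentencepiece": "0.2.0",
    "keras_preprocessing": "1.1.2",
}


def parse(version_string):
    """Turn '3.9.1' into (3, 9, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_prerequisites(required=REQUIRED):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < (3, 11):
        problems.append("Python >= 3.11 is required")
    for name, minimum in required.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed (need >= {minimum})")
            continue
        if parse(installed) < parse(minimum):
            problems.append(f"{name}: found {installed}, need >= {minimum}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```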

📖 Documentation

For documentation, see the Wiki.

💻 Example Usage

The file Test.ipynb contains examples for the three classes of this library.

The fine-tuned pretrained XLNet and RoBERTa models, which are used in the first class SDG_classifier_using_model, can be found here under the folder 'Data/Classification Task-Transfer Learning'. The training/testing F1-score of the models is 0.90; XLNet performs better than RoBERTa. In addition, this library can be used with the user's own fine-tuned model. The requirements for a different fine-tuned model are:

  • It should be a 'BERT' or 'XLNet' model.
  • It should be implemented in PyTorch and saved using the Python code torch.save(model.state_dict(), model_name).

Importing the Library

import os
from SDGDetector import SDGDetector

text = ['Europe has always been the home of industry. For centuries, it has been a pioneer in industrial innovation and has helped \
    improve the way people around the world produce, consume and do business. Based on a strong internal market, the European industry \
    has long powered our economy, providing a stable living for millions and creating the social hubs around which our communities are built.']

SDG_classifier_using_model

# replace the placeholder string with the path of the downloaded model
model = SDGDetector.SDG_classifier_using_model(model_name='XLNet', model_path='<your path of downloaded model>')

# apply classifier on example input text
sdg,sdg_names,probs = model.predict(text, return_probs=True)
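
The returned probabilities can be post-processed to list the most likely SDGs for a text. The sketch below assumes that each text gets a sequence of 17 probabilities ordered from SDG 1 to SDG 17; this layout is an assumption and is not confirmed here, so check the Wiki for the actual return format. The helper `top_sdgs` is hypothetical and not part of SDGDetector.

```python
def top_sdgs(probabilities, k=3, threshold=0.0):
    """Return up to k (sdg_number, probability) pairs, highest probability first.

    Assumes `probabilities` is ordered SDG 1..17 (an assumption, see above).
    """
    ranked = sorted(
        enumerate(probabilities, start=1),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(sdg, p) for sdg, p in ranked[:k] if p >= threshold]


# Illustrative, made-up probabilities for a single text.
example_probs = [0.01] * 17
example_probs[8] = 0.7  # SDG 9
example_probs[7] = 0.2  # SDG 8

print(top_sdgs(example_probs, k=2))
```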

SDG_classifier_using_keywords_extraction

🔑: Some Sentence-Transformers models require authentication to access, especially models hosted on Hugging Face. To use these models, you need to create and set up a Hugging Face token.

# replace the placeholder string with your Hugging Face token
os.environ['HF_TOKEN'] = '<token>'
hf_token = os.getenv('HF_TOKEN')
if hf_token:
    # avoid printing the token itself, since it is a secret
    print("Hugging Face token is set.")
else:
    print("Hugging Face token is not set.")

mpnet = SDGDetector.SDG_classifier_using_keywords_extraction(top_keywords=5, diversity=0.3, n_gram_range=(1, 2), model_name='all-mpnet-base-v2')

keywords_mpnet = mpnet.find_top_keywords(text)

sdg,sdg_name,cosine_similarity,cosine_matrix = mpnet.predict(text,return_cs_matrix_and_avg_cs=True)

SDG_classifier

# replace the placeholder string with the path of the downloaded model
combo = SDGDetector.SDG_classifier(pretrained_model_name='XLNet', pretrained_model_path='<your path of downloaded model>',
                                sentence_model_name='all-mpnet-base-v2', top_keywords=10, diversity=0.3, n_gram_range=(1, 2))

sdg,sdg_name,association = combo.predict(text,return_association=True)

🤝 Contributing

Contributions are welcome! Please submit a pull request or open an issue.

📚 Cite

To cite this work, please use:

Ioanna Mandilara, Eleni Fotopoulou, Christina Maria Androna, Anastasios Zafeiropoulos, Symeon Papavassiliou. "Knowledge Graph Data Enrichment based on a Software Library for Text Mapping to the Sustainable Development Goals."

Zenodo repository: DOI

📬 Contact

For detailed information, or to express interest in participating in this initiative, you may contact:

  • 📧 Ioanna Mandilara - ioannamand (at) netmode (dot) ntua (dot) gr
  • 📧 Christina Maria Androna - andronaxm (at) netmode (dot) ntua (dot) gr
  • 📧 Eleni Fotopoulou - efotopoulou (at) netmode (dot) ntua (dot) gr
  • 📧 Anastasios Zafeiropoulos - tzafeir (at) cn (dot) ntua (dot) gr

📑 License

This project is licensed under the CC BY-NC 4.0 license.
