
Text-tagging project within Yandex x HSE StudCamp event

Project description

Studcamp Yandex x HSE

Text Tagging

During the Yandex x HSE machine learning StudCamp, my team and I developed a full module for the text-tagging problem, framed as keyword extraction.

We carried out extensive research and tried several extractive and abstractive methods, which we discuss below.

🚀 Demo

Streamlit

🧪 Preprocessing

  • Embedder Module: FastText/RuBERT embedding implementations
  • Normalizer Module: noun extraction, punctuation removal, and stopword removal (for extractive models)
  • Ranker Module: ranks the most significant words by pairwise distance in embedding space (max_distance_ranker) or by cosine similarity with the text embedding (text_sim_ranker)
  • Summarizer Module: summarizes the text with an MBart model
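As a rough illustration of the text_sim_ranker idea, here is a minimal sketch that ranks words by cosine similarity between each word's embedding and an overall text embedding (taken here as the mean of the word vectors). The toy 2-d vectors and the `rank_by_text_similarity` helper are illustrative assumptions, not the package's actual API.

```python
import numpy as np

def rank_by_text_similarity(word_vectors: dict, top_n: int) -> list:
    """Rank words by cosine similarity to the mean text embedding."""
    words = list(word_vectors)
    mat = np.stack([word_vectors[w] for w in words])  # (n_words, dim)
    text_vec = mat.mean(axis=0)                       # crude text embedding
    sims = mat @ text_vec / (
        np.linalg.norm(mat, axis=1) * np.linalg.norm(text_vec)
    )
    order = np.argsort(-sims)                         # descending similarity
    return [words[i] for i in order[:top_n]]

# Toy 2-d "embeddings": 'cat' and 'dog' point roughly the same way,
# 'stone' points off-axis and should rank last.
vecs = {
    "cat": np.array([1.0, 0.1]),
    "dog": np.array([0.9, 0.2]),
    "stone": np.array([-0.2, 1.0]),
}
print(rank_by_text_similarity(vecs, top_n=2))
```

In the real module the word vectors would come from the Embedder Module (FastText or RuBERT) and the text embedding from the full document rather than from the candidate words alone.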

🤖 Models

Extractive models

  • RakeBasedTagger: based on the well-known RAKE algorithm, which extracts meaningful words from text. It is very fast and suitable for online use. After extraction, the words are normalized and filtered: we keep only the top_n words with the largest distances between each candidate and the other meaningful words. The assumption is that keywords should be as far apart as possible so that they cover different domains of the text.
  • BartBasedTagger: a RuBERT-based model built on the assumption that the most significant words are those with the highest cosine similarity to the text. The pipeline first summarizes the text with an MBart model, then computes the cosine similarity between the text embedding and each word's RuBERT embedding and keeps the top-scoring words. This model is very slow and suited for offline use, since the text must be summarized before the main processing.
  • AttentionBasedTagger: a very interesting model. The algorithms above cannot catch bigram keywords, so we turned to the attention mechanism: compute the attention activation for every pair of tokens; the highest activations indicate words that are meaningful to each other. One complication is that MBart uses a BPE tokenizer, so some post-processing is required to reassemble interpretable keywords from subword tokens.
  • ClusterizationBasedTagger: an experimental extractive model. We run DBSCAN on the embeddings of words from the normalized text to obtain clusters of words with similar meaning. Each cluster centroid embedding is a potential keyword, which we map to the nearest FastText word embedding to recover an actual word.
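The "keywords should be far apart" filtering used by the RAKE-based tagger can be pictured as a greedy farthest-point selection over word embeddings. The sketch below is an illustrative reconstruction under that assumption; `select_distant_keywords` is not the library's actual function.

```python
import numpy as np

def select_distant_keywords(word_vectors: dict, top_n: int) -> list:
    """Greedily pick words whose embeddings are maximally far apart."""
    words = list(word_vectors)
    mat = np.stack([word_vectors[w] for w in words])
    # Seed with the word farthest from the centroid of all candidates.
    centroid = mat.mean(axis=0)
    chosen = [int(np.argmax(np.linalg.norm(mat - centroid, axis=1)))]
    while len(chosen) < min(top_n, len(words)):
        # For every candidate, distance to its nearest already-chosen word;
        # the candidate maximizing that distance is added next.
        dists = np.min(
            np.linalg.norm(mat[:, None, :] - mat[chosen][None, :, :], axis=-1),
            axis=1,
        )
        chosen.append(int(np.argmax(dists)))
    return [words[i] for i in chosen]
```

Greedy farthest-point selection is a standard heuristic for diversity; it does not guarantee the globally optimal spread, but it is fast and works well for small top_n.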

Abstractive models

  • RuT5Tagger: a model trained on a dataset aggregated from several sources, such as LiveJournal and Pikabu. Note that this model is abstractive, so it can generate keywords that are not present in the text. It would, however, need further training on a larger dataset to give good results.
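For illustration, keyword generation with a T5-style seq2seq model via Hugging Face transformers might look like the sketch below. The checkpoint path is a placeholder and `parse_keywords` is a hypothetical helper for splitting the model's comma-separated output; neither is the package's actual API.

```python
def parse_keywords(generated: str, top_n: int) -> list:
    """Split a comma/semicolon-separated generation into clean keywords."""
    parts = [p.strip() for chunk in generated.split(";") for p in chunk.split(",")]
    return [p for p in parts if p][:top_n]

def generate_tags(text: str, top_n: int = 5) -> list:
    # Deferred import: transformers and the model weights are heavy downloads.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    name = "path/to/rut5-keywords-checkpoint"  # placeholder, not a real checkpoint
    tokenizer = T5Tokenizer.from_pretrained(name)
    model = T5ForConditionalGeneration.from_pretrained(name)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=32)
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return parse_keywords(decoded, top_n)

print(parse_keywords("котики, собаки; природа", top_n=2))
```

Because the model is abstractive, the decoded string can contain words absent from the input text, which is exactly what distinguishes this tagger from the extractive ones above.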

🧐 Features

Here are some of the project's best features:

  • Online model: the RAKE-based model, at 10-20 it/sec (the fastest)
  • Offline models: the BART-based models with summarization or attention, at 1-5 it/sec (the slowest)

🛠️ Installation Steps:

1. Installation

pip install studcamp-yandex-hse

2. Download the Russian FastText embeddings and RuT5 weights from the links below and place them at the same level as your source .py file

FastText embeddings: https://fasttext.cc/docs/en/crawl-vectors.html
Weights: https://drive.google.com/file/d/1aqVtoNRX3xDokthxuBNFwfcXQfkKeAMa/view?usp=sharing

3. Import

from studcamp_yandex_hse.models import RakeBasedTagger, BartBasedTagger, AttentionBasedTagger, ClusterizationBasedTagger, RuT5Tagger

4. Init tagger

tagger = RakeBasedTagger()

5. Get tags

text = '...'
top_n = 5

tagger.extract(text, top_n)

Download files

Download the file for your platform.

Source Distribution: studcamp_yandex_hse-0.1.1.tar.gz (13.7 kB)

Built Distribution: studcamp_yandex_hse-0.1.1-py3-none-any.whl (23.8 kB)

