
Text-tagging project within Yandex x HSE StudCamp event

Studcamp Yandex x HSE

Text Tagging

During the Yandex x HSE machine-learning StudCamp, my team and I developed a full module for the text-tagging problem, framed as keyword extraction.

We did extensive research and tried several extractive and abstractive methods, which we discuss below.

🚀 Demo

Streamlit Demo

🧪 Preprocessing

  • Embedder Module: FastText/RuBERT embedding implementations
  • Normalizer Module: noun extraction, punctuation removal, and stopword removal (for extractive models)
  • Ranker Module: ranks the most significant words by distance in embedding space (max_distance_ranker) or by cosine similarity with the text embedding (text_sim_ranker)
  • Summarizer Module: summarizes the text with an MBart model
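To make the two ranking strategies concrete, here is a minimal pure-Python sketch. The function names, toy 2-d vectors, and greedy selection details are illustrative assumptions, not the package's actual API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def text_sim_rank(word_embs, text_emb, top_n):
    """Rank words by cosine similarity to the whole-text embedding."""
    ranked = sorted(word_embs, key=lambda w: cosine(word_embs[w], text_emb), reverse=True)
    return ranked[:top_n]

def max_distance_rank(word_embs, top_n):
    """Greedily pick words whose embeddings are far apart, so the
    selected keywords cover different topics of the text."""
    words = list(word_embs)
    # Seed with the word farthest (in total) from all the others.
    selected = [max(words, key=lambda w: sum(dist(word_embs[w], word_embs[v]) for v in words))]
    while len(selected) < min(top_n, len(words)):
        rest = [w for w in words if w not in selected]
        # Take the candidate maximizing its minimum distance to the chosen words.
        nxt = max(rest, key=lambda w: min(dist(word_embs[w], word_embs[s]) for s in selected))
        selected.append(nxt)
    return selected

# Toy embeddings: two "animal" words near the origin, two "finance" words far away.
embs = {"cat": [0.2, 0.1], "dog": [0.1, 0.0], "market": [5.0, 5.0], "stock": [5.2, 5.0]}
print(text_sim_rank(embs, [5.0, 5.0], 2))  # words most aligned with the text embedding
print(max_distance_rank(embs, 2))          # one word from each distant cluster
```

Note the different goals: text_sim_rank favors words close to the overall topic, while max_distance_rank favors diversity among the keywords themselves.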

🤖 Models

Extractive models

  • RakeBasedTagger: A model based on the well-known RAKE algorithm, which extracts meaningful words from text. It is very fast and can be used online. After extracting meaningful words, we normalize them and filter, keeping only the top_n words with the largest distance between the query word and the other meaningful words. The underlying assumption is that keywords should be as far from each other as possible, so that they represent different domains of the text.
  • BartBasedTagger: A RuBERT-based model built on the assumption that the words most significant to the text are those with the highest cosine similarity to it. The pipeline first summarizes the text with an MBart model, then extracts the most significant words by cosine similarity between the text embedding and each word embedding, both obtained from RuBERT representations. This model is very slow and is intended for offline use, since the text must be summarized before the main processing.
  • AttentionBasedTagger: A model designed to catch bigram keywords, which none of the algorithms above can do. We use the attention mechanism: compute the attention activation for every pair of tokens, where the highest activations indicate words that are meaningful to each other. One complication is that MBart uses a BPE tokenizer, so some post-processing is needed to reconstruct interpretable keywords from subword tokens.
  • ClusterizationBasedTagger: An experimental extractive model. We run DBSCAN on the embeddings of words from the normalized text to get clusters of words with similar meaning. Each cluster's centroid embedding is a potential keyword, which we map to the nearest FastText word embedding.
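The BPE post-processing step mentioned for the AttentionBasedTagger can be sketched as follows. This assumes a SentencePiece-style vocabulary (as MBart uses), where a leading "▁" marks the start of a new word; the function name is a hypothetical illustration, not the package's API:

```python
def merge_bpe_tokens(tokens):
    """Merge SentencePiece-style BPE tokens back into readable words.
    A token starting with '▁' opens a new word; any other token
    continues the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith("▁") or not words:
            words.append(tok.lstrip("▁"))
        else:
            words[-1] += tok
    return words

# "▁learn" + "ing" and "▁key" + "word" + "s" are glued back together.
print(merge_bpe_tokens(["▁machine", "▁learn", "ing", "▁key", "word", "s"]))
```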

Abstractive models

  • RuT5Tagger: A model trained on a dataset aggregated from sources such as "Живой журнал" (LiveJournal), "Пикабу" (Pikabu), etc. It should be noted that this model is abstractive, so it can generate keywords that are not present in the text. It would also need further training on a larger dataset to give good results.

🧐 Features

Here are some of the project's best features:

  • Online model: the Rake-based tagger at 10-20 it/sec (the fastest)
  • Offline models: the BART-based taggers using summarization or attention, at 1-5 it/sec (the slowest)

🛠️ Installation Steps:

Please use Python 3.10.

1. Installation

pip install studcamp-yandex-hse

2. Download the Russian FastText embeddings and the RuT5 weights from the links below and place them at the same level as your source .py file

FastText embeddings: https://fasttext.cc/docs/en/crawl-vectors.html
Weights: https://drive.google.com/file/d/1aqVtoNRX3xDokthxuBNFwfcXQfkKeAMa/view?usp=sharing

3. Import

from studcamp_yandex_hse.models import RakeBasedTagger, BartBasedTagger, AttentionBasedTagger, ClusterizationBasedTagger, RuT5Tagger
from studcamp_yandex_hse.processing.embedder import FastTextEmbedder

4. Initialize FastTextEmbedder (the instance must be passed as an argument to the Rake- and clusterization-based taggers)

ft_emb_model = FastTextEmbedder()

5. Init Model

tagger = RakeBasedTagger(ft_emb_model)

6. Get tags

text = '...'
top_n = 5

tagger.extract(text, top_n)
