Text-tagging project within Yandex x HSE StudCamp event

Project description

Studcamp Yandex x HSE

Text Tagging

My team and I developed a complete module for the text-tagging problem, framed as keyword extraction.

We carried out extensive research and tried several extractive and abstractive methods, which are described below.

🚀 Demo

Streamlit

🧪 Preprocessing

  • Embedder Module: FastText/RuBERT embedding implementations
  • Normalizer Module: Noun extraction + punctuation removal + stopword removal (for extractive models)
  • Ranker Module: Ranks the most significant words by distance in embedding space (max_distance_ranker) and by cosine similarity with the text embedding (text_sim_ranker)
  • Summarizer Module: Summarizes the text with the MBart model
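
The two ranking strategies named above can be illustrated with a minimal, dependency-free sketch. It is not the package's implementation: the function names `text_sim_rank` and `max_distance_rank`, the toy vectors, and the choice to score each word by its mean cosine distance to the other candidates are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def text_sim_rank(word_vectors, text_vector, top_n):
    """Hypothetical ranker: keep the words most similar to the whole-text embedding."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine_similarity(word_vectors[word], text_vector),
        reverse=True,
    )
    return ranked[:top_n]


def max_distance_rank(word_vectors, top_n):
    """Hypothetical ranker: keep the words that lie farthest, on average, from the others."""
    words = list(word_vectors)

    def mean_distance(word):
        others = [w for w in words if w != word]
        if not others:
            return 0.0
        total = sum(
            1.0 - cosine_similarity(word_vectors[word], word_vectors[other])
            for other in others
        )
        return total / len(others)

    return sorted(words, key=mean_distance, reverse=True)[:top_n]


# Toy 2-D vectors standing in for FastText/RuBERT embeddings (assumption).
toy_vectors = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "rocket": [0.0, 1.0],
}

similar_to_text = text_sim_rank(toy_vectors, [1.0, 0.0], top_n=2)
most_isolated = max_distance_rank(toy_vectors, top_n=1)
```

With real embeddings the same idea would be vectorized, but the scoring logic stays the same: one ranker favours words close to the text as a whole, the other favours words that are spread apart.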

🤖 Models

Extractive models

  • RakeBasedTagger: A model based on the well-known RAKE algorithm, which extracts meaningful words from the text. It is very fast and can be used online. After extraction, the words are normalized and filtered, keeping only the top_n words with the largest distance between the query word and the other meaningful words. The underlying assumption is that keywords should be as far from each other as possible so that they represent different domains of the text.
  • BartBasedTagger: A RuBERT-based model which assumes that the most significant words of a text are those with the highest cosine similarity to the text itself. The pipeline first summarizes the text with the MBart model, then compares the text embedding with the embedding of each word in the text (both from RuBERT) and keeps the words with the highest cosine similarity. This model is very slow and is meant for offline use, because the text must be summarized before the main processing.
  • AttentionBasedTagger: We assumed that none of the algorithms above can capture bigram keywords, so we decided to use the attention mechanism. We compute the attention activation for every pair of tokens; the largest activations indicate pairs of words that are meaningful to each other. A further problem is that MBart uses a BPE tokenizer, so post-processing is needed to reconstruct interpretable keywords from subword tokens.
  • ClustrizationBasedTagger: Experimental extractive model. We run DBSCAN on the embeddings of words from the normalized text to get clusters of words with similar meaning. Each cluster's centroid embedding is a potential keyword, so we convert it to the nearest FastText word embedding.
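
The last step of the clustering approach (mapping each cluster centroid to its nearest vocabulary word) can be sketched as follows. This is a hedged illustration, not the package's code: the cluster labels are given by hand instead of being produced by DBSCAN, the helper names (`centroid`, `nearest_word`, `cluster_keywords`) and the toy vectors are hypothetical, and cosine similarity is assumed as the nearness measure. The label `-1` follows scikit-learn's convention for DBSCAN noise points.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def nearest_word(query, vocabulary):
    """Vocabulary word whose embedding is most similar to the query vector."""
    return max(vocabulary, key=lambda word: cosine_similarity(vocabulary[word], query))


def cluster_keywords(word_vectors, labels, vocabulary):
    """Turn each cluster into one keyword: centroid -> nearest vocabulary word."""
    clusters = {}
    for word, label in labels.items():
        if label == -1:  # noise point, belongs to no cluster
            continue
        clusters.setdefault(label, []).append(word_vectors[word])
    return [nearest_word(centroid(vectors), vocabulary)
            for _, vectors in sorted(clusters.items())]


# Toy data (assumption): 2-D vectors instead of real FastText embeddings.
word_vectors = {
    "cat": [1.0, 0.0], "kitten": [0.8, 0.2],
    "rocket": [0.0, 1.0], "shuttle": [0.2, 0.8],
    "the": [0.5, 0.5],
}
labels = {"cat": 0, "kitten": 0, "rocket": 1, "shuttle": 1, "the": -1}
vocabulary = {"animal": [0.9, 0.1], "spacecraft": [0.1, 0.9]}

keywords = cluster_keywords(word_vectors, labels, vocabulary)
```

Because the centroid is an average, the resulting keyword need not occur in the text itself; it is simply the vocabulary word closest to the cluster's centre.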

Abstractive models

  • RuT5Tagger: A model trained on a dataset aggregated from different sources such as LiveJournal ('Живой журнал'), Pikabu ('Пикабу'), etc. Because this model is abstractive, it can generate new keywords that are not present in the text. It still needs further training on a larger dataset to give good results.

🧐 Features

Here are some of the project's best features:

  • Online model: RAKE-based model at 10-20 it/sec (the fastest)
  • Offline models: BART-based models with summarization or attention, at 1-5 it/sec (the slowest)

🛠️ Installation Steps:

1. Installation

pip install studcamp-yandex-hse

2. Download the Russian FastText embeddings from the link below and place the file in the root of your repository

https://fasttext.cc/docs/en/crawl-vectors.html

3. Import

from studcamp_yandex_hse.models import RakeBasedTagger, BartBasedTagger, AttentionBasedTagger, ClustrizationBasedTagger

4. Initialize the tagger

tagger = RakeBasedTagger()

5. Get tags

text = '...'
top_n = 5

tagger.extract(text, top_n)
