
Text-tagging project within Yandex x HSE StudCamp event

Studcamp Yandex x HSE

Text Tagging

During the Yandex x HSE machine-learning StudCamp, my team and I developed a full module for the text-tagging problem, framed as keyword extraction.

We did extensive research and tried several extractive and abstractive methods, which we discuss below.

🚀 Demo

Streamlit Demo

🧪 Preprocessing

  • Embedder Module: FastText/RuBERT embedding implementations
  • Normalizer Module: noun extraction, punctuation removal, and stopword removal (for extractive models)
  • Ranker Module: ranks the most significant words by distance in embedding space (max_distance_ranker) or by cosine similarity with the text embedding (text_sim_ranker)
  • Summarizer Module: summarizes the text with an MBart model
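To make the two ranking strategies concrete, here is a minimal pure-Python sketch. The function names, toy 2-d vectors, and greedy selection details are illustrative assumptions, not the package's actual API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def text_sim_rank(word_embs, text_emb, top_n):
    """Rank words by cosine similarity to the whole-text embedding."""
    ranked = sorted(word_embs, key=lambda w: cosine(word_embs[w], text_emb), reverse=True)
    return ranked[:top_n]

def max_distance_rank(word_embs, top_n):
    """Greedily pick words whose embeddings are far apart, so the
    selected keywords cover different topics of the text."""
    words = list(word_embs)
    # Seed with the word farthest (in total) from all the others.
    selected = [max(words, key=lambda w: sum(dist(word_embs[w], word_embs[v]) for v in words))]
    while len(selected) < min(top_n, len(words)):
        rest = [w for w in words if w not in selected]
        # Take the candidate maximizing its minimum distance to the chosen words.
        nxt = max(rest, key=lambda w: min(dist(word_embs[w], word_embs[s]) for s in selected))
        selected.append(nxt)
    return selected

# Toy embeddings: two "animal" words near the origin, two "finance" words far away.
embs = {"cat": [0.2, 0.1], "dog": [0.1, 0.0], "market": [5.0, 5.0], "stock": [5.2, 5.0]}
print(text_sim_rank(embs, [5.0, 5.0], 2))  # words most aligned with the text embedding
print(max_distance_rank(embs, 2))          # one word from each distant cluster
```

Note the different goals: text_sim_rank favors words close to the overall topic, while max_distance_rank favors diversity among the keywords themselves.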

🤖 Models

Extractive models

  • RakeBasedTagger: A model based on the well-known RAKE algorithm, which extracts meaningful words from text. It is very fast and can be used online. After extracting meaningful words, we normalize them and filter, keeping only the top_n words with the largest distance between the query word and the other meaningful words. The underlying assumption is that keywords should be as far from each other as possible, so that they represent different domains of the text.
  • BartBasedTagger: A RuBERT-based model built on the assumption that the words most significant to the text are those with the highest cosine similarity to it. The pipeline first summarizes the text with an MBart model, then extracts the most significant words by cosine similarity between the text embedding and each word embedding, both obtained from RuBERT representations. This model is very slow and is intended for offline use, since the text must be summarized before the main processing.
  • AttentionBasedTagger: A model designed to catch bigram keywords, which none of the algorithms above can do. We use the attention mechanism: compute the attention activation for every pair of tokens, where the highest activations indicate words that are meaningful to each other. One complication is that MBart uses a BPE tokenizer, so some post-processing is needed to reconstruct interpretable keywords from subword tokens.
  • ClusterizationBasedTagger: An experimental extractive model. We run DBSCAN on the embeddings of words from the normalized text to get clusters of words with similar meaning. Each cluster's centroid embedding is a potential keyword, which we map to the nearest FastText word embedding.
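The BPE post-processing step mentioned for the AttentionBasedTagger can be sketched as follows. This assumes a SentencePiece-style vocabulary (as MBart uses), where a leading "▁" marks the start of a new word; the function name is a hypothetical illustration, not the package's API:

```python
def merge_bpe_tokens(tokens):
    """Merge SentencePiece-style BPE tokens back into readable words.
    A token starting with '▁' opens a new word; any other token
    continues the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith("▁") or not words:
            words.append(tok.lstrip("▁"))
        else:
            words[-1] += tok
    return words

# "▁learn" + "ing" and "▁key" + "word" + "s" are glued back together.
print(merge_bpe_tokens(["▁machine", "▁learn", "ing", "▁key", "word", "s"]))
```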

Abstractive models

  • RuT5Tagger: A model trained on a dataset aggregated from sources such as "Живой журнал" (LiveJournal), "Пикабу" (Pikabu), etc. It should be noted that this model is abstractive, so it can generate keywords that are not present in the text. It would also need further training on a larger dataset to give good results.

🧐 Features

Here are some of the project's best features:

  • Online model: the Rake-based tagger at 10-20 it/sec (the fastest)
  • Offline models: the BART-based taggers using summarization or attention, at 1-5 it/sec (the slowest)

🛠️ Installation Steps:

Please use Python 3.10.

1. Installation

pip install studcamp-yandex-hse

2. Download the Russian FastText embeddings and the RuT5 weights from the links below and place them at the same level as your source .py file

FastText embeddings: https://fasttext.cc/docs/en/crawl-vectors.html
Weights: https://drive.google.com/file/d/1aqVtoNRX3xDokthxuBNFwfcXQfkKeAMa/view?usp=sharing

3. Import

from studcamp_yandex_hse.models import RakeBasedTagger, BartBasedTagger, AttentionBasedTagger, ClusterizationBasedTagger, RuT5Tagger
from studcamp_yandex_hse.processing.embedder import FastTextEmbedder

4. Initialize FastTextEmbedder (the instance must be passed as an argument to the Rake- and clusterization-based taggers)

ft_emb_model = FastTextEmbedder()

5. Init Model

tagger = RakeBasedTagger(ft_emb_model)

6. Get tags

text = '...'
top_n = 5

tagger.extract(text, top_n)
