# Subject Indexers

Automatically detect subject indices.
This repository provides two pipelines:

- A supervised pipeline for processing text and label files to train and evaluate an Omikuji model. It includes text lemmatization, TF-IDF feature extraction, and label binarization, and is designed for extreme multilabel classification.
- An unsupervised pipeline for processing text and extracting topic keywords. Multiword keyword detection can optionally be enabled with a pretrained PhraserModel, and spelling mistakes can be corrected automatically by enabling SpellCorrector.
## ⚙️ Installation Guide

### Preparing the Environment

1. **Set up your Python environment.** Ensure you have Python 3.10 or above installed.
2. **Install the required dependencies:**

   ```
   pip install -r requirements.txt
   ```
## 🔮 Supervised Omikuji Pipeline

### 🚀 Running the Pipeline
A sample code snippet to train and predict using the Omikuji model is provided below:

```python
from src.supervised.omikuji_model import OmikujiModel

model = OmikujiModel()
# model.load(".../teemamarksonad_est")  # Optionally load a pre-trained model and skip training

model.train(
    text_file="texts.txt",        # File with one document per line
    label_file="labels.txt",      # File with semicolon-separated labels for each document
    language="et",                # Language of the text, in ISO 639-1 format
    lemmatization_required=True,  # (Optional) Whether to lemmatize the text; set False only if text_file is already lemmatized
    max_features=20000,           # (Optional) Maximum number of features for TF-IDF extraction
    keep_train_file=False,        # (Optional) Whether to retain intermediate training files
    eval_split=0.1,               # (Optional) Proportion of the dataset used for evaluation
)

predictions = model.predict(
    text="Kui Arno isaga koolimajja jõudis",  # Text to classify
    top_k=3,                                  # Number of top predictions to return
)  # Output: [('koolimajad', 0.262), ('isad', 0.134), ('õpilased', 0.062)]
```
### 📂 Data Format

The files provided to the `train` function should be in the following format:

- A text file (`.txt`) where each line is a document:

  ```
  Document one content.
  Document two content.
  ```

- A label file (`.txt`) where each line contains semicolon-separated labels corresponding to the same line in the text file:

  ```
  label1;label2
  label3;label4
  ```
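Writing such a parallel corpus takes only a few lines of plain Python. The sketch below produces files matching the format described above; the file names and contents are illustrative only:

```python
# Write a minimal parallel corpus: line i of labels.txt holds the
# semicolon-separated labels for line i of texts.txt.
texts = [
    "Document one content.",
    "Document two content.",
]
labels = [
    ["label1", "label2"],
    ["label3", "label4"],
]

with open("texts.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(texts) + "\n")

with open("labels.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(";".join(row) for row in labels) + "\n")
```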
### 🛠 Components Overview

| Component | Description |
|---|---|
| `DataLoader` | Handles reading and preprocessing parallel text-label files. |
| `TfidfFeatureExtractor` | Extracts TF-IDF features from preprocessed text files. |
| `LabelBinarizer` | Encodes labels into a sparse binary matrix. |
| `TextPreprocessor` | Handles text preprocessing, including lemmatization. |
| `OmikujiModel` | Handles model training using Omikuji, a scalable extreme classification library. |
| `OmikujiHelpers` | Helper functions for Omikuji model training and evaluation. |
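The label binarization step can be illustrated with a small stand-alone sketch: each row of the resulting matrix marks which labels from the global vocabulary a document carries. This is a simplified dense version for illustration only; the actual `LabelBinarizer` produces a sparse matrix:

```python
def binarize(label_lines):
    # Build a sorted label vocabulary from semicolon-separated lines,
    # then emit one binary indicator row per document.
    rows = [line.split(";") for line in label_lines]
    vocab = sorted({label for row in rows for label in row})
    index = {label: i for i, label in enumerate(vocab)}
    matrix = []
    for row in rows:
        indicator = [0] * len(vocab)
        for label in row:
            indicator[index[label]] = 1
        matrix.append(indicator)
    return vocab, matrix

vocab, matrix = binarize(["label1;label2", "label3;label4"])
# vocab  -> ['label1', 'label2', 'label3', 'label4']
# matrix -> [[1, 1, 0, 0], [0, 0, 1, 1]]
```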
### 📝 Testing

Run the test suite:

```
python -m pytest -v tests
```
## ⛓️💥 Unsupervised RaKUn + Phraser Pipeline

### 🚀 Running the Pipeline
A sample code snippet to extract keywords from a text is provided below:

```python
from src.unsupervised.unsup_kw_extractor import KeywordExtractor

model = KeywordExtractor()  # Optionally provide model_artifacts_path to load a pre-trained model

predictions = model.predict(
    text="Kui Arno isaga ...",  # Text to extract keywords from
    lang_code="et",             # (Optional) Language of the text, in ISO 639-1 format; detected automatically if not provided
    top_n=10,                   # Number of top keywords to return
    merge_threshold=0.0,        # (Optional) Threshold for merging words into a single keyword; with 0.0 no words are merged
    use_phraser=True,           # (Optional) Whether to use a Phraser; available Phraser models must be defined in constants.py
    correct_spelling=True,      # (Optional) Whether to apply spell correction
    preserve_case=True,         # (Optional) Whether to preserve the original case
    max_uppercase=2,            # (Optional) Maximum number of uppercase letters in a word for spelling correction to still apply
    min_word_frequency=3,       # (Optional) Minimum in-text frequency of a word for it to be exempt from spelling correction
)  # Output: ['koolimaja']
```
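The interplay of `max_uppercase` and `min_word_frequency` can be pictured with a small stand-alone sketch. The gating logic below is a simplified assumption for illustration, not the library's actual code:

```python
from collections import Counter

def should_correct(word, text_counts, max_uppercase=2, min_word_frequency=3):
    # A word is a candidate for spelling correction only if it does not
    # look like an acronym (too many uppercase letters) and is rare in
    # the input text (frequent words are assumed to be spelled correctly).
    if sum(ch.isupper() for ch in word) > max_uppercase:
        return False
    if text_counts[word.lower()] >= min_word_frequency:
        return False
    return True

text = "NASA NASA kool kool kool koolimja"
counts = Counter(w.lower() for w in text.split())
print(should_correct("NASA", counts))      # False: too many uppercase letters
print(should_correct("kool", counts))      # False: frequent in the text
print(should_correct("koolimja", counts))  # True: rare, likely a typo
```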
A sample code snippet to train and predict using the Phraser model is provided below:

```python
from src.unsupervised.phraser_model import PhraserModel

model = PhraserModel()

model.train(
    train_data_path=".../train.txt",  # File with one document per line; text should be lemmatized
    lang_code="et",                   # Language of the text, in ISO 639-1 format
    min_count=5,                      # (Optional) Minimum word frequency for phrase formation
    threshold=10.0,                   # (Optional) Score threshold for forming phrases
)

predictions = model.predict(
    text="vabariik aastapäev sööma kiluvõileib",  # Lemmatized text for phrase detection
)  # Output: ['vabariik_aastapäev', 'sööma', 'kiluvõileib']
```
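To see how `min_count` and `threshold` interact, consider Gensim's default bigram scorer, which the Phraser relies on: score(a, b) = (count(a, b) − min_count) · vocab_size / (count(a) · count(b)), with the bigram merged when the score exceeds `threshold`. A toy computation (the counts are made up for illustration):

```python
def phrase_score(count_ab, count_a, count_b, vocab_size, min_count):
    # Default "original" scorer used by Gensim's Phrases:
    # (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Toy counts: "vabariik aastapäev" co-occurs often relative to each word alone,
# so its score comfortably clears a threshold of 10.0 and the bigram is merged.
score = phrase_score(count_ab=40, count_a=50, count_b=45, vocab_size=10_000, min_count=5)
print(score > 10.0)  # True
```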
### 📂 Data Format

The file provided to the PhraserModel `train` function should be a text file (`.txt`) where each line is a document:

```
Document one content.
Document two content.
```
### 🛠 Components Overview

| Component | Description |
|---|---|
| `KeywordExtractor` | Extracts topic keywords from text using unsupervised methods. Multi-word keywords can optionally be found using a pretrained PhraserModel; spelling mistakes can be corrected automatically using SpellCorrector. |
| `PhraserModel` | Handles Gensim Phraser model training and evaluation. |
| `SpellCorrector` | Handles spelling correction logic using SymSpell. |
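The SymSpell idea behind SpellCorrector is deletes-only candidate generation: index every dictionary word under all of its single-character deletions, then look up a misspelling and its own deletions to find dictionary words within edit distance 1. The sketch below is a simplified illustration, not the symspellpy API:

```python
def deletes(word):
    # All strings obtainable by deleting exactly one character.
    return {word[:i] + word[i + 1:] for i in range(len(word))}

def build_index(dictionary):
    # Map each word and each of its single-character deletes back to the word.
    index = {}
    for word in dictionary:
        for key in {word} | deletes(word):
            index.setdefault(key, set()).add(word)
    return index

def suggest(word, index):
    # Candidates within edit distance 1: look up the word and its deletes.
    candidates = set()
    for key in {word} | deletes(word):
        candidates |= index.get(key, set())
    return sorted(candidates)

index = build_index(["koolimaja", "kool", "maja"])
print(suggest("koolimja", index))  # ['koolimaja']
```

Because lookups touch only deletions (never insertions or substitutions), the index stays small and queries stay fast, which is what makes SymSpell practical for large dictionaries.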