Subject Indexers

Automatically detect subject indices.

This repository provides two pipelines:
- A supervised pipeline for processing text and label files in order to train and evaluate an Omikuji model. It includes text lemmatization, TF-IDF feature extraction, and label binarization, and is designed for extreme multilabel classification.
- An unsupervised pipeline for processing text and extracting topic keywords. Multiword keyword detection can optionally be enabled with a pretrained PhraserModel, and spelling mistakes can be corrected automatically by enabling SpellCorrector.
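As a rough sketch, the supervised pipeline's preprocessing steps (TF-IDF feature extraction and label binarization) correspond to standard scikit-learn components. This is an illustrative toy with made-up data, not the library's internal code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Toy corpus: one document per entry, a list of labels per document
docs = ["Document one content.", "Document two content."]
labels = [["label1", "label2"], ["label3", "label4"]]

tfidf = TfidfVectorizer(max_features=20000)  # cap the vocabulary size
X = tfidf.fit_transform(docs)                # sparse TF-IDF document-term matrix

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                # binary indicator matrix of labels

print(X.shape)  # (2, 4): 2 docs, vocabulary {content, document, one, two}
print(Y.shape)  # (2, 4): 2 docs, 4 distinct labels
```

The extreme-classification model is then trained on the pair (X, Y); Omikuji handles the very large label spaces that make this "extreme" multilabel classification.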
⚙️ Installation Guide
Preparing the Environment

- Set Up Your Python Environment
  Ensure you have Python 3.10 or above installed.
- Install Required Dependencies
  Install the required dependencies using:

  ```shell
  pip install -r requirements.txt
  ```

Installation via PyPI

- Install the Package
  You can install the package using:

  ```shell
  pip install rara-subject-indexer
  ```
📝 Testing
Run the test suite:

```shell
python -m pytest -v tests
```
📚 Documentation
Main Classes
The rara-subject-indexer library organizes subject indexing into a few key classes. At its core are the abstract BaseIndexer and two concrete indexers: OmikujiIndexer for supervised keyword extraction and RakunIndexer for unsupervised keyword extraction.
The OmikujiIndexer class uses pre-trained Omikuji models, which can be downloaded with the Downloader utility class.
🔍 Downloader Class
Overview
The Downloader class downloads pretrained models and other relevant data from Google Drive. It accepts a shareable URL or file ID and automatically extracts zip archives after downloading.
Key Function
`download()`

- Purpose: Downloads the file from Google Drive and extracts it if it is a zip archive.
- Usage: Call this method on an instance of the `Downloader` class to perform the download and extraction.
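For illustration, here is how a file ID can be pulled out of the kind of Google Drive share URL that Downloader accepts. `extract_drive_file_id` is a hypothetical helper written for this sketch, not part of the library:

```python
import re

def extract_drive_file_id(url_or_id: str) -> str:
    """Return the file ID from a Google Drive share URL, or the input unchanged."""
    match = re.search(r"/file/d/([^/?]+)", url_or_id)
    return match.group(1) if match else url_or_id

url = "https://drive.google.com/file/d/EXAMPLE_FILE_ID/view?usp=drive_link"
print(extract_drive_file_id(url))  # EXAMPLE_FILE_ID
```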
Example Usage
GDrive URLs of the pretrained Omikuji models are provided in the repository.

```python
from rara_subject_indexer.utils.downloader import Downloader

drive_url = "https://drive.google.com/file/d/EXAMPLE_FILE_ID/view?usp=drive_link"
downloader = Downloader(drive_url, output_dir="/path/to/save/downloads")
downloader.download()
```
🔍 BaseIndexer Class
Overview
BaseIndexer serves as the common parent for all indexers. It defines basic configuration parameters (such as language and the number of keywords to extract) and provides the interface for keyword extraction. Subclasses must implement the find_keywords() method.
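The interface can be pictured roughly as the following abstract base class with a toy subclass. This is an illustrative sketch of the contract described above, not the library's actual source:

```python
from abc import ABC, abstractmethod
from typing import Dict, List

class BaseIndexer(ABC):
    """Sketch of the common indexer interface (illustrative, not library code)."""

    def __init__(self, config: dict):
        self.language = config["language"]  # e.g. "et" or "en"
        self.top_k = config["top_k"]        # number of keywords to extract

    @abstractmethod
    def find_keywords(self, text: str) -> List[Dict]:
        """Return keyword dicts with "keyword", "entity_type" and "score" keys."""

class DummyIndexer(BaseIndexer):
    """Toy subclass: treats the first top_k words as 'keywords'."""

    def find_keywords(self, text: str) -> List[Dict]:
        words = text.split()[: self.top_k]
        return [{"keyword": w, "entity_type": "demo", "score": 1.0} for w in words]

indexer = DummyIndexer({"language": "en", "top_k": 2})
print(indexer.find_keywords("keyword extraction example"))
```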
Parameters
| Name | Type | Optional | Default | Description |
|---|---|---|---|---|
| config | dict | False | None | Base configuration dictionary with keys like language (e.g., "et" or "en") and top_k (number of keywords to extract). |
Key Functions
- `find_keywords(text: str) -> List[Dict]`
  Abstract method for finding or extracting keywords from the input text.
  Returns: a list of dictionaries representing keyword results (e.g., each with keys `"keyword"`, `"entity_type"`, and `"score"`).
🔍 OmikujiIndexer Class
Overview
OmikujiIndexer is a supervised indexer that leverages an Omikuji model for keyword prediction. During initialization, it loads a pre-trained model (via a specified model path) and validates that the model’s language matches the indexer configuration.
Config Parameters
| Name | Type | Optional | Default | Description |
|---|---|---|---|---|
| language | str | False | None | Language of the input text (e.g., "et" or "en"). |
| top_k | int | False | None | Number of keywords to extract. |
| model_path | str | False | None | Path to the Omikuji model file. |
Key Functions
- `find_keywords(text: str) -> List[Dict]`
  Uses the loaded Omikuji model to predict keywords for the provided text.
  Returns: a list of dictionaries containing `"keyword"`, `"entity_type"`, and `"score"`.
Usage Example
```python
from rara_subject_indexer.indexers.omikuji_indexer import OmikujiIndexer

config = {
    "language": "en",
    "top_k": 5,
    "model_path": "/path/to/omikuji_model"  # Use Downloader to download a model
}

indexer = OmikujiIndexer(config)
keywords = indexer.find_keywords("Sample input text for keyword extraction.")
print(keywords)
```
🔍 RakunIndexer Class
Overview
RakunIndexer provides unsupervised keyword extraction using Rakun’s internal extraction logic. It does not require a separate model file since the extractor is part of the library. The default entity type for keywords is set to "Teemamärksõnad".
Config Parameters
| Name | Type | Optional | Default | Description |
|---|---|---|---|---|
| language | str | False | None | Language of the input text (e.g., "et" or "en"). |
| top_k | int | False | None | Number of keywords to extract. |
| merge_threshold | float | True | 0.0 | Threshold for merging similar keywords. |
| use_phraser | bool | True | False | Whether to use a Phraser model for multi-word keyword detection. |
| correct_spelling | bool | True | False | Whether to correct spelling mistakes in the input text. |
| preserve_case | bool | True | True | Whether to preserve the case of extracted keywords. |
| max_uppercase | int | True | 2 | Maximum number of uppercase characters in a keyword. |
| min_word_frequency | int | True | 3 | Minimum word frequency for keyword extraction. |
Key Functions
- `find_keywords(text: str) -> List[Dict]`
  Uses Rakun-based unsupervised extraction to predict keywords from the input text.
  Returns: a list of dictionaries where each dictionary contains `"keyword"`, `"entity_type"`, and `"score"`.
Usage Example
```python
from rara_subject_indexer.indexers.rakun_indexer import RakunIndexer

config = {
    "language": "et",
    "top_k": 5,
    "merge_threshold": 0.0,     # Optional
    "use_phraser": False,       # Optional
    "correct_spelling": False,  # Optional
    "preserve_case": True,      # Optional
    "max_uppercase": 2,         # Optional
    "min_word_frequency": 3     # Optional
}

indexer = RakunIndexer(config)
keywords = indexer.find_keywords("Sample input text for keyword extraction.")
print(keywords)
```
Training Supervised and Unsupervised Models
If necessary, you can train the supervised and unsupervised models from scratch using the provided pipelines. The training process involves reading text and label files, preprocessing the text, and training the models using the extracted features.
Training an Omikuji Model for Supervised Keyword Extraction
A sample code snippet to train and predict using the Omikuji model is provided below:

```python
from rara_subject_indexer.supervised.omikuji_model import OmikujiModel

model = OmikujiModel()

model.train(
    text_file="texts.txt",         # File with one document per line
    label_file="labels.txt",       # File with semicolon-separated labels for each document
    language="et",                 # Language of the text, in ISO 639-1 format
    entity_type="Teemamärksõnad",  # Entity type for the keywords
    lemmatization_required=True,   # (Optional) Whether to lemmatize the text; set False only if text_file is already lemmatized
    max_features=20000,            # (Optional) Maximum number of features for TF-IDF extraction
    keep_train_file=False,         # (Optional) Whether to retain intermediate training files
    eval_split=0.1                 # (Optional) Proportion of the dataset used for evaluation
)

predictions = model.predict(
    text="Kui Arno isaga koolimajja jõudis",  # Text to classify
    top_k=3                                   # Number of top predictions to return
)  # Output: [('koolimajad', 0.262), ('isad', 0.134), ('õpilased', 0.062)]
```
📂 Data Format
The files provided to the train function should be in the following format:

- A text file (`.txt`) where each line is a document:

  ```
  Document one content.
  Document two content.
  ```

- A label file (`.txt`) where each line contains semicolon-separated labels corresponding to the text file:

  ```
  label1;label2
  label3;label4
  ```
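Producing such parallel files from in-memory data can be sketched as follows (the file names match the training example above; the data is made up):

```python
from pathlib import Path

texts = ["Document one content.", "Document two content."]
labels = [["label1", "label2"], ["label3", "label4"]]

# One document per line; line i of labels.txt holds the labels for line i of texts.txt
Path("texts.txt").write_text("\n".join(texts) + "\n", encoding="utf-8")
Path("labels.txt").write_text(
    "\n".join(";".join(row) for row in labels) + "\n", encoding="utf-8"
)

print(Path("labels.txt").read_text(encoding="utf-8"))  # label1;label2 / label3;label4
```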
🛠 Components Overview
| Component | Description |
|---|---|
| `DataLoader` | Handles reading and preprocessing parallel text-label files. |
| `TfidfFeatureExtractor` | Extracts TF-IDF features from preprocessed text files. |
| `LabelBinarizer` | Encodes labels into a sparse binary matrix. |
| `TextPreprocessor` | Handles text preprocessing, including lemmatization. |
| `OmikujiModel` | Handles model training using Omikuji, a scalable extreme classification library. |
| `OmikujiHelpers` | Helper functions for Omikuji model training and evaluation. |
Training Phraser for Unsupervised Keyword Extraction
A sample code snippet to train and predict using the Phraser model is provided below:

```python
from rara_subject_indexer.unsupervised.phraser_model import PhraserModel

model = PhraserModel()

model.train(
    train_data_path=".../train.txt",  # File with one lemmatised document per line
    lang_code="et",                   # Language of the text, in ISO 639-1 format
    min_count=5,                      # (Optional) Minimum word frequency for phrase formation
    threshold=10.0                    # (Optional) Score threshold for forming phrases
)

predictions = model.predict(
    text="vabariik aastapäev sööma kiluvõileib"  # Lemmatised text for phrase detection
)  # Output: ['vabariik_aastapäev', 'sööma', 'kiluvõileib']
```
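Under the hood, a Gensim Phrases model scores candidate bigrams and merges those scoring above `threshold`. Gensim's default scorer can be sketched in plain Python on a toy corpus (a simplified illustration of how `min_count` and `threshold` interact, not the library's code):

```python
from collections import Counter

# Lemmatised toy corpus; "vabariik aastapäev" co-occurs consistently
corpus = [
    ["vabariik", "aastapäev", "sööma", "kiluvõileib"],
    ["vabariik", "aastapäev", "pidama", "kõne"],
    ["riik", "pidama", "vabariik", "aastapäev"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def phrase_score(a: str, b: str, min_count: int = 1) -> float:
    """Gensim's default scorer: (count(ab) - min_count) * |vocab| / (count(a) * count(b))."""
    return (bigrams[(a, b)] - min_count) * len(unigrams) / (unigrams[a] * unigrams[b])

# The frequent pair scores well above the rare one; bigrams scoring
# above `threshold` would be joined into "vabariik_aastapäev".
print(phrase_score("vabariik", "aastapäev"))  # 14/9 ≈ 1.56
print(phrase_score("sööma", "kiluvõileib"))   # 0.0
```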
📂 Data Format
The file provided to the PhraserModel train function should be in the following format:

- A text file (`.txt`) where each line is a document:

  ```
  Document one content.
  Document two content.
  ```
🛠 Components Overview
| Component | Description |
|---|---|
| `KeywordExtractor` | Extracts topic keywords from the text using unsupervised methods. Optionally, multi-word keywords can be found using a pretrained `PhraserModel`, and spelling mistakes can be automatically corrected using `SpellCorrector`. |
| `PhraserModel` | Handles Gensim Phraser model training and evaluation. |
| `SpellCorrector` | Handles spelling correction logic using SymSpell. |
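SymSpell's key idea, generating delete-only variants of each word so dictionary lookups stay fast, can be sketched as follows (a toy illustration of the algorithm, not the SpellCorrector API):

```python
def deletes(word: str, max_edits: int = 1) -> set:
    """All variants of `word` with up to `max_edits` characters deleted."""
    result = {word}
    frontier = {word}
    for _ in range(max_edits):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        result |= frontier
    return result

# Precompute delete-variants for every dictionary word
dictionary = {"koolimaja"}
index = {d: w for w in dictionary for d in deletes(w)}

def correct(word: str) -> str:
    """Return a dictionary word whose delete-set overlaps the input's, else the input."""
    for d in deletes(word):
        if d in index:
            return index[d]
    return word

print(correct("koolimja"))   # deletion typo resolves to "koolimaja"
print(correct("koolimaja"))  # already correct, returned unchanged
```

The real SymSpell additionally ranks candidates by word frequency and supports larger edit distances; this sketch only shows why the deletes-only index makes lookup cheap.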
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file rara_subject_indexer-0.0.4.tar.gz.
File metadata
- Download URL: rara_subject_indexer-0.0.4.tar.gz
- Upload date:
- Size: 11.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `573b2368bb333598ee2d9be5f2560806c18ea0ace04e1754823d57588745070b` |
| MD5 | `981c90d460d32449d36f1cce4ecd7860` |
| BLAKE2b-256 | `9497971ac531d8dd9c7f41f07ac17445f7a56e9520f35b02d4eb5868db9b420e` |
File details
Details for the file rara_subject_indexer-0.0.4-py3-none-any.whl.
File metadata
- Download URL: rara_subject_indexer-0.0.4-py3-none-any.whl
- Upload date:
- Size: 11.2 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `7f319aa2478571aa5054b0009f362764eb482edb4046358c1f7b79720cfece63` |
| MD5 | `25a40de8f1967084c6a31a1d233c1169` |
| BLAKE2b-256 | `cfa249199029932912f60397ecd3fc4c53a105dc1c64c886a3e740587830ee4f` |