Automatically detect subject indices.

Subject Indexers

This repository provides two pipelines:

- A supervised pipeline for processing text and label files in order to train and evaluate an Omikuji model. It includes text lemmatization, TF-IDF feature extraction, and label binarization. The system is designed for extreme multilabel classification.
- An unsupervised pipeline for processing text and extracting topic keywords. Optionally, multi-word keyword detection can be enabled by using a pretrained PhraserModel, and spelling mistakes can be automatically corrected by enabling SpellCorrector.
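
To make the supervised feature steps concrete, here is a minimal standard-library sketch of TF-IDF weighting and label binarization on toy data. It is not the package's `TfidfFeatureExtractor` or `LabelBinarizer`; the weighting formula (`tf * log(N / df)`), the variable names, and the sample documents are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Toy, already-lemmatized documents and their labels (illustrative data only).
docs = [
    ["kool", "õpilane", "õpetaja"],
    ["kool", "maja"],
    ["isa", "maja", "maja"],
]
labels = [["koolid", "õpilased"], ["koolid"], ["isad"]]

# Vocabulary and document frequencies (in how many documents each token occurs).
vocab = sorted({tok for doc in docs for tok in doc})
df = Counter(tok for doc in docs for tok in set(doc))
n_docs = len(docs)


def tfidf_row(doc):
    """Return one dense TF-IDF row using tf * log(N / df)."""
    tf = Counter(doc)
    return [tf[tok] * math.log(n_docs / df[tok]) for tok in vocab]


features = [tfidf_row(doc) for doc in docs]

# Label binarization: one column per distinct label, 1 if the document carries it.
label_names = sorted({lab for labs in labels for lab in labs})
label_matrix = [[1 if name in labs else 0 for name in label_names] for labs in labels]

print(label_names)
print(label_matrix)
```

The real pipeline stores both matrices in sparse form, which matters once the label set grows to the sizes typical for extreme multilabel classification.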

⚙️ Installation Guide

Preparing the Environment

Set Up Your Python Environment
Ensure you have Python 3.10 or above installed.
Install Required Dependencies
Install the required dependencies using:
```
pip install -r requirements.txt
```

Installation via PyPI

Install the Package
You can install the package using:
```
pip install rara-subject-indexer
```
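
After installing, a quick check like the one below can confirm that the interpreter meets the Python 3.10 requirement and that the package is visible to it. This is only a sketch: the helper name `installed_version` is made up for this example, and it relies solely on the standard library's `importlib.metadata`.

```python
import sys
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


python_ok = sys.version_info >= (3, 10)
version = installed_version("rara-subject-indexer")
print(f"Python >= 3.10: {python_ok}; rara-subject-indexer: {version or 'not installed'}")
```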

📝 Testing

Run the test suite:

```
python -m pytest -v tests
```

📚 Documentation

Main Classes

The rara-subject-indexer library organizes subject indexing into a few key classes. At its core are the abstract BaseIndexer and two concrete indexers: OmikujiIndexer for supervised keyword extraction and RakunIndexer for unsupervised keyword extraction. The OmikujiIndexer class uses pre-trained Omikuji models, which can be downloaded with the Downloader utility class.

🔍 Downloader Class

Overview

The Downloader class downloads pretrained models and other relevant data from Google Drive. It accepts a shareable URL or file ID and automatically extracts zip archives after downloading.

Key Function

download()

Purpose: Downloads the file from Google Drive and extracts it if it is a zip archive.
Usage: Call this method on an instance of the Downloader class to perform the download and extraction.

Example Usage

```
from rara_subject_indexer.utils.downloader import Downloader

drive_url = "https://drive.google.com/file/d/EXAMPLE_FILE_ID/view?usp=drive_link"
downloader = Downloader(drive_url, output_dir="/path/to/save/downloads")
downloader.download()
```
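
The two behaviours described for `Downloader` (accepting either a shareable URL or a bare file ID, and unpacking zip archives) can be sketched with the standard library alone. The helpers `extract_file_id` and `extract_if_zip` below are hypothetical and are not part of the package; the URL patterns are an assumption about common Google Drive link shapes, and the demo archive is created locally so no network access is needed.

```python
import re
import tempfile
import zipfile
from pathlib import Path


def extract_file_id(url_or_id: str) -> str:
    """Return the file ID from a shareable URL, or the input unchanged if no URL pattern matches."""
    match = re.search(r"/d/([A-Za-z0-9_-]+)", url_or_id) or re.search(
        r"[?&]id=([A-Za-z0-9_-]+)", url_or_id
    )
    return match.group(1) if match else url_or_id


def extract_if_zip(path: Path, output_dir: Path) -> list[str]:
    """Extract a zip archive into output_dir and return its member names; return [] otherwise."""
    if not zipfile.is_zipfile(path):
        return []
    with zipfile.ZipFile(path) as archive:
        archive.extractall(output_dir)
        return archive.namelist()


file_id = extract_file_id("https://drive.google.com/file/d/EXAMPLE_FILE_ID/view?usp=drive_link")

# Demonstrate extraction on a small archive built in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    archive_path = tmp_dir / "model.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("model/settings.json", "{}")
    extracted = extract_if_zip(archive_path, tmp_dir / "out")
    extracted_exists = (tmp_dir / "out" / "model" / "settings.json").exists()

print(file_id, extracted)
```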

🔍 BaseIndexer Class

Overview

BaseIndexer serves as the common parent for all indexers. It defines basic configuration parameters (such as language and the number of keywords to extract) and provides the interface for keyword extraction. Subclasses must implement the find_keywords() method.

Parameters

| Name | Type | Optional | Default | Description |
| --- | --- | --- | --- | --- |
| config | dict | False | None | Base configuration dictionary with keys like `language` (e.g., `"et"` or `"en"`) and `top_k` (number of keywords to extract). |

Key Functions

find_keywords(text: str) -> List[Dict]

Abstract method for finding or extracting keywords from the input text.
Returns: List of dictionaries representing keyword results (e.g., each with keys "keyword", "entity_type", and "score").
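
The relationship between the abstract parent and a concrete indexer can be illustrated with a small sketch. `SketchBaseIndexer` and `WordCountIndexer` are hypothetical stand-ins written for this example, not the library's classes; only the `find_keywords` signature, the `language` and `top_k` config keys, and the result keys are taken from the description above.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SketchBaseIndexer(ABC):
    """Hypothetical stand-in for BaseIndexer: stores shared config and defines the interface."""

    def __init__(self, config: dict):
        self.language = config["language"]
        self.top_k = config["top_k"]

    @abstractmethod
    def find_keywords(self, text: str) -> List[Dict]:
        """Return keyword dicts with "keyword", "entity_type", and "score"."""


class WordCountIndexer(SketchBaseIndexer):
    """Toy subclass: scores each word by its relative frequency in the text."""

    def find_keywords(self, text: str) -> List[Dict]:
        words = text.lower().split()
        counts: Dict[str, int] = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        # Sort by descending count, then alphabetically to keep the order deterministic.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        return [
            {"keyword": word, "entity_type": "Teemamärksõnad", "score": count / len(words)}
            for word, count in ranked[: self.top_k]
        ]


indexer = WordCountIndexer({"language": "en", "top_k": 2})
results = indexer.find_keywords("school teacher school pupil")
print(results)
```

Because `find_keywords` is declared abstract, instantiating the base class directly raises a `TypeError`, which is what forces every subclass to supply its own extraction logic.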

🔍 OmikujiIndexer Class

Overview

OmikujiIndexer is a supervised indexer that leverages an Omikuji model for keyword prediction. During initialization, it loads a pre-trained model (via a specified model path) and validates that the model’s language matches the indexer configuration.

Config Parameters

| Name | Type | Optional | Default | Description |
| --- | --- | --- | --- | --- |
| language | str | False | None | Language of the input text (e.g., `"et"` or `"en"`). |
| top_k | int | False | None | Number of keywords to extract. |
| model_path | str | False | None | Path to the Omikuji model file. |

Key Functions

find_keywords(text: str) -> List[Dict]

Uses the loaded Omikuji model to predict keywords for the provided text.
Returns: A list of dictionaries containing "keyword", "entity_type", and "score".

Usage Example

```
from rara_subject_indexer.indexers.omikuji_indexer import OmikujiIndexer

config = {
    "language": "en",
    "top_k": 5,
    "model_path": "/path/to/omikuji_model"  # Use Downloader to download a model
}

indexer = OmikujiIndexer(config)
keywords = indexer.find_keywords("Sample input text for keyword extraction.")
print(keywords)
```
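
Since every indexer returns a list of dictionaries with "keyword", "entity_type", and "score", results can be post-processed with plain Python. The sketch below filters by a minimum score; the sample list is hand-written in the documented shape rather than real model output, and `filter_by_score` is a hypothetical helper, not part of the library.

```python
from typing import Dict, List

# Hand-written sample in the documented result shape (not real model output).
sample_results: List[Dict] = [
    {"keyword": "isad", "entity_type": "Teemamärksõnad", "score": 0.134},
    {"keyword": "koolimajad", "entity_type": "Teemamärksõnad", "score": 0.262},
    {"keyword": "õpilased", "entity_type": "Teemamärksõnad", "score": 0.062},
]


def filter_by_score(results: List[Dict], min_score: float) -> List[Dict]:
    """Keep results scoring at least min_score, ordered from highest to lowest score."""
    kept = [result for result in results if result["score"] >= min_score]
    return sorted(kept, key=lambda result: result["score"], reverse=True)


confident = filter_by_score(sample_results, min_score=0.1)
confident_keywords = [result["keyword"] for result in confident]
print(confident_keywords)
```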

🔍 RakunIndexer Class

Overview

RakunIndexer provides unsupervised keyword extraction using Rakun’s internal extraction logic. It does not require a separate model file since the extractor is part of the library. The default entity type for keywords is set to "Teemamärksõnad".

Config Parameters

| Name | Type | Optional | Default | Description |
| --- | --- | --- | --- | --- |
| language | str | False | None | Language of the input text (e.g., `"et"` or `"en"`). |
| top_k | int | False | None | Number of keywords to extract. |
| merge_threshold | float | True | 0.0 | Threshold for merging similar keywords. |
| use_phraser | bool | True | False | Whether to use a Phraser model for multi-word keyword detection. |
| correct_spelling | bool | True | False | Whether to correct spelling mistakes in the input text. |
| preserve_case | bool | True | True | Whether to preserve the case of extracted keywords. |
| max_uppercase | int | True | 2 | Maximum number of uppercase characters in a keyword. |
| min_word_frequency | int | True | 3 | Minimum word frequency for keyword extraction. |
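
One way to read the table: the two required keys must always be supplied, and every optional key falls back to its listed default. The sketch below expresses that rule in plain Python. It is an illustration only; `RAKUN_DEFAULTS`, `REQUIRED_KEYS`, and `resolve_config` are hypothetical names, and how `RakunIndexer` actually resolves its configuration is not shown in this README.

```python
# Defaults copied from the parameter table above.
RAKUN_DEFAULTS = {
    "merge_threshold": 0.0,
    "use_phraser": False,
    "correct_spelling": False,
    "preserve_case": True,
    "max_uppercase": 2,
    "min_word_frequency": 3,
}
REQUIRED_KEYS = ("language", "top_k")


def resolve_config(user_config: dict) -> dict:
    """Check required keys and fill in documented defaults for any missing optional keys."""
    missing = [key for key in REQUIRED_KEYS if key not in user_config]
    if missing:
        raise ValueError(f"Missing required config keys: {', '.join(missing)}")
    # Later entries win, so user-supplied values override the defaults.
    return {**RAKUN_DEFAULTS, **user_config}


resolved = resolve_config({"language": "et", "top_k": 5, "use_phraser": True})
print(resolved)
```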

Key Functions

find_keywords(text: str) -> List[Dict]

Uses Rakun-based unsupervised extraction to predict keywords from the input text.
Returns: A list of dictionaries where each dictionary contains "keyword", "entity_type", and "score".

Usage Example

```
from rara_subject_indexer.indexers.rakun_indexer import RakunIndexer

config = {
    "language": "et",
    "top_k": 5,
    "merge_threshold": 0.0,      # Optional
    "use_phraser": False,        # Optional
    "correct_spelling": False,   # Optional
    "preserve_case": True,       # Optional
    "max_uppercase": 2,          # Optional
    "min_word_frequency": 3      # Optional
}

indexer = RakunIndexer(config)
keywords = indexer.find_keywords("Sample input text for keyword extraction.")
print(keywords)
```

Training Supervised and Unsupervised Models

If necessary, you can train the supervised and unsupervised models from scratch using the provided pipelines. The training process involves reading text and label files, preprocessing the text, and training the models using the extracted features.

Training an Omikuji Model for Supervised Keyword Extraction

A sample code snippet to train and predict using the Omikuji model is provided below:

```
from rara_subject_indexer.supervised.omikuji_model import OmikujiModel

model = OmikujiModel()

model.train(
    text_file="texts.txt",  # File with one document per line
    label_file="labels.txt",  # File with semicolon-separated labels for each document
    language="et",  # Language of the text, in ISO 639-1 format
    entity_type="Teemamärksõnad",  # Entity type for the keywords
    lemmatization_required=True,  # (Optional) Whether to lemmatize the text - only set False if text_file is already lemmatized
    max_features=20000,  # (Optional) Maximum number of features for TF-IDF extraction
    keep_train_file=False,  # (Optional) Whether to retain intermediate training files
    eval_split=0.1  # (Optional) Proportion of the dataset used for evaluation
)

predictions = model.predict(
    text="Kui Arno isaga koolimajja jõudis",  # Text to classify
    top_k=3  # Number of top predictions to return
)  # Output: [('koolimajad', 0.262), ('isad', 0.134), ('õpilased', 0.062)]
```

📂 Data Format

The files provided to the train function should be in the following format:

A text file (.txt) where each line is a document.
```
Document one content.
Document two content.
```
A label file (.txt) where each line contains semicolon-separated labels corresponding to the text file.
```
label1;label2
label3;label4
```
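
Because the two files are matched line by line, it can help to validate them before training. The sketch below reads both files, checks that they have the same number of lines, and splits each label line on semicolons. `read_parallel_files` is a hypothetical helper written for this example and is not the package's `DataLoader`; the sample files are created in a temporary directory.

```python
import tempfile
from pathlib import Path


def read_parallel_files(text_file, label_file):
    """Pair each document with its semicolon-separated labels, checking that line counts match."""
    texts = Path(text_file).read_text(encoding="utf-8").splitlines()
    label_lines = Path(label_file).read_text(encoding="utf-8").splitlines()
    if len(texts) != len(label_lines):
        raise ValueError(f"{len(texts)} documents but {len(label_lines)} label lines")
    return [
        (text, [label.strip() for label in line.split(";") if label.strip()])
        for text, line in zip(texts, label_lines)
    ]


# Write the sample files from the format description, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    text_path = Path(tmp) / "texts.txt"
    label_path = Path(tmp) / "labels.txt"
    text_path.write_text("Document one content.\nDocument two content.\n", encoding="utf-8")
    label_path.write_text("label1;label2\nlabel3;label4\n", encoding="utf-8")
    pairs = read_parallel_files(text_path, label_path)

print(pairs)
```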

🛠 Components Overview

| Component | Description |
| --- | --- |
| `DataLoader` | Handles reading and preprocessing parallel text-label files. |
| `TfidfFeatureExtractor` | Extracts TF-IDF features from preprocessed text files. |
| `LabelBinarizer` | Encodes labels into a sparse binary matrix. |
| `TextPreprocessor` | Handles text preprocessing, including lemmatization. |
| `OmikujiModel` | Handles model training using Omikuji, a scalable extreme classification library. |
| `OmikujiHelpers` | Helper functions for Omikuji model training and evaluation. |

Training Phraser for Unsupervised Keyword Extraction

A sample code snippet to train and predict using the Phraser model is provided below:

```
from rara_subject_indexer.unsupervised.phraser_model import PhraserModel

model = PhraserModel()

model.train(
    train_data_path=".../train.txt",  # File with one document per line, text should be lemmatised.
    lang_code="et",  # Language of the text, in ISO 639-1 format
    min_count=5,  # (Optional) Minimum word frequency for phrase formation.
    threshold=10.0  # (Optional) Score threshold for forming phrases.
)

predictions = model.predict(
    text="vabariik aastapäev sööma kiluvõileib",  # Lemmatised text for phrase detection
)  # Output: ['vabariik_aastapäev', 'sööma', 'kiluvõileib']
```
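
To give a feel for how `min_count` and `threshold` interact, the sketch below scores adjacent word pairs with the formula used by Gensim's default phrase scorer, `(count(ab) - min_count) * vocab_size / (count(a) * count(b))`, and keeps pairs whose score exceeds the threshold. This assumes that `PhraserModel` passes both values through to Gensim unchanged, which is not confirmed here. The function `score_bigrams` is hypothetical, it counts only single-word types toward the vocabulary size as a simplification, and the toy corpus uses a very low threshold because its vocabulary is tiny.

```python
from collections import Counter


def score_bigrams(sentences, min_count=5, threshold=10.0):
    """Score adjacent word pairs and return those whose score exceeds the threshold."""
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    accepted = {}
    for (first, second), count_ab in bigrams.items():
        score = (count_ab - min_count) * vocab_size / (unigrams[first] * unigrams[second])
        if score > threshold:
            accepted[f"{first}_{second}"] = score
    return accepted


# Toy lemmatised corpus: one sentence repeated 6 times, another repeated 20 times.
corpus = [["vabariik", "aastapäev", "sööma", "kiluvõileib"]] * 6 + [["sööma", "leib"]] * 20
phrases = score_bigrams(corpus, min_count=5, threshold=0.05)
print(sorted(phrases))
```

Raising `min_count` lowers every score and raising `threshold` admits fewer pairs, so both settings make phrase formation more conservative.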

📂 Data Format

The file provided to the PhraserModel train function should be in the following format:

A text file (.txt) where each line is a document.
```
Document one content.
Document two content.
```

🛠 Components Overview

| Component | Description |
| --- | --- |
| `KeywordExtractor` | Extracts topic keywords from the text using unsupervised methods. Optionally multi-word keywords can be found using a pretrained PhraserModel. Spelling mistakes can be automatically corrected using SpellCorrector. |
| `PhraserModel` | Handles Gensim Phraser model training and evaluation. |
| `SpellCorrector` | Handles spelling correction logic using SymSpell. |
