
Automatically detect subject indices.


Subject Indexers

This repository provides two pipelines:

  1. A supervised pipeline for processing text and label files to train and evaluate an Omikuji model. It includes text lemmatization, TF-IDF feature extraction, and label binarization, and is designed for extreme multilabel classification.
  2. An unsupervised pipeline for processing text and extracting topic keywords. Multiword keyword detection can optionally be enabled using a pretrained PhraserModel, and spelling mistakes can be corrected automatically by enabling SpellCorrector.

⚙️ Installation Guide

Preparing the Environment

  1. Set Up Your Python Environment
    Ensure you have Python 3.10 or above installed.

  2. Install Required Dependencies
    Install the required dependencies using:

    pip install -r requirements.txt
    
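To fail fast when the interpreter is too old, a minimal check (a sketch; the version floor comes from step 1 above):

import sys

# The pipelines require Python 3.10 or above (see step 1).
assert sys.version_info >= (3, 10), "Python 3.10 or above is required"
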

🔮 Supervised Omikuji pipeline


🚀 Running the Pipeline

A sample code snippet to train and predict using the Omikuji model is provided below:

from src.supervised.omikuji_model import OmikujiModel

model = OmikujiModel()  
# model.load(".../teemamarksonad_est") # Optionally load a pre-trained model and skip training

model.train(
    text_file="texts.txt",          # File with one document per line
    label_file="labels.txt",        # File with semicolon-separated labels for each document
    language="et",                  # Language of the text, in ISO 639-1 format
    lemmatization_required=True,    # (Optional) Whether to lemmatize the text - only set False if text_file is already lemmatized
    max_features=20000,             # (Optional) Maximum number of features for TF-IDF extraction
    keep_train_file=False,          # (Optional) Whether to retain intermediate training files
    eval_split=0.1                  # (Optional) Proportion of the dataset used for evaluation
)

predictions = model.predict(
    text="Kui Arno isaga koolimajja jõudis",  # Text to classify
    top_k=3                                   # Number of top predictions to return
)  # Output: [('koolimajad', 0.262), ('isad', 0.134), ('õpilased', 0.062)]
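
Continuing from the snippet above, the same predict call can be mapped over many documents. A minimal batch-prediction sketch over a file in the data format described below; the file name new_texts.txt is illustrative:

# Predict the top 3 subjects for every document in a file (one document per line).
with open("new_texts.txt", encoding="utf-8") as infile:
    for line in infile:
        results = model.predict(text=line.strip(), top_k=3)
        print(";".join(label for label, score in results))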

📂 Data Format

The files provided to the train function should be in the following format:

  • A text file (.txt) where each line is a document.
    Document one content.
    Document two content.
    
  • A label file (.txt) where each line contains semicolon-separated labels corresponding to the text file.
    label1;label2
    label3;label4
    
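For reference, the following sketch writes a matching pair of files in this format; the documents and labels are placeholders:

documents = ["Document one content.", "Document two content."]
labels = [["label1", "label2"], ["label3", "label4"]]

# Write one document per line and the matching semicolon-separated labels per line.
with open("texts.txt", "w", encoding="utf-8") as text_file, open("labels.txt", "w", encoding="utf-8") as label_file:
    for document, document_labels in zip(documents, labels):
        text_file.write(document + "\n")
        label_file.write(";".join(document_labels) + "\n")
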

🛠 Components Overview

  • DataLoader: Handles reading and preprocessing parallel text-label files.
  • TfidfFeatureExtractor: Extracts TF-IDF features from preprocessed text files.
  • LabelBinarizer: Encodes labels into a sparse binary matrix.
  • TextPreprocessor: Handles text preprocessing, including lemmatization.
  • OmikujiModel: Handles model training using Omikuji, a scalable extreme classification library.
  • OmikujiHelpers: Helper functions for Omikuji model training and evaluation.

📝 Testing

Run the test suite:

python -m pytest -v tests

⛓️‍💥 Unsupervised RaKUn + Phraser pipeline


🚀 Running the Pipeline

A sample code snippet to extract keywords from arbitrary text is provided below:

from src.unsupervised.unsup_kw_extractor import KeywordExtractor

model = KeywordExtractor()  # Optionally provide model_artifacts_path to load a pre-trained model.

predictions = model.predict(
    text="Kui Arno isaga ...",  # Text to extract keywords from
    lang_code="et",             # (Optional) Language of the text in ISO 639-1 format; detected automatically if not provided
    top_n=10,                   # Number of top keywords to return
    merge_threshold=0.0,        # (Optional) Threshold for merging words into a single keyword; if 0.0, no words are merged
    use_phraser=True,           # (Optional) Whether to use a Phraser model; available models must be defined in constants.py
    correct_spelling=True,      # (Optional) Whether to apply spelling correction
    preserve_case=True,         # (Optional) Whether to preserve the original case
    max_uppercase=2,            # (Optional) Maximum number of uppercase letters in a word for spelling correction to still apply
    min_word_frequency=3,       # (Optional) Minimum frequency of a word in the input text for it to be exempt from spelling correction
)  # Output: ['koolimaja']
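
The same predict call can be applied to a collection of documents; a minimal sketch using only the parameters shown above (the documents are placeholders, and lang_code is omitted so the language is detected automatically):

documents = [
    "Kui Arno isaga koolimajja jõudis ...",
    "Vabariigi aastapäeval pakuti kiluvõileibu ...",
]

# Extract up to five keywords per document, with automatic language detection.
for document in documents:
    keywords = model.predict(text=document, top_n=5)
    print(keywords)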

A sample code snippet to train and predict using the Phraser model is provided below:

from src.unsupervised.phraser_model import PhraserModel

model = PhraserModel()

model.train(
    train_data_path=".../train.txt",  # File with one document per line; the text should be lemmatized
    lang_code="et",                   # Language of the text, in ISO 639-1 format
    min_count=5,                      # (Optional) Minimum word frequency for phrase formation
    threshold=10.0                    # (Optional) Score threshold for forming phrases
)

predictions = model.predict(
    text="vabariik aastapäev sööma kiluvõileib",  # Lemmatized text for phrase detection
)  # Output: ['vabariik_aastapäev', 'sööma', 'kiluvõileib']
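
Tokens that frequently co-occur in the training data (here vabariik and aastapäev) are joined with an underscore into a single phrase token. The min_count and threshold arguments mirror the parameters of Gensim's Phrases, which PhraserModel wraps: min_count filters out rare words before phrases are considered, and raising threshold makes phrase formation more conservative.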

📂 Data Format

The file provided to the PhraserModel train function should be in the following format:

  • A text file (.txt) where each line is a lemmatized document.
    Document one content.
    Document two content.
    

🛠 Components Overview

  • KeywordExtractor: Extracts topic keywords from text using unsupervised methods. Multiword keywords can optionally be detected using a pretrained PhraserModel, and spelling mistakes can be corrected automatically using SpellCorrector.
  • PhraserModel: Handles Gensim Phraser model training and evaluation.
  • SpellCorrector: Handles spelling correction logic using SymSpell.
