
Automatically detect subject indices.


Subject Indexers

This repository provides two pipelines:

  1. A supervised pipeline for processing text and label files to train and evaluate an Omikuji model. It includes text lemmatization, TF-IDF feature extraction, and label binarization, and is designed for extreme multilabel classification.
  2. An unsupervised pipeline for processing text and extracting topic keywords. Multiword keyword detection can optionally be enabled using a pretrained PhraserModel, and spelling mistakes can be corrected automatically by enabling SpellCorrector.

⚙️ Installation Guide

Preparing the Environment

  1. Set Up Your Python Environment
    Ensure you have Python 3.10 or above installed.

  2. Install Required Dependencies
    Install the required dependencies using:

    pip install -r requirements.txt
    
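To fail fast when the interpreter is too old, a minimal check (a sketch; the version floor comes from step 1 above):

import sys

# The pipelines require Python 3.10 or above (see step 1).
assert sys.version_info >= (3, 10), "Python 3.10 or above is required"
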

🔮 Supervised Omikuji pipeline


🚀 Running the Pipeline

A sample code snippet to train and predict using the Omikuji model is provided below:

from src.supervised.omikuji_model import OmikujiModel

model = OmikujiModel()  
# model.load(".../teemamarksonad_est") # Optionally load a pre-trained model and skip training

model.train(
    text_file="texts.txt",          # File with one document per line
    label_file="labels.txt",        # File with semicolon-separated labels for each document
    language="et",                  # Language of the text, in ISO 639-1 format
    lemmatization_required=True,    # (Optional) Whether to lemmatize the text - only set False if text_file is already lemmatized
    max_features=20000,             # (Optional) Maximum number of features for TF-IDF extraction
    keep_train_file=False,          # (Optional) Whether to retain intermediate training files
    eval_split=0.1                  # (Optional) Proportion of the dataset used for evaluation
)

predictions = model.predict(
    text="Kui Arno isaga koolimajja jõudis",  # Text to classify
    top_k=3                                   # Number of top predictions to return
)  # Output: [('koolimajad', 0.262), ('isad', 0.134), ('õpilased', 0.062)]
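
Continuing from the snippet above, the same predict call can be mapped over many documents. A minimal batch-prediction sketch over a file in the data format described below; the file name new_texts.txt is illustrative:

# Predict the top 3 subjects for every document in a file (one document per line).
with open("new_texts.txt", encoding="utf-8") as infile:
    for line in infile:
        results = model.predict(text=line.strip(), top_k=3)
        print(";".join(label for label, score in results))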

📂 Data Format

The files provided to the train function should be in the following format:

  • A text file (.txt) where each line is a document.
    Document one content.
    Document two content.
    
  • A label file (.txt) where each line contains semicolon-separated labels corresponding to the text file.
    label1;label2
    label3;label4
    
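For reference, the following sketch writes a matching pair of files in this format; the documents and labels are placeholders:

documents = ["Document one content.", "Document two content."]
labels = [["label1", "label2"], ["label3", "label4"]]

# Write one document per line and the matching semicolon-separated labels per line.
with open("texts.txt", "w", encoding="utf-8") as text_file, open("labels.txt", "w", encoding="utf-8") as label_file:
    for document, document_labels in zip(documents, labels):
        text_file.write(document + "\n")
        label_file.write(";".join(document_labels) + "\n")
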

🛠 Components Overview

  • DataLoader: Handles reading and preprocessing parallel text-label files.
  • TfidfFeatureExtractor: Extracts TF-IDF features from preprocessed text files.
  • LabelBinarizer: Encodes labels into a sparse binary matrix.
  • TextPreprocessor: Handles text preprocessing, including lemmatization.
  • OmikujiModel: Handles model training using Omikuji, a scalable extreme classification library.
  • OmikujiHelpers: Helper functions for Omikuji model training and evaluation.

📝 Testing

Run the test suite:

python -m pytest -v tests

⛓️‍💥 Unsupervised RaKUn + Phraser pipeline


🚀 Running the Pipeline

A sample code snippet to extract keywords from arbitrary text is provided below:

from src.unsupervised.unsup_kw_extractor import KeywordExtractor

model = KeywordExtractor()  # Optionally provide model_artifacts_path to load a pre-trained model.

predictions = model.predict(
    text="Kui Arno isaga ...",  # Text to extract keywords from
    lang_code="et",             # (Optional) Language of the text in ISO 639-1 format; detected automatically if not provided
    top_n=10,                   # Number of top keywords to return
    merge_threshold=0.0,        # (Optional) Threshold for merging words into a single keyword; if 0.0, no words are merged
    use_phraser=True,           # (Optional) Whether to use a Phraser model; available models must be defined in constants.py
    correct_spelling=True,      # (Optional) Whether to apply spelling correction
    preserve_case=True,         # (Optional) Whether to preserve the original case
    max_uppercase=2,            # (Optional) Maximum number of uppercase letters in a word for spelling correction to still apply
    min_word_frequency=3,       # (Optional) Minimum frequency of a word in the input text for it to be exempt from spelling correction
)  # Output: ['koolimaja']
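
The same predict call can be applied to a collection of documents; a minimal sketch using only the parameters shown above (the documents are placeholders, and lang_code is omitted so the language is detected automatically):

documents = [
    "Kui Arno isaga koolimajja jõudis ...",
    "Vabariigi aastapäeval pakuti kiluvõileibu ...",
]

# Extract up to five keywords per document, with automatic language detection.
for document in documents:
    keywords = model.predict(text=document, top_n=5)
    print(keywords)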

A sample code snippet to train and predict using the Phraser model is provided below:

from src.unsupervised.phraser_model import PhraserModel

model = PhraserModel()

model.train(
    train_data_path=".../train.txt",  # File with one document per line; the text should be lemmatized
    lang_code="et",                   # Language of the text, in ISO 639-1 format
    min_count=5,                      # (Optional) Minimum word frequency for phrase formation
    threshold=10.0                    # (Optional) Score threshold for forming phrases
)

predictions = model.predict(
    text="vabariik aastapäev sööma kiluvõileib",  # Lemmatized text for phrase detection
)  # Output: ['vabariik_aastapäev', 'sööma', 'kiluvõileib']
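
Tokens that frequently co-occur in the training data (here vabariik and aastapäev) are joined with an underscore into a single phrase token. The min_count and threshold arguments mirror the parameters of Gensim's Phrases, which PhraserModel wraps: min_count filters out rare words before phrases are considered, and raising threshold makes phrase formation more conservative.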

📂 Data Format

The file provided to the PhraserModel train function should be in the following format:

  • A text file (.txt) where each line is a lemmatized document.
    Document one content.
    Document two content.
    

🛠 Components Overview

  • KeywordExtractor: Extracts topic keywords from text using unsupervised methods. Multiword keywords can optionally be detected using a pretrained PhraserModel, and spelling mistakes can be corrected automatically using SpellCorrector.
  • PhraserModel: Handles Gensim Phraser model training and evaluation.
  • SpellCorrector: Handles spelling correction logic using SymSpell.
