Automatically detect subject indices.


RaRa Subject Indexer

Supports Python 3.10, 3.11, and 3.12.

rara-subject-indexer is a Python library for predicting subject indices (keywords) for textual inputs.


✨ Features

  • Predict subject indices of the following types: personal names, organizations, titles of works, locations, events, topics, UDC Summary, UDC National Bibliography, times, genres/forms, and EMS categories.
  • Supports subject indexing for texts in Estonian and English.
  • Use Omikuji for supervised subject indexing.
  • Use RaKUn for unsupervised subject indexing.
  • Use StanzaNER and/or GLiNER for NER-based subject indexing.
  • Train new Omikuji models.

⚡ Quick Start

Get started with rara-subject-indexer in just a few steps:

  1. Install the Package
    Ensure you're using Python 3.10 or above, then run:

    pip install rara-subject-indexer
    
  2. Import and Use
    Example usage for finding subject indices with default configuration:

    from rara_subject_indexer.rara_indexer import RaraSubjectIndexer
    from pprint import pprint
    
    # On first use, download the required models.
    # NB! This has to be done only once!
    RaraSubjectIndexer.download_resources()
    
    # Initialize the instance with default configuration
    rara_indexer = RaraSubjectIndexer()
    
    # A dummy text; use a longer input to get meaningful results
    text = "Kui Arno isaga koolimajja jõudis, olid tunnid juba alanud."
    
    subject_indices = rara_indexer.apply_indexers(text=text)
    pprint(subject_indices)
    


⚙️ Installation Guide

Follow the steps below to install the rara-subject-indexer package, either via pip or locally.


Installation via pip

  1. Set Up Your Python Environment
    Create or activate a Python environment using Python 3.10 or above.

  2. Install the Package
    Run the following command:

    pip install rara-subject-indexer
    

Local Installation

Follow these steps to install the rara-subject-indexer package locally:

  1. Clone the Repository
    Clone the repository and navigate into it:

    git clone <repository-url>
    cd <repository-directory>
    
  2. Set Up Python Environment
    Create or activate a Python environment using Python 3.10 or above, e.g.:

    conda create -n py310 python==3.10
    conda activate py310
    
  3. Install Build Package
    Install the build package to enable local builds:

    pip install build
    
  4. Build the Package
    Run the following command inside the repository:

    python -m build
    
  5. Install the Package
    Install the built package locally:

    pip install .
    

📝 Testing

  1. Clone the Repository
    Clone the repository and navigate into it:

    git clone <repository-url>
    cd <repository-directory>
    
  2. Set Up Python Environment
    Create or activate a Python environment using Python 3.10 or above.

  3. Install Build Package
    Install the build package:

    pip install build
    
  4. Build the Package
    Build the package inside the repository:

    python -m build
    
  5. Install with Testing Dependencies
    Install the package along with its testing dependencies:

    pip install .[testing]
    
  6. Run Tests
    Run the test suite from the repository root:

    python -m pytest -v tests
    

📚 Documentation


🔍 RaraSubjectIndexer Class

Overview

RaraSubjectIndexer wraps the logic of all underlying models and keyword types behind a single interface.

Parameters
| Name | Type | Optional | Default | Description |
|------|------|----------|---------|-------------|
| methods | Dict[str, List[str]] | True | DEFAULT_METHOD_MAP | Methods to use for each keyword type. See ALLOWED_METHODS for the supported methods of each keyword type. |
| keyword_types | List[str] | True | DEFAULT_KEYWORD_TYPES | Keyword (subject index) types to predict. See ALLOWED_KEYWORD_TYPES for the supported keyword types. |
| topic_config | dict | True | DEFAULT_TOPIC_CONFIG | Configuration for topic subject indexing models. |
| time_config | dict | True | DEFAULT_TIME_CONFIG | Configuration for time subject indexing models. |
| genre_config | dict | True | DEFAULT_GENRE_CONFIG | Configuration for genre/form subject indexing models. |
| category_config | dict | True | DEFAULT_CATEGORY_CONFIG | Configuration for EMS category prediction models. |
| udc_config | dict | True | DEFAULT_UDC_CONFIG | Configuration for UDC (National Bibliography) prediction models. |
| udc2_config | dict | True | DEFAULT_UDC2_CONFIG | Configuration for UDC Summary models. |
Default configurations

DEFAULT_METHOD_MAP:

{
    "Teemamärksõnad": ["omikuji", "rakun"],
    "Kohamärksõnad": ["ner_ensemble"],
    "Isikunimi": ["ner_ensemble"],
    "Kollektiivi nimi": ["ner_ensemble"],
    "Ajamärksõnad": ["omikuji"],
    "Teose pealkiri": ["gliner"],
    "UDK Rahvusbibliograafia": ["omikuji"],
    "UDC Summary": ["omikuji"],
    "Vormimärksõnad": ["omikuji"],
    "Valdkonnamärksõnad": ["omikuji"],
    "NER": ["ner"],
    "Ajutine kollektiiv või sündmus": ["gliner"]
}

DEFAULT_KEYWORD_TYPES:

[
    "Teemamärksõnad",
    "Kohamärksõnad",
    "Isikunimi",
    "Kollektiivi nimi",
    "Ajamärksõnad",
    "Teose pealkiri",
    "UDK Rahvusbibliograafia",
    "UDC Summary",
    "Vormimärksõnad",
    "Valdkonnamärksõnad",
    "Ajutine kollektiiv või sündmus"
]

DEFAULT_TOPIC_CONFIG:

{
    "omikuji": {
        "et": "./rara_subject_indexer/data/omikuji_models/teemamarksonad_est",
        "en": "./rara_subject_indexer/data/omikuji_models/teemamarksonad_eng"
    },
    "rakun": {
        "stopwords": {
            "et": <list of stopwords loaded from "rara_subject_indexer/resources/stopwords/et_stopwords_lemmas.txt">,
            "en": <list of stopwords loaded from "rara_subject_indexer/resources/stopwords/et_stopwords.txt">
        },
        "n_raw_keywords": 30
    }
}

DEFAULT_TIME_CONFIG:

{
    "omikuji": {
        "et": "./rara_subject_indexer/data/omikuji_models/ajamarksonad_est",
        "en": "./rara_subject_indexer/data/omikuji_models/ajamarksonad_eng"
    },
    "rakun": {}
}

DEFAULT_GENRE_CONFIG:

{
    "omikuji": {
        "et": "./rara_subject_indexer/data/omikuji_models/vormimarksonad_est",
        "en": "./rara_subject_indexer/data/omikuji_models/vormimarksonad_eng"
    },
    "rakun": {}
}

DEFAULT_CATEGORY_CONFIG:

{
    "omikuji": {
        "et": "./rara_subject_indexer/data/omikuji_models/valdkonnamarksonad_est",
        "en": "./rara_subject_indexer/data/omikuji_models/valdkonnamarksonad_eng"
    },
    "rakun": {}
}

DEFAULT_UDC_CONFIG:

{
    "omikuji": {
        "et": "./rara_subject_indexer/data/omikuji_models/udk_rahvbibl_est",
        "en": "./rara_subject_indexer/data/omikuji_models/udk_rahvbibl_eng"
    },
    "rakun": {}
}

DEFAULT_UDC2_CONFIG:

{
    "omikuji": {
        "et": "./rara_subject_indexer/data/omikuji_models/udk_general_depth_11_est",
        "en": "./rara_subject_indexer/data/omikuji_models/udk_general_depth_11_eng"
    },
    "rakun": {}
}

DEFAULT_NER_CONFIG:

{
    "ner": {
        "stanza_config": {
            "resource_dir": "./rara_subject_indexer/data/ner_resources/",
            "download_resources": False,
            "supported_languages": ["et", "en"],
            "custom_ner_model_langs": ["et"],
            "refresh_data": False,
            "custom_ner_models": {
                "et": "https://packages.texta.ee/texta-resources/ner_models/_estonian_nertagger.pt"
            },
            "unknown_lang_token": "unk"
        },
        "gliner_config": {
            "labels": ["Person", "Organization", "Location", "Title of a work", "Date", "Event"],
            "model_name": "urchade/gliner_multi-v2.1",
            "multi_label": False,
            "resource_dir": "./rara_subject_indexer/data/ner_resources/",
            "threshold": 0.5,
            "device": "cpu"
        },
        "ner_method_map": {
            "PER": "ner_ensemble",
            "ORG": "ner_ensemble",
            "LOC": "ner_ensemble",
            "TITLE": "gliner",
            "EVENT": "gliner"
        }
    }
}
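To illustrate the role of ner_method_map: it assigns each recognized entity label to the method whose output is used for that label. Below is a minimal plain-Python sketch of that routing logic; it is illustrative only, not the library's internal code, and the route_entities helper is hypothetical.

```python
# Illustrative sketch (not library code): how a ner_method_map routes
# each recognized entity label to the extraction method handling it.
NER_METHOD_MAP = {
    "PER": "ner_ensemble",
    "ORG": "ner_ensemble",
    "LOC": "ner_ensemble",
    "TITLE": "gliner",
    "EVENT": "gliner",
}

def route_entities(entities: list[dict]) -> dict[str, list[str]]:
    """Group entity texts by the method responsible for their label."""
    grouped: dict[str, list[str]] = {}
    for ent in entities:
        method = NER_METHOD_MAP.get(ent["label"])
        if method is not None:  # skip labels with no configured method
            grouped.setdefault(method, []).append(ent["text"])
    return grouped

entities = [
    {"text": "Arno", "label": "PER"},
    {"text": "Paunvere", "label": "LOC"},
    {"text": "Kevade", "label": "TITLE"},
]
print(route_entities(entities))
# {'ner_ensemble': ['Arno', 'Paunvere'], 'gliner': ['Kevade']}
```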
Key Functions

Coming soon


Training Supervised and Unsupervised Models

If necessary, you can train the supervised and unsupervised models from scratch using the provided pipelines. The training process involves reading text and label files, preprocessing the text, and training the models using the extracted features.

Training an Omikuji Model for Supervised Keyword Extraction

A sample code snippet to train and predict using the Omikuji model is provided below:

from rara_subject_indexer.supervised.omikuji.omikuji_model import OmikujiModel

model = OmikujiModel()

model.train(
    text_file="texts.txt",         # File with one document per line
    label_file="labels.txt",       # File with semicolon-separated labels for each document
    language="et",                 # Language of the text, in ISO 639-1 format
    entity_type="Teemamärksõnad",  # Entity type for the keywords
    lemmatization_required=True,   # (Optional) Whether to lemmatize the text - only set False if text_file is already lemmatized
    max_features=20000,            # (Optional) Maximum number of features for TF-IDF extraction
    keep_train_file=False,         # (Optional) Whether to retain intermediate training files
    eval_split=0.1                 # (Optional) Proportion of the dataset used for evaluation
)

predictions = model.predict(
    text="Kui Arno isaga koolimajja jõudis",  # Text to classify
    top_k=3  # Number of top predictions to return
)  # Output: [('koolimajad', 0.262), ('isad', 0.134), ('õpilased', 0.062)]
📂 Data Format

The files provided to the train function should be in the following format:

  • A text file (.txt) where each line is a document.
    Document one content.
    Document two content.
    
  • A label file (.txt) where each line contains semicolon-separated labels corresponding to the text file.
    label1;label2
    label3;label4
    
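A quick way to generate and sanity-check a pair of such files before calling train (a plain-Python sketch using only the standard library; the file contents mirror the examples above):

```python
from pathlib import Path
import tempfile

# Write the two parallel files described above: one document per line in
# texts.txt, and matching semicolon-separated labels per line in labels.txt.
docs = ["Document one content.", "Document two content."]
labels = [["label1", "label2"], ["label3", "label4"]]

tmpdir = Path(tempfile.mkdtemp())
text_file = tmpdir / "texts.txt"
label_file = tmpdir / "labels.txt"

text_file.write_text("\n".join(docs), encoding="utf-8")
label_file.write_text("\n".join(";".join(ls) for ls in labels), encoding="utf-8")

# Every document line must have a corresponding label line.
n_docs = len(text_file.read_text(encoding="utf-8").splitlines())
n_labels = len(label_file.read_text(encoding="utf-8").splitlines())
assert n_docs == n_labels, "texts.txt and labels.txt are misaligned"
```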

Training Phraser for Unsupervised Keyword Extraction

A sample code snippet to train and predict using the Phraser model is provided below:

from rara_subject_indexer.utils.phraser_model import PhraserModel

model = PhraserModel()

model.train(
    train_data_path=".../train.txt",  # File with one document per line, text should be lemmatised.
    lang_code="et",                   # Language of the text, in ISO 639-1 format
    min_count=5,                      # (Optional) Minimum word frequency for phrase formation.
    threshold=10.0                    # (Optional) Score threshold for forming phrases.
)

predictions = model.predict(
    text="vabariik aastapäev sööma kiluvõileib"  # Lemmatised text for phrase detection
)  # Output: ['vabariik_aastapäev', 'sööma', 'kiluvõileib']
📂 Data Format

The file provided to the PhraserModel train function should be in the following format:

  • A text file (.txt) where each line is a document.
    Document one content.
    Document two content.
    
