Automatically detect subject indices.


RaRa Subject Indexer

Supports Python 3.10, 3.11, and 3.12.

rara-subject-indexer is a Python library for predicting subject indices (keywords) for textual inputs.


✨ Features

  • Predict subject indices of the following types: personal names, organizations, titles of works, locations, events, topics, UDC Summary, UDC National Bibliography, times, genres/forms, and EMS categories.
  • Supports subject indexing for texts in Estonian and English.
  • Use Omikuji for supervised subject indexing.
  • Use RaKUn for unsupervised subject indexing.
  • Use StanzaNER and/or GLiNER for NER-based subject indexing.
  • Train new Omikuji models.

⚡ Quick Start

Get started with rara-subject-indexer in just a few steps:

  1. Install the Package
    Ensure you're using Python 3.10 or above, then run:

    pip install rara-subject-indexer
    
  2. Import and Use
    Example usage for finding subject indices with default configuration:

    from rara_subject_indexer.rara_indexer import RaraSubjectIndexer
    from pprint import pprint
    
    # On first use, download the required models.
    # NB! This has to be done only once!
    RaraSubjectIndexer.download_resources()
    
    # Initialize the instance with default configuration
    rara_indexer = RaraSubjectIndexer()
    
    # A dummy text; use a longer input to get meaningful results
    text = "Kui Arno isaga koolimajja jõudis, olid tunnid juba alanud."
    
    subject_indices = rara_indexer.apply_indexers(text=text)
    pprint(subject_indices)
    


⚙️ Installation Guide

Follow the steps below to install the rara-subject-indexer package, either via pip or locally.


Installation via pip

  1. Set Up Your Python Environment
    Create or activate a Python environment using Python 3.10 or above.

  2. Install the Package
    Run the following command:

    pip install rara-subject-indexer
    

Local Installation

Follow these steps to install the rara-subject-indexer package locally:

  1. Clone the Repository
    Clone the repository and navigate into it:

    git clone <repository-url>
    cd <repository-directory>
    
  2. Set Up Python Environment
    Create or activate a Python environment using Python 3.10 or above, e.g.:

    conda create -n py310 python==3.10
    conda activate py310
    
  3. Install Build Package
    Install the build package to enable local builds:

    pip install build
    
  4. Build the Package
    Run the following command inside the repository:

    python -m build
    
  5. Install the Package
    Install the built package locally:

    pip install .
    

📝 Testing

  1. Clone the Repository
    Clone the repository and navigate into it:

    git clone <repository-url>
    cd <repository-directory>
    
  2. Set Up Python Environment
    Create or activate a Python environment using Python 3.10 or above.

  3. Install Build Package
    Install the build package:

    pip install build
    
  4. Build the Package
    Build the package inside the repository:

    python -m build
    
  5. Install with Testing Dependencies
    Install the package along with its testing dependencies:

    pip install .[testing]
    
  6. Run Tests
    Run the test suite from the repository root:

    python -m pytest -v tests
    

📚 Documentation


🔍 RaraSubjectIndexer Class

Overview

RaraSubjectIndexer wraps the logic of all underlying models and keyword types behind a single interface.

Parameters
| Name | Type | Optional | Default | Description |
|------|------|----------|---------|-------------|
| methods | Dict[str, List[str]] | True | DEFAULT_METHOD_MAP | Methods to use for each keyword type. See ALLOWED_METHODS for the supported methods of each keyword type. |
| keyword_types | List[str] | True | DEFAULT_KEYWORD_TYPES | Keyword (subject index) types to predict. See ALLOWED_KEYWORD_TYPES for the supported keyword types. |
| topic_config | dict | True | DEFAULT_TOPIC_CONFIG | Configuration for topic subject indexing models. |
| time_config | dict | True | DEFAULT_TIME_CONFIG | Configuration for time subject indexing models. |
| genre_config | dict | True | DEFAULT_GENRE_CONFIG | Configuration for genre/form subject indexing models. |
| category_config | dict | True | DEFAULT_CATEGORY_CONFIG | Configuration for EMS category prediction models. |
| udc_config | dict | True | DEFAULT_UDC_CONFIG | Configuration for UDC (National Bibliography) prediction models. |
| udc2_config | dict | True | DEFAULT_UDC2_CONFIG | Configuration for UDC Summary models. |
Default configurations

DEFAULT_METHOD_MAP:

{
    "Teemamärksõnad": ["omikuji", "rakun"],
    "Kohamärksõnad": ["ner_ensemble"],
    "Isikunimi": ["ner_ensemble"],
    "Kollektiivi nimi": ["ner_ensemble"],
    "Ajamärksõnad": ["omikuji"],
    "Teose pealkiri": ["gliner"],
    "UDK Rahvusbibliograafia": ["omikuji"],
    "UDC Summary": ["omikuji"],
    "Vormimärksõnad": ["omikuji"],
    "Valdkonnamärksõnad": ["omikuji"],
    "NER": ["ner"],
    "Ajutine kollektiiv või sündmus": ["gliner"]
}

DEFAULT_KEYWORD_TYPES:

[
    "Teemamärksõnad",
    "Kohamärksõnad",
    "Isikunimi",
    "Kollektiivi nimi",
    "Ajamärksõnad",
    "Teose pealkiri",
    "UDK Rahvusbibliograafia",
    "UDC Summary",
    "Vormimärksõnad",
    "Valdkonnamärksõnad",
    "Ajutine kollektiiv või sündmus"
]

DEFAULT_TOPIC_CONFIG:

{
    "omikuji": {
        "et": "./rara_subject_indexer/data/omikuji_models/teemamarksonad_est",
        "en": "./rara_subject_indexer/data/omikuji_models/teemamarksonad_eng"
    },
    "rakun": {
        "stopwords": {
            "et": <list of stopwords loaded from "rara_subject_indexer/resources/stopwords/et_stopwords_lemmas.txt">,
            "en": <list of stopwords loaded from "rara_subject_indexer/resources/stopwords/et_stopwords.txt">
        },
        "n_raw_keywords": 30
    }
}

DEFAULT_TIME_CONFIG:

{
    "omikuji": {
        "et": "./rara_subject_indexer/data/omikuji_models/ajamarksonad_est",
        "en": "./rara_subject_indexer/data/omikuji_models/ajamarksonad_eng"
    },
    "rakun": {}
}

DEFAULT_GENRE_CONFIG:

{
    "omikuji": {
        "et": "./rara_subject_indexer/data/omikuji_models/vormimarksonad_est",
        "en": "./rara_subject_indexer/data/omikuji_models/vormimarksonad_eng"
    },
    "rakun": {}
}

DEFAULT_CATEGORY_CONFIG:

{
    "omikuji": {
        "et": "./rara_subject_indexer/data/omikuji_models/valdkonnamarksonad_est",
        "en": "./rara_subject_indexer/data/omikuji_models/valdkonnamarksonad_eng"
    },
    "rakun": {}
}

DEFAULT_UDC_CONFIG:

{
    "omikuji": {
        "et": "./rara_subject_indexer/data/omikuji_models/udk_rahvbibl_est",
        "en": "./rara_subject_indexer/data/omikuji_models/udk_rahvbibl_eng"
    },
    "rakun": {}
}

DEFAULT_UDC2_CONFIG:

{
    "omikuji": {
        "et": "./rara_subject_indexer/data/omikuji_models/udk_general_depth_11_est",
        "en": "./rara_subject_indexer/data/omikuji_models/udk_general_depth_11_eng"
    },
    "rakun": {}
}

DEFAULT_NER_CONFIG:

{
    "ner": {
        "stanza_config": {
            "resource_dir": "./rara_subject_indexer/data/ner_resources/",
            "download_resources": False,
            "supported_languages": ["et", "en"],
            "custom_ner_model_langs": ["et"],
            "refresh_data": False,
            "custom_ner_models": {
                "et": "https://packages.texta.ee/texta-resources/ner_models/_estonian_nertagger.pt"
            },
            "unknown_lang_token": "unk"
        },
        "gliner_config": {
            "labels": ["Person", "Organization", "Location", "Title of a work", "Date", "Event"],
            "model_name": "urchade/gliner_multi-v2.1",
            "multi_label": False,
            "resource_dir": "./rara_subject_indexer/data/ner_resources/",
            "threshold": 0.5,
            "device": "cpu"
        },
        "ner_method_map": {
            "PER": "ner_ensemble",
            "ORG": "ner_ensemble",
            "LOC": "ner_ensemble",
            "TITLE": "gliner",
            "EVENT": "gliner"
        }
    }
}
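To illustrate the role of ner_method_map: it assigns each recognized entity label to the method whose output is used for that label. Below is a minimal plain-Python sketch of that routing logic; it is illustrative only, not the library's internal code, and the route_entities helper is hypothetical.

```python
# Illustrative sketch (not library code): how a ner_method_map routes
# each recognized entity label to the extraction method handling it.
NER_METHOD_MAP = {
    "PER": "ner_ensemble",
    "ORG": "ner_ensemble",
    "LOC": "ner_ensemble",
    "TITLE": "gliner",
    "EVENT": "gliner",
}

def route_entities(entities: list[dict]) -> dict[str, list[str]]:
    """Group entity texts by the method responsible for their label."""
    grouped: dict[str, list[str]] = {}
    for ent in entities:
        method = NER_METHOD_MAP.get(ent["label"])
        if method is not None:  # skip labels with no configured method
            grouped.setdefault(method, []).append(ent["text"])
    return grouped

entities = [
    {"text": "Arno", "label": "PER"},
    {"text": "Paunvere", "label": "LOC"},
    {"text": "Kevade", "label": "TITLE"},
]
print(route_entities(entities))
# {'ner_ensemble': ['Arno', 'Paunvere'], 'gliner': ['Kevade']}
```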
Key Functions

Coming soon


Training Supervised and Unsupervised Models

If necessary, you can train the supervised and unsupervised models from scratch using the provided pipelines. The training process involves reading text and label files, preprocessing the text, and training the models using the extracted features.

Training an Omikuji Model for Supervised Keyword Extraction

A sample code snippet to train and predict using the Omikuji model is provided below:

from rara_subject_indexer.supervised.omikuji.omikuji_model import OmikujiModel

model = OmikujiModel()

model.train(
    text_file="texts.txt",         # File with one document per line
    label_file="labels.txt",       # File with semicolon-separated labels for each document
    language="et",                 # Language of the text, in ISO 639-1 format
    entity_type="Teemamärksõnad",  # Entity type for the keywords
    lemmatization_required=True,   # (Optional) Whether to lemmatize the text - only set False if text_file is already lemmatized
    max_features=20000,            # (Optional) Maximum number of features for TF-IDF extraction
    keep_train_file=False,         # (Optional) Whether to retain intermediate training files
    eval_split=0.1                 # (Optional) Proportion of the dataset used for evaluation
)

predictions = model.predict(
    text="Kui Arno isaga koolimajja jõudis",  # Text to classify
    top_k=3  # Number of top predictions to return
)  # Output: [('koolimajad', 0.262), ('isad', 0.134), ('õpilased', 0.062)]
📂 Data Format

The files provided to the train function should be in the following format:

  • A text file (.txt) where each line is a document.
    Document one content.
    Document two content.
    
  • A label file (.txt) where each line contains semicolon-separated labels corresponding to the text file.
    label1;label2
    label3;label4
    
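A quick way to generate and sanity-check a pair of such files before calling train (a plain-Python sketch using only the standard library; the file contents mirror the examples above):

```python
from pathlib import Path
import tempfile

# Write the two parallel files described above: one document per line in
# texts.txt, and matching semicolon-separated labels per line in labels.txt.
docs = ["Document one content.", "Document two content."]
labels = [["label1", "label2"], ["label3", "label4"]]

tmpdir = Path(tempfile.mkdtemp())
text_file = tmpdir / "texts.txt"
label_file = tmpdir / "labels.txt"

text_file.write_text("\n".join(docs), encoding="utf-8")
label_file.write_text("\n".join(";".join(ls) for ls in labels), encoding="utf-8")

# Every document line must have a corresponding label line.
n_docs = len(text_file.read_text(encoding="utf-8").splitlines())
n_labels = len(label_file.read_text(encoding="utf-8").splitlines())
assert n_docs == n_labels, "texts.txt and labels.txt are misaligned"
```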

Training Phraser for Unsupervised Keyword Extraction

A sample code snippet to train and predict using the Phraser model is provided below:

from rara_subject_indexer.utils.phraser_model import PhraserModel

model = PhraserModel()

model.train(
    train_data_path=".../train.txt",  # File with one document per line, text should be lemmatised.
    lang_code="et",                   # Language of the text, in ISO 639-1 format
    min_count=5,                      # (Optional) Minimum word frequency for phrase formation.
    threshold=10.0                    # (Optional) Score threshold for forming phrases.
)

predictions = model.predict(
    text="vabariik aastapäev sööma kiluvõileib"  # Lemmatised text for phrase detection
)  # Output: ['vabariik_aastapäev', 'sööma', 'kiluvõileib']
📂 Data Format

The file provided to the PhraserModel train function should be in the following format:

  • A text file (.txt) where each line is a document.
    Document one content.
    Document two content.
    
