
A library for matching labels of thesaurus concepts to text and assigning scores to the occurrences found.


stwfsapy


About

This library provides the functionality to find SKOS thesaurus concepts in a text. It is a reimplementation in Python of stwfsa combined with the concept scoring from [1]. A deterministic finite automaton is constructed from the labels of the thesaurus concepts to perform the matching. In addition, a classifier is trained to score the matched concept occurrences.
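
To make the matching idea concrete, here is a minimal sketch that uses a character trie, which is a simple deterministic automaton over characters. It is not the construction stwfsapy uses internally; the function names `build_trie` and `find_concepts` and the `ex:` concept identifiers are made up for illustration, and matching is reduced to case-insensitive whole-word lookup.

```python
def build_trie(labels):
    """Build a character trie from (label, concept) pairs.

    Each node is a dict keyed by character; the key None holds the set of
    concepts whose label ends at that node.
    """
    root = {}
    for label, concept in labels:
        node = root
        for ch in label.lower():
            node = node.setdefault(ch, {})
        node.setdefault(None, set()).add(concept)
    return root


def find_concepts(trie, text):
    """Return (start, end, concept) for every whole-word label occurrence."""
    lowered = text.lower()
    hits = []
    for start in range(len(lowered)):
        # Only start matching at a word boundary.
        if start > 0 and lowered[start - 1].isalnum():
            continue
        node = trie
        for end in range(start, len(lowered)):
            node = node.get(lowered[end])
            if node is None:
                break  # no label continues with this character
            nxt = end + 1
            # Only accept a match that also ends at a word boundary.
            if nxt == len(lowered) or not lowered[nxt].isalnum():
                for concept in sorted(node.get(None, ())):
                    hits.append((start, nxt, concept))
    return hits


# Hypothetical labels and concept identifiers.
labels = [
    ("labour market", "ex:LabourMarket"),
    ("labor market", "ex:LabourMarket"),
    ("market", "ex:Market"),
]
trie = build_trie(labels)
hits = find_concepts(trie, "The labour market and the housing market.")
```

Each hit is a candidate occurrence. In the library, such occurrences are then scored by the trained classifier rather than accepted as they are.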

Data Requirements

The construction of the automaton requires a SKOS thesaurus represented as an rdflib Graph. Concepts should be related to labels by one of the relations skos:prefLabel, skos:altLabel, or skos:hiddenLabel. (This implementation also includes zbwext:altLabelNarrower and zbwext:altLabelRelated as possible concept-label relations; these are specific to ZBW.) Concepts have to be identifiable by rdf:type. The training of the predictor requires annotated text. Each training sample should be annotated with one or more concepts from the thesaurus.

Installation

Requirements

Python >= 3.10,<3.14 is required.

With pip

stwfsapy is available on PyPI. You can install stwfsapy using pip:

pip install stwfsapy

This will install a Python package called stwfsapy.

Note that it is generally recommended to use a virtual environment to avoid conflicting behaviour with the system package manager.

From source

You can also check out the repository and install the package from source. This requires uv:

# call inside the project directory
uv sync --no-group ci

Usage

Create predictor

First load your thesaurus.

from rdflib import Graph

g = Graph()
g.parse('/path/to/your/thesaurus')

Next, define the type URI for descriptors. If your thesaurus is structured into sub-thesauri by providing categories for its concepts using, e.g., skos:Collection, you can optionally specify the type of these categories via a URI. In this case you should also specify the relation that relates concepts to categories. Furthermore, you can indicate whether this relation is a specialisation relation (as opposed to a generalisation relation, which is the default). For the STW this would be:

descriptor_type_uri = 'http://zbw.eu/namespaces/zbw-extensions/Descriptor'
thsys_type_uri = 'http://zbw.eu/namespaces/zbw-extensions/Thsys'
thesaurus_relation_type_uri = 'http://www.w3.org/2004/02/skos/core#broader'
is_specialisation = False
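
The difference between a generalisation and a specialisation relation is the direction of the triple. The sketch below shows one reading of that flag on plain tuples: with a generalisation relation such as skos:broader, the concept is the subject and the category is the object; with a specialisation relation such as skos:narrower, it is the other way round. The helper `categories_of` and the `http://example.org/` URIs are hypothetical, and only direct relations are followed.

```python
BROADER = "http://www.w3.org/2004/02/skos/core#broader"
NARROWER = "http://www.w3.org/2004/02/skos/core#narrower"


def categories_of(concept, triples, relation, is_specialisation=False):
    """Return the categories directly related to a concept.

    triples is an iterable of (subject, predicate, object) strings.
    """
    if is_specialisation:
        # category --relation--> concept, e.g. skos:narrower
        return {s for s, p, o in triples if p == relation and o == concept}
    # concept --relation--> category, e.g. skos:broader
    return {o for s, p, o in triples if p == relation and s == concept}


# Hypothetical URIs, for illustration only.
concept_uri = "http://example.org/descriptor/labour-market"
category_uri = "http://example.org/thsys/labour"

broader_triples = [(concept_uri, BROADER, category_uri)]
narrower_triples = [(category_uri, NARROWER, concept_uri)]

via_broader = categories_of(concept_uri, broader_triples, BROADER)
via_narrower = categories_of(
    concept_uri, narrower_triples, NARROWER, is_specialisation=True
)
```

Both calls find the same category, which is why the flag has to agree with the relation URI you pass.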

Create the predictor

from stwfsapy.predictor import StwfsapyPredictor
p = StwfsapyPredictor(
    g,
    descriptor_type_uri,
    thsys_type_uri,
    thesaurus_relation_type_uri,
    is_specialisation,
    langs={'en'},
    simple_english_plural_rules=True)

The next step assumes you have loaded your texts into a list X and your labels into a list of lists y, such that for every index 0 <= i < len(X) the list y[i] contains the URIs of the correct concepts for X[i]. Then you can train the classifier:

p.fit(X, y)
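
Before calling fit, it can help to check that X and y have the shape described above. The following is a small sketch of such a check; `check_training_data` is a hypothetical helper that is not part of stwfsapy, and the concept URIs are placeholders.

```python
def check_training_data(X, y):
    """Raise ValueError if X and y do not have the expected shape."""
    if len(X) != len(y):
        raise ValueError(f"{len(X)} texts but {len(y)} label lists")
    for i, (text, concepts) in enumerate(zip(X, y)):
        if not isinstance(text, str):
            raise ValueError(f"X[{i}] is not a string")
        if len(concepts) == 0:
            raise ValueError(f"y[{i}] contains no concept URI")
        if not all(isinstance(uri, str) for uri in concepts):
            raise ValueError(f"y[{i}] contains a non-string entry")


# Placeholder texts and concept URIs.
X_example = [
    "Unemployment rose sharply during the recession.",
    "The central bank raised interest rates.",
]
y_example = [
    ["http://example.org/descriptor/unemployment"],
    [
        "http://example.org/descriptor/central-bank",
        "http://example.org/descriptor/interest-rate",
    ],
]

check_training_data(X_example, y_example)  # passes silently
```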

Afterwards you can get the predicted concepts and scores:

p.suggest_proba(['one input text', 'A completely different input text.'])

Alternatively you can get a sparse matrix of scores by calling

p.predict_proba(['one input text', 'Another input text.'])

The indices of the concepts are stored in p.concept_map_.
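
To turn matrix columns back into concepts, the mapping has to be inverted. The sketch below assumes that the map goes from concept URI to column index; verify this against your installed version. It works on plain (row, column, score) entries, which you could obtain from a SciPy sparse matrix via `tocoo()` and its `row`, `col`, and `data` attributes. The helper `decode_scores` and all values are stand-ins, not output of the library.

```python
def decode_scores(entries, concept_map):
    """Group (row, column, score) entries by text, best score first.

    concept_map is assumed to map concept URI -> column index.
    """
    index_to_concept = {index: uri for uri, index in concept_map.items()}
    per_text = {}
    for row, col, score in entries:
        per_text.setdefault(row, []).append((index_to_concept[col], score))
    for pairs in per_text.values():
        pairs.sort(key=lambda pair: pair[1], reverse=True)
    return per_text


# Stand-in values, for illustration only.
example_map = {
    "http://example.org/descriptor/a": 0,
    "http://example.org/descriptor/b": 1,
}
example_entries = [(0, 0, 0.2), (0, 1, 0.9), (1, 0, 0.5)]

decoded = decode_scores(example_entries, example_map)
```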

Options

All options for the predictor are documented at https://stwfsapy-zbw.readthedocs.io.

Save Model

A trained predictor p can be stored by calling p.store('/path/to/storage/location'). Afterwards it can be loaded as follows:

from stwfsapy.predictor import StwfsapyPredictor

StwfsapyPredictor.load('/path/to/storage/location')

Contribute

Contributions via pull requests are welcome. Please create an issue beforehand to explain and discuss the reasons for the respective contribution. We recommend forking the repository, if you have not already done so, before working on any possible pull request.

stwfsapy code should follow the Black style. The Black tool is included as a development dependency; you can run black . in the project root to autoformat code. Linting and formatting can also be run through a Git pre-commit hook: the repository contains a .pre-commit-config.yaml configuration file, and the pre-commit tool is included as a development dependency. Run pre-commit install inside your local virtual environment. Afterwards, Black and Ruff check the linting and formatting of modified or new files each time git commit is executed.

References

[1] Toepfer, Martin, and Christin Seifert. "Fusion architectures for automatic subject indexing under concept drift." International Journal on Digital Libraries (IJDL), 2018.

Context information

This code was created as part of the subject indexing automation effort at ZBW – Leibniz Information Centre for Economics. See our homepage for more information, publications, and contact details.
