A library for matching labels of thesaurus concepts to text and assigning scores to the occurrences found.

stwfsapy

About

This library provides the functionality to find SKOS thesaurus concepts in a text. It is a reimplementation in Python of stwfsa combined with the concept scoring from [1]. A deterministic finite automaton is constructed from the labels of the thesaurus concepts to perform the matching. In addition, a classifier is trained to score the matched concept occurrences.
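
To make the matching idea concrete, here is a minimal sketch of label matching with a character trie, which is a simple form of deterministic finite automaton. It is not stwfsapy's actual automaton: the function names, the `$concept` marker, and the example.org URIs are all hypothetical, and the sketch ignores word boundaries, plural rules, and the scoring step.

```python
from typing import Dict, List, Tuple

# Multi-character key, so it can never collide with a single-character edge.
ACCEPT = "$concept"


def build_trie(labels: Dict[str, str]) -> dict:
    """Build a character trie from a {label: concept_uri} mapping."""
    root: dict = {}
    for label, concept in labels.items():
        node = root
        for ch in label.lower():
            node = node.setdefault(ch, {})
        node[ACCEPT] = concept  # mark this state as accepting
    return root


def find_concepts(text: str, trie: dict) -> List[Tuple[str, int, int]]:
    """Return (concept_uri, start, end) for every label occurrence in text."""
    lowered = text.lower()
    matches = []
    for start in range(len(lowered)):
        node = trie
        for end in range(start, len(lowered)):
            node = node.get(lowered[end])
            if node is None:
                break  # no transition for this character
            if ACCEPT in node:
                matches.append((node[ACCEPT], start, end + 1))
    return matches


# Hypothetical labels and URIs, for illustration only.
labels = {
    "inflation": "http://example.org/c1",
    "monetary policy": "http://example.org/c2",
}
trie = build_trie(labels)
matches = find_concepts("Monetary policy and inflation", trie)
```

Each match records which concept was found and where, which is the kind of information a downstream classifier could use to score occurrences.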

Data Requirements

The construction of the automaton requires a SKOS thesaurus represented as an rdflib Graph. Concepts should be related to labels by skos:prefLabel, skos:altLabel, zbwext:altLabelNarrower, zbwext:altLabelRelated or skos:hiddenLabel. Concepts have to be identifiable by rdf:type. The training of the predictor requires annotated text. Each training sample should be annotated with one or more concepts from the thesaurus.

Installation

Requirements

Python >= 3.9 is required.

With pip

stwfsapy is available on PyPI. You can install stwfsapy using pip:

pip install stwfsapy

This will install a Python package called stwfsapy.

Note that it is generally recommended to use a virtual environment to avoid conflicting behaviour with the system package manager.

From source

You also have the option to check out the repository and install the package from source. You need poetry to do this:

# call inside the project directory
poetry install --without ci 

Usage

Create predictor

First load your thesaurus.

from rdflib import Graph

g = Graph()
g.parse('/path/to/your/thesaurus')

Next, define the type URI for descriptors. If your thesaurus is structured into sub-thesauri by categories for its concepts (e.g., using skos:Collection), you can optionally specify the type of these categories via a URI. In this case you should also specify the relation that links concepts to categories. Furthermore, you can indicate whether this relation is a specialisation relation (as opposed to a generalisation relation, which is the default). For the STW this would be:

descriptor_type_uri = 'http://zbw.eu/namespaces/zbw-extensions/Descriptor'
thsys_type_uri = 'http://zbw.eu/namespaces/zbw-extensions/Thsys'
thesaurus_relation_type_uri = 'http://www.w3.org/2004/02/skos/core#broader'
is_specialisation = False

Then create the predictor:

from stwfsapy.predictor import StwfsapyPredictor
p = StwfsapyPredictor(
    g,
    descriptor_type_uri,
    thsys_type_uri,
    thesaurus_relation_type_uri,
    is_specialisation,
    langs={'en'},
    simple_english_plural_rules=True)

The next step assumes you have loaded your texts into a list X and your labels into a list of lists y, such that for every index 0 <= i < len(X), the list y[i] contains the URIs of the correct concepts for X[i]. Then you can train the classifier:

p.fit(X, y)
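
As an illustration of the expected layout of X and y, the following sketch checks that the two lists line up before they are passed to fit. The helper check_training_data and the example.org URIs are hypothetical and not part of stwfsapy. The requirement of at least one concept per sample follows the Data Requirements section.

```python
from typing import Sequence


def check_training_data(X: Sequence[str], y: Sequence[Sequence[str]]) -> None:
    """Raise if X and y do not have the layout described above."""
    if len(X) != len(y):
        raise ValueError(f"X has {len(X)} texts but y has {len(y)} label lists")
    for i, (text, concepts) in enumerate(zip(X, y)):
        if not isinstance(text, str):
            raise TypeError(f"X[{i}] must be a string")
        if isinstance(concepts, str):
            # A bare string would be iterated character by character.
            raise TypeError(f"y[{i}] must be a list of URIs, not a single string")
        if len(concepts) == 0:
            raise ValueError(f"y[{i}] must contain at least one concept URI")


# Hypothetical training data: two texts, each annotated with concept URIs.
X = [
    "Central banks raised interest rates to fight inflation.",
    "Unemployment fell in the third quarter.",
]
y = [
    ["http://example.org/concept/inflation",
     "http://example.org/concept/interest-rate"],
    ["http://example.org/concept/unemployment"],
]
check_training_data(X, y)  # passes silently when the layout is valid
```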

Afterwards you can get the predicted concepts and scores:

p.suggest_proba(['one input text', 'A completely different input text.'])

Alternatively, you can get a sparse matrix of scores by calling:

p.predict_proba(['one input text', 'Another input text.'])

The indices of the concepts are stored in p.concept_map_.
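
The following sketch shows one way to turn a row of scores back into concept URIs. It assumes that p.concept_map_ behaves like a dictionary from concept URI to column index, which you should verify against the documentation. The helper top_concepts, the example URIs, and the scores are hypothetical, and a plain list stands in for one row of the score matrix.

```python
from typing import List, Mapping, Sequence, Tuple


def top_concepts(
    row_scores: Sequence[float],
    concept_map: Mapping[str, int],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return up to k (concept_uri, score) pairs with non-zero score, best first."""
    # Invert the assumed URI -> column index mapping.
    index_to_concept = {idx: uri for uri, idx in concept_map.items()}
    scored = [
        (index_to_concept[i], score)
        for i, score in enumerate(row_scores)
        if score > 0
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical stand-ins for p.concept_map_ and one dense row of scores.
concept_map = {
    "http://example.org/c1": 0,
    "http://example.org/c2": 1,
    "http://example.org/c3": 2,
}
row = [0.0, 0.9, 0.4]
best = top_concepts(row, concept_map, k=2)
```

With a SciPy sparse matrix, a dense row could be obtained with something like scores[i].toarray().ravel() before passing it to such a helper.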

Options

All options for the predictor are documented at https://stwfsapy.readthedocs.io/.

Save Model

A trained predictor p can be stored by calling p.store('/path/to/storage/location'). Afterwards it can be loaded as follows:

from stwfsapy.predictor import StwfsapyPredictor

StwfsapyPredictor.load('/path/to/storage/location')

Contribute

Contributions via pull requests are welcome. Please create an issue beforehand to explain and discuss the reasons for the respective contribution.

References

[1] Toepfer, Martin, and Christin Seifert. "Fusion architectures for automatic subject indexing under concept drift." International Journal on Digital Libraries (IJDL), 2018.

Context information

This code was created as part of the subject indexing automation effort at ZBW – Leibniz Information Centre for Economics. See our homepage for more information, publications, and contact details.
