A library for matching labels of thesaurus concepts to text and assigning scores to the found occurrences.

stwfsapy

About

This library provides functionality to find the labels of SKOS thesaurus concepts in text. It is a reimplementation in Python of stwfsa combined with the concept scoring from [1]. A deterministic finite automaton is constructed from the labels of the thesaurus concepts to perform the matching. In addition, a classifier is trained to score the matched occurrences of the concepts.

Data Requirements

The construction of the automaton requires a SKOS thesaurus represented as a rdflib Graph. Concepts should be related to labels by skos:prefLabel or skos:altLabel. Concepts have to be identifiable by rdf:type. The training of the predictor requires labeled text. Each training sample should be annotated with one or more concepts from the thesaurus.

Usage

Create predictor

First load your thesaurus.

from rdflib import Graph

g = Graph()
g.parse('/path/to/your/thesaurus')  # Graph.load was removed in rdflib 6; parse detects the format from the file extension

Next, define the type URI for descriptors. If your thesaurus is structured into sub-thesauri by grouping concepts into categories (using, e.g., skos:Collection), you can optionally specify the type of these categories via a URI. In this case you should also specify the relation that links concepts to categories, and indicate whether this relation is a specialisation relation (as opposed to a generalisation relation, which is the default). For the STW this would be:

descriptor_type_uri = 'http://zbw.eu/namespaces/zbw-extensions/Descriptor'
thsys_type_uri = 'http://zbw.eu/namespaces/zbw-extensions/Thsys'
thesaurus_relation_type_uri = 'http://www.w3.org/2004/02/skos/core#broader'
is_specialisation = False

Create the predictor

from stwfsapy.predictor import StwfsapyPredictor
p = StwfsapyPredictor(
    g,
    descriptor_type_uri,
    thsys_type_uri,
    thesaurus_relation_type_uri,
    is_specialisation,
    langs={'en'},
    simple_english_plural_rules=True)

The next step assumes you have loaded your texts into a list X and your labels into a list of lists y, such that for all indices 0 <= i < len(X), the list y[i] contains the URIs of the correct concepts for X[i]. Then you can train the classifier:

p.fit(X, y)
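For illustration, the expected shape of X and y can be sketched with made-up data. The texts and concept URIs below are hypothetical stand-ins, not real STW descriptors:

```python
# Hypothetical training data for illustration only; real labels must
# be concept URIs taken from your own thesaurus.
X = [
    "The central bank raised interest rates.",
    "New regulations for commercial banking.",
]
y = [
    ["http://example.org/thesaurus/concept-1"],
    ["http://example.org/thesaurus/concept-2",
     "http://example.org/thesaurus/concept-1"],
]

# X and y must be parallel lists: y[i] holds the concept URIs for X[i].
assert len(X) == len(y)
assert all(isinstance(labels, list) for labels in y)
```

Note that a single text may be annotated with any number of concepts, so each entry of y is itself a list.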

Afterwards you can get the predicted concepts and scores:

p.suggest_proba(['one input text', 'A completely different input text.'])

Alternatively you can get a sparse matrix of scores by calling

p.predict_proba(['one input text', 'Another input text.'])

The indices of the concepts are stored in p.concept_map_.
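Assuming concept_map_ maps concept URIs to column indices of the score matrix (the map and scores below are made-up stand-ins), one row of the output can be decoded back to ranked concepts like this:

```python
# Stand-in for p.concept_map_: maps concept URIs to column indices.
concept_map = {
    "http://example.org/concept/A": 0,  # hypothetical concepts
    "http://example.org/concept/B": 1,
}
scores_row = [0.12, 0.87]  # one row of the score matrix for one input text

# Invert the map so column indices point back to concept URIs.
index_to_concept = {idx: uri for uri, idx in concept_map.items()}

# Pair each concept with its score and sort by descending score.
ranked = sorted(
    ((index_to_concept[i], s) for i, s in enumerate(scores_row)),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[0])  # the highest-scoring concept comes first
```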

Options

Input Type

The StwfsapyPredictor class has an option input that controls how the feature argument X of the fit and transform methods is interpreted.

  • "content" expects string input. This is the default.
  • "file" expects python file handles.
  • "filename" expects paths to files.
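The difference between the three modes can be illustrated with a hypothetical helper; read_input below is not part of stwfsapy's API, it only mirrors how each mode treats an element of X:

```python
import io

def read_input(item, mode="content"):
    """Hypothetical sketch of how the three input modes obtain the text."""
    if mode == "content":
        return item                      # item is already the text
    if mode == "file":
        return item.read()               # item is an open file handle
    if mode == "filename":
        with open(item, encoding="utf-8") as fh:  # item is a path
            return fh.read()
    raise ValueError(f"unknown input mode: {mode!r}")

assert read_input("some text") == "some text"
assert read_input(io.StringIO("from a handle"), mode="file") == "from a handle"
```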

Text Vectorizer

StwfsapyPredictor can optionally use TFIDF features of the input texts to score the matches found by the finite automaton. However, this uses a lot of memory, so it is disabled by default.

Save Model

A trained predictor p can be stored by calling p.store('/path/to/storage/location'). Afterwards it can be loaded as follows:

from stwfsapy.predictor import StwfsapyPredictor

p = StwfsapyPredictor.load('/path/to/storage/location')

References

[1] Toepfer, Martin, and Christin Seifert. "Content-based quality estimation for automatic subject indexing of short texts under precision and recall constraints." International Conference on Theory and Practice of Digital Libraries. Springer, Cham, 2018.
