
A library for matching labels of thesaurus concepts to text and assigning scores to found occurrences.


stwfsapy


About

This library provides functionality to find the labels of SKOS thesaurus concepts in text. It is a reimplementation in Python of stwfsa combined with the concept scoring from [1]. A deterministic finite automaton is constructed from the labels of the thesaurus concepts to perform the matching. In addition, a classifier is trained to score the matched occurrences of the concepts.
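To make the matching idea concrete, here is a minimal sketch of automaton-style label matching using a character trie, which is a simple deterministic automaton. It is not the library's implementation: the names `LABELS`, `build_trie` and `find_matches` and the example.org URIs are hypothetical, and the sketch ignores word boundaries, plural rules and everything else a real matcher would handle.

```python
from typing import Dict, List, Tuple

# Hypothetical label-to-concept mapping; a real thesaurus would supply these.
LABELS: Dict[str, str] = {
    "labour market": "http://example.org/concept/1",
    "labour": "http://example.org/concept/2",
    "market": "http://example.org/concept/3",
}


def build_trie(labels: Dict[str, str]) -> dict:
    """Build a character trie; the key None marks an accepting state."""
    root: dict = {}
    for label, concept in labels.items():
        node = root
        for ch in label.lower():
            node = node.setdefault(ch, {})
        node[None] = concept
    return root


def find_matches(trie: dict, text: str) -> List[Tuple[str, int, int]]:
    """Return (concept, start, end) for every label occurrence in text."""
    text = text.lower()
    matches = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                # No transition for this character: stop at this start offset.
                break
            if None in node:
                # Accepting state: a complete label ends here.
                matches.append((node[None], start, end + 1))
    return matches
```

Calling `find_matches(build_trie(LABELS), "Labour market")` reports overlapping occurrences, which is why a separate scoring step is useful for deciding which matched concepts are actually relevant.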

Data Requirements

Constructing the automaton requires a SKOS thesaurus represented as an rdflib Graph. Concepts should be related to labels by skos:prefLabel or skos:altLabel. In addition, it is assumed that concepts are organized in a hierarchy that includes sub-thesauri. Concepts and sub-thesauri have to be distinguishable by rdf:type. Training the predictor requires labeled text: each training sample should be annotated with one or more concepts from the thesaurus.

Usage

Create predictor

First load your graph.

from rdflib import Graph

g = Graph()
# parse() reads the serialized thesaurus file into the graph
g.parse('/path/to/your/thesaurus')

Define the type URIs for descriptors and sub-thesauri. You also need to define the relation that links sub-thesauri to concepts. It is beneficial if this relation also structures the sub-thesauri themselves. Furthermore, you can indicate whether the thesaurus relation is a specialisation. For the STW this would be:

descriptor_type_uri = 'http://zbw.eu/namespaces/zbw-extensions/Descriptor'
thsys_type_uri = 'http://zbw.eu/namespaces/zbw-extensions/Thsys'
thesaurus_relation_type_uri = 'http://www.w3.org/2004/02/skos/core#broader'
is_specialisation = False

Create the predictor

from stwfsapy.predictor import StwfsapyPredictor
p = StwfsapyPredictor(
    g,
    descriptor_type_uri,
    thsys_type_uri,
    thesaurus_relation_type_uri,
    is_specialisation,
    langs={'en'},
    simple_english_plural_rules=True)

The next step assumes you have loaded your texts into a list X and your labels into a list of lists y, such that for every index 0 <= i < len(X) the list y[i] contains the URIs of the correct concepts for X[i]. Then you can train the classifier:
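As an illustration of the expected shape of X and y, the following sketch builds two tiny lists and checks that they are aligned. The texts, the example.org URIs and the helper `check_training_data` are hypothetical and not part of stwfsapy; real URIs would come from your thesaurus.

```python
from typing import List

# Hypothetical texts and concept URIs, for illustration only.
X: List[str] = [
    "Unemployment rose during the recession.",
    "Central banks adjust interest rates.",
]
y: List[List[str]] = [
    [
        "http://example.org/concept/unemployment",
        "http://example.org/concept/recession",
    ],
    ["http://example.org/concept/monetary-policy"],
]


def check_training_data(texts: List[str], labels: List[List[str]]) -> None:
    """Raise ValueError if texts and labels are not aligned as described."""
    if len(texts) != len(labels):
        raise ValueError("X and y must have the same length")
    for i, concepts in enumerate(labels):
        # Each sample should carry one or more concept URIs.
        if not concepts:
            raise ValueError(f"sample {i} has no concepts")
        if not all(isinstance(uri, str) for uri in concepts):
            raise ValueError(f"sample {i} contains a non-string URI")


check_training_data(X, y)
```

Running such a check before training catches misaligned or empty annotations early.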

p.fit(X, y)

Afterwards you can get the predicted concepts and scores:

p.suggest_proba(['one input text', 'A completely different input text.'])

Alternatively you can get a sparse matrix of scores by calling

p.predict_proba(['one input text', 'Another input text.'])

The indices of the concepts are stored in p.concept_map_.
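The sketch below shows one way to turn a row of scores back into concept URIs. It rests on assumptions that should be checked against the installed version: that the mapping goes from concept URI to column index, and that each row of the score matrix corresponds to one input text. A plain list of lists stands in for the sparse matrix, and `concept_map`, `scores` and `decode_row` are hypothetical names with made-up values.

```python
from typing import Dict, List, Tuple

# Assumed shape: concept URI -> column index (hypothetical values).
concept_map: Dict[str, int] = {
    "http://example.org/concept/a": 0,
    "http://example.org/concept/b": 1,
    "http://example.org/concept/c": 2,
}

# Stand-in for the score matrix: one row per text, one column per concept.
scores: List[List[float]] = [
    [0.0, 0.9, 0.2],
    [0.7, 0.0, 0.0],
]


def decode_row(
    row: List[float], mapping: Dict[str, int]
) -> List[Tuple[str, float]]:
    """Return (URI, score) pairs for nonzero columns, highest score first."""
    # Invert the mapping so column indices can be looked up.
    index_to_uri = {index: uri for uri, index in mapping.items()}
    pairs = [(index_to_uri[i], s) for i, s in enumerate(row) if s > 0.0]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

With a real sparse matrix you would iterate over the stored entries of each row instead of scanning every column.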

Save Model

A trained predictor p can be stored by calling p.store('/path/to/storage/location'). Afterwards it can be loaded as follows:

from stwfsapy.predictor import StwfsapyPredictor

p = StwfsapyPredictor.load('/path/to/storage/location')

References

[1] Toepfer, Martin, and Christin Seifert. "Content-based quality estimation for automatic subject indexing of short texts under precision and recall constraints." International Conference on Theory and Practice of Digital Libraries. Springer, Cham, 2018.
