Invenio module for record classification.

Features

Classifier automatically extracts keywords from fulltext documents. Assigning keywords to textual documents automatically has clear benefits in a digital library environment, as it aids the cataloguing, classification and retrieval of documents.

Keyword extraction is simple

In order to extract relevant keywords from a document fulltext.pdf based on a controlled vocabulary thesaurus.rdf, you would run Classifier as follows:

${INVENIO_WEB_INSTANCE} classifier extract -k thesaurus.rdf -f fulltext.pdf

Launching ${INVENIO_WEB_INSTANCE} classifier --help shows the options available.

As an example, running classifier on document nucl-th/0204033 using the high-energy physics RDF/SKOS taxonomy (HEP.rdf) would yield the following results (based on the HEP taxonomy from October 10th 2008):

Input file: 0204033.pdf

Author keywords:
Dense matter
Saturation
Unstable nuclei

Composite keywords:
10  nucleus: stability [36, 14]
6  saturation: density [25, 31]
6  energy: symmetry [35, 11]
4  nucleon: density [13, 31]
3  energy: Coulomb [35, 3]
2  energy: density [35, 31]
2  nuclear matter: asymmetry [21, 2]
1  n: matter [54, 36]
1  n: density [54, 31]
1  n: mass [54, 16]

Single keywords:
61  K0
23  equation of state
12  slope
4  mass number
4  nuclide
3  nuclear model
3  mass formula
2  charge distribution
2  elastic scattering
2  binding energy
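
If the listing above needs to be post-processed, its lines can be turned into structured records. The following is a minimal sketch that assumes the plain-text layout shown above; `LINE_RE` and `parse_keyword_lines` are illustrative names and not part of Invenio-Classifier. The bracketed pair on composite keyword lines is kept as raw integers without interpreting it.

```python
import re

# One result line: a count, a keyword and, for composite keywords only,
# a bracketed pair of integers, e.g. "10  nucleus: stability [36, 14]".
LINE_RE = re.compile(r"^(\d+)\s+(.+?)(?:\s+\[(\d+),\s*(\d+)\])?$")


def parse_keyword_lines(lines):
    """Turn result lines into (count, keyword, bracket_pair) tuples.

    bracket_pair is None when the line carries no bracketed pair.
    """
    parsed = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip section headings and blank lines
        count, keyword, first, second = match.groups()
        pair = (int(first), int(second)) if first is not None else None
        parsed.append((int(count), keyword, pair))
    return parsed


sample = [
    "Composite keywords:",
    "10  nucleus: stability [36, 14]",
    "6  saturation: density [25, 31]",
    "",
    "Single keywords:",
    "61  K0",
    "23  equation of state",
]
results = parse_keyword_lines(sample)
```

Section headings such as "Single keywords:" do not start with a number, so they are skipped rather than parsed.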

Thesaurus

Classifier extracts keywords based on the recurrence of specific terms taken from a controlled vocabulary, that is, a thesaurus of all the terms that are relevant in a specific context. When the context is a discipline or branch of knowledge, the vocabulary is called a subject thesaurus. Many subject thesauri already exist.

A subject thesaurus can be expressed in several different formats, since different institutions and disciplines have developed different ways of representing their vocabulary systems. The taxonomy used by Classifier is expressed in RDF/SKOS. This format makes it possible not only to list keywords but also to specify relations between keywords and alternative ways to represent the same keyword.

<Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#scalar">
 <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.fieldtheoryscalar"/>
 <prefLabel xml:lang="en">scalar</prefLabel>
 <note xml:lang="en">nostandalone</note>
</Concept>

<Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#fieldtheory">
 <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.fieldtheoryscalar"/>
 <prefLabel xml:lang="en">field theory</prefLabel>
 <altLabel xml:lang="en">QFT</altLabel>
 <hiddenLabel xml:lang="en">/field theor\w*/</hiddenLabel>
 <note xml:lang="en">nostandalone</note>
</Concept>

<Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Composite.fieldtheoryscalar">
 <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#scalar"/>
 <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#fieldtheory"/>
 <prefLabel xml:lang="en">field theory: scalar</prefLabel>
 <altLabel xml:lang="en">scalar field</altLabel>
</Concept>

In RDF/SKOS, every keyword is wrapped in a concept, which encapsulates the full semantics and hierarchical status of a term (including synonyms, alternative forms, broader concepts, notes and so on) rather than just a plain keyword.
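
To illustrate how the labels of a concept can be read, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not how Classifier itself loads a taxonomy. The `rdf:RDF` wrapper and the namespace declarations are assumptions, since the excerpt above omits them, and `read_concepts` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumed wrapper: the excerpt above does not show its namespace declarations.
DOC = """<rdf:RDF xmlns:rdf="{rdf}" xmlns="{skos}">
 <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#fieldtheory">
  <prefLabel xml:lang="en">field theory</prefLabel>
  <altLabel xml:lang="en">QFT</altLabel>
  <hiddenLabel xml:lang="en">/field theor\\w*/</hiddenLabel>
  <note xml:lang="en">nostandalone</note>
 </Concept>
</rdf:RDF>""".format(rdf=RDF, skos=SKOS)


def read_concepts(xml_text):
    """Map each concept URI to its preferred, alternative and hidden labels."""
    root = ET.fromstring(xml_text)
    concepts = {}
    for concept in root.findall("{%s}Concept" % SKOS):
        def texts(tag):
            # Collect the text of every child element with the given SKOS tag.
            return [e.text for e in concept.findall("{%s}%s" % (SKOS, tag))]

        uri = concept.get("{%s}about" % RDF)
        concepts[uri] = {
            "pref": texts("prefLabel"),
            "alt": texts("altLabel"),
            "hidden": texts("hiddenLabel"),
            "notes": texts("note"),
        }
    return concepts


concepts = read_concepts(DOC)
```

Each concept ends up as a small dictionary of label lists keyed by its URI, which keeps the preferred label separate from the alternative and hidden forms.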

The specification of the SKOS language and various manuals that aid the building of a semantic thesaurus can be found at the SKOS W3C website. Furthermore, Classifier can function on top of an extended version of SKOS, which includes special elements such as key chains, composite keywords and special annotations.

Keyword extraction

Classifier computes the keywords of a fulltext document based on the frequency of thesaurus terms in it. In other words, it counts how many times each thesaurus keyword (including the alternative and hidden labels defined in the taxonomy) appears in the text and ranks the results. Unlike other similar systems, Classifier does not use any machine learning or AI methodology, just plain phrase matching with regular expressions: it relies on the structure and richness of the thesaurus to produce accurate results. Classifier therefore performs best on top of rich, well-structured subject thesauri expressed in the RDF/SKOS language.
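
The idea of frequency-based matching can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Classifier's actual implementation: labels written as `/.../` are treated as regular expressions (as the hidden label in the thesaurus excerpt suggests), matching is case-insensitive, and the names `build_pattern` and `count_keywords` are hypothetical. Composite keywords and notes such as `nostandalone` are not handled.

```python
import re
from collections import Counter


def build_pattern(labels):
    """Compile one regex matching any label of a concept.

    Labels delimited by slashes are used as regular expressions
    (assumption); all other labels are matched literally.
    """
    parts = []
    for label in labels:
        if len(label) > 2 and label.startswith("/") and label.endswith("/"):
            parts.append(label[1:-1])
        else:
            parts.append(re.escape(label))
    # A single alternation means overlapping labels count a span only once.
    return re.compile(r"\b(?:%s)\b" % "|".join(parts), re.IGNORECASE)


def count_keywords(text, thesaurus):
    """Rank preferred labels by how often any of their labels occurs."""
    counts = Counter()
    for preferred, labels in thesaurus.items():
        hits = sum(1 for _ in build_pattern(labels).finditer(text))
        if hits:
            counts[preferred] = hits
    return counts.most_common()


thesaurus = {
    "field theory": ["field theory", "QFT", r"/field theor\w*/"],
    "scalar": ["scalar"],
}
text = "Scalar field theories extend QFT. A field theory with a scalar."
ranking = count_keywords(text, thesaurus)
# ranking == [("field theory", 3), ("scalar", 2)]
```

Combining all labels of a concept into one alternation avoids counting the same span twice when a literal label and a hidden regular expression both match it.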

Happy hacking and thanks for flying Invenio-Classifier.

Changes

Version 1.2.0 (release 2017-06-21)

Incompatible changes

  • Do not use keywords as dictionary keys, rather as elements in a list.

Version 1.1.2 (release 2017-05-22)

Bug fixes

  • Supports the ‘·’ author separator.

  • Supports UTF-8 author keywords.

Version 1.1.1 (release 2017-05-19)

Bug fixes

  • Enforces UTF-8 also for non-PDF files in the extractor.

Version 1.1.0 (release 2017-05-17)

Incompatible changes

  • Changes the dict export format for author keywords to a more semantic structure.

  • Renames keys in dict export to be lower case and separated by _.

Bug fixes

  • Drop trailing dots in author keywords.

Version 1.0.1 (release 2017-01-11)

Incompatible changes

  • Changes module to be compatible with Invenio 3.

Bug fixes

  • Fixes a crash when trying to discover a taxonomy when CLASSIFIER_WORKDIR is set to None.

  • Updates minimum dependencies of Invenio packages to newer versions.

  • Removes a bug in bibclassify_keyword_analyzer.py. If a combination is found via a synonym or regexp it is no longer thrown away just because the components of the combination are not found in the text.

  • Adds missing invenio_base dependency.

Version 0.1.0 (release 2015-08-19)

  • Initial public release.


