Package for terminology management with TermBase eXchange (TBX)
determinator
DISCLAIMER - BETA PHASE
This package is currently in a beta phase.
Package for terminology management with the TermBase eXchange (TBX) format
Free software: MIT license
to determinate [ determ-i-nate ]
v.intr, determinated, determinating
To extract terms from one or more text documents and output the results in the TermBase eXchange (TBX) format.
Features
Extract expert terminology from documents in the NLP Annotation Format (NAF)
Generate and read TermBase eXchange (TBX) files according to ISO 30042:2019 (TBX-DNB dialect)
Add references and term notes from other sources (for example European IATE term bases)
Overview of the idea
We generate an empty TBX document with
    import determinator

    t = determinator.TbxDocument()
    t.generate(params={"sourceDesc": "TBX file, created via dnb/determinator"})
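For orientation, here is a minimal sketch of what an empty TBX skeleton could look like when built with the Python standard library. It is not determinator's own implementation: the helper name `build_empty_tbx`, the `style` and `xml:lang` values, and the use of "TBX-DNB" as the `type` attribute are assumptions made for illustration, and namespace declarations are left out for brevity.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_empty_tbx(source_desc, dialect="TBX-DNB", lang="en"):
    """Build a TBX skeleton: a header with a source description and an empty body.

    Hypothetical helper; the attribute values are assumptions, not taken
    from determinator.
    """
    root = ET.Element("tbx", {"type": dialect, "style": "dca", XML_LANG: lang})

    # Header: tbxHeader > fileDesc > sourceDesc > p
    header = ET.SubElement(root, "tbxHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    source = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source, "p").text = source_desc

    # Body: text > body, where conceptEntry elements would be added later
    text = ET.SubElement(root, "text")
    ET.SubElement(text, "body")
    return root


skeleton = build_empty_tbx("TBX file, created via dnb/determinator")
print(ET.tostring(skeleton, encoding="unicode"))
```

The concept entries produced in the later steps would be appended under the `body` element.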
Then we extract terms from the Solvency II Delegated Acts in NAF, for a subset of the available languages (including Dutch):
    import determinator
    import nafigator

    # Create a terms dictionary for a subset of languages
    terms = {}
    for language in ['NL', 'EN', 'DE', 'FR', 'ES', 'ET', 'DA', 'SV']:
        DOC_FILE = "..\\..\\nafigator-data\\data\\legislation\\Solvency II Delegated Acts - " + language + ".naf.xml"
        doc = nafigator.NafDocument().open(DOC_FILE)
        # Merge the terms of this document into the combined dictionary
        determinator.merge_terms_dict(terms, nafigator.extract_terms(doc))
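The idea of merging per-document results into one dictionary can be sketched as follows. The structure returned by `nafigator.extract_terms` is not shown in this README, so this sketch assumes a plain mapping from term text to hit count; `merge_term_counts` is a hypothetical stand-in, not `determinator.merge_terms_dict`, and the counts are made-up illustration values.

```python
def merge_term_counts(target, new_terms):
    """Add the hit counts from new_terms into target, in place.

    Hypothetical helper; assumes both arguments map term text to a hit count.
    """
    for term, hits in new_terms.items():
        target[term] = target.get(term, 0) + hits
    return target


# Illustration values only, not real extraction results
terms = {}
merge_term_counts(terms, {"solvency capital requirement": 200, "SCR": 40})
merge_term_counts(terms, {"solvency capital requirement": 66, "own funds": 12})
print(terms)
```

Terms seen in several documents accumulate their counts, while new terms are simply added.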
Then we create a termbase
    # Create an empty TermBase
    t = determinator.TbxDocument()
    t.generate(params={"sourceDesc": "TBX file, created via dnb/determinator"})
    # Fill the TermBase with the extracted terms
    t.create_tbx_from_terms_dict(terms=terms, params={'concept_id_prefix': 'tbx_'})
Then we add references from the InterActive Terminology for Europe (IATE) dataset:
    # Read the IATE file
    IATE_FILE = "..//data//iate//IATE_export.tbx"
    ref = determinator.TbxDocument().open(IATE_FILE)
    # Copy matching references from the IATE termbase
    t.copy_from_tbx(reference=ref)
Then we add termNotes from the Dutch Lassy dataset (the small version), extended with basic insurance terms:
    # Read the Lassy file
    LASSY_FILE = "..//data//lassy//lassy_with_insurance.tbx"
    lassy = determinator.TbxDocument().open(LASSY_FILE)
    # Add termNotes (part of speech, lemma, components) from the Lassy termbase
    t.add_termnotes_from_tbx(reference=lassy, params={'number_of_word_components': 5})
This gives us a termbase with concept entries like the following:
    <conceptEntry id="249">
      <descrip type="subjectField">insurance</descrip>
      <xref>IATE_2246604</xref>
      <ref>https://iate.europa.eu/entry/result/2246604/en</ref>
      <langSec xml:lang="nl">
        <termSec>
          <term>solvabiliteitskapitaalvereiste</term>
          <termNote type="partOfSpeech">noun</termNote>
          <note>source: data/Solvency II Delegated Acts - NL.txt (#hits=331)</note>
          <termNote type="termType">fullForm</termNote>
          <descrip type="reliabilityCode">9</descrip>
          <termNote type="lemma">solvabiliteits_kapitaalvereiste</termNote>
          <termNote type="grammaticalNumber">singular</termNote>
          <termNoteGrp>
            <termNote type="component">solvabiliteits-</termNote>
            <termNote type="component">kapitaal-</termNote>
            <termNote type="component">vereiste</termNote>
          </termNoteGrp>
        </termSec>
      </langSec>
      <langSec xml:lang="en">
        <termSec>
          <term>SCR</term>
          <termNote type="termType">abbreviation</termNote>
          <descrip type="reliabilityCode">9</descrip>
        </termSec>
        <termSec>
          <term>solvency capital requirement</term>
          <termNote type="termType">fullForm</termNote>
          <descrip type="reliabilityCode">9</descrip>
          <termNote type="partOfSpeech">noun, noun, noun</termNote>
          <note>source: data/Solvency II Delegated Acts - EN.txt (#hits=266)</note>
        </termSec>
      </langSec>
      <langSec xml:lang="fr">
        <termSec>
          <term>capital de solvabilité requis</term>
          <termNote type="termType">fullForm</termNote>
          <descrip type="reliabilityCode">9</descrip>
          <termNote type="partOfSpeech">noun, adp, noun, adj</termNote>
          <note>source: ../nafigator-data/data/legislation/Solvency II Delegated Acts - FR.txt (#hits=198)</note>
        </termSec>
        <termSec>
          <term>CSR</term>
          <termNote type="termType">abbreviation</termNote>
          <descrip type="reliabilityCode">9</descrip>
        </termSec>
      </langSec>
    </conceptEntry>
A reference to concept ‘2246604’ from the IATE dataset is included. From that reference we can derive, for example, that the official European term for this concept is ‘solvency capital requirement’ in English and ‘Solvenzkapitalanforderung’ in German, and that the term is defined in Directive 2009/138/EC (Solvency II).
The termNotes include the partOfSpeech, lemma and morphoFeats derived from the Lassy dataset (in Dutch). This dataset was extended with insurance-related word components and terms that were not included in the original Lassy dataset.
The word components of a term are also included. Dutch, like German, often glues components together to construct new words, where English uses separate words.
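A concept entry like the one above can be read back with the Python standard library alone. The sketch below works on a trimmed copy of that entry and does not use determinator's API. The IATE URL pattern is taken from the `ref` element in the example, and joining the components by dropping their trailing hyphens is an assumption that happens to hold for this term.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Trimmed copy of the concept entry shown above
ENTRY = """
<conceptEntry id="249">
  <xref>IATE_2246604</xref>
  <langSec xml:lang="nl">
    <termSec>
      <term>solvabiliteitskapitaalvereiste</term>
      <termNoteGrp>
        <termNote type="component">solvabiliteits-</termNote>
        <termNote type="component">kapitaal-</termNote>
        <termNote type="component">vereiste</termNote>
      </termNoteGrp>
    </termSec>
  </langSec>
  <langSec xml:lang="en">
    <termSec><term>SCR</term></termSec>
    <termSec><term>solvency capital requirement</term></termSec>
  </langSec>
</conceptEntry>
"""

entry = ET.fromstring(ENTRY.strip())

# Collect the terms of each language section
terms_by_lang = {
    lang_sec.get(XML_LANG): [term.text for term in lang_sec.iter("term")]
    for lang_sec in entry.iter("langSec")
}

# Derive the IATE entry URL from the cross-reference (pattern as in the example)
iate_id = entry.findtext("xref").split("_", 1)[1]
iate_url = f"https://iate.europa.eu/entry/result/{iate_id}/en"

# Recompose the Dutch compound from its word components
components = [
    note.text for note in entry.iter("termNote") if note.get("type") == "component"
]
recomposed = "".join(component.rstrip("-") for component in components)

print(terms_by_lang)
print(iate_url)
print(recomposed)
```

Keeping the components as separate termNotes makes it possible to relate a glued Dutch compound to the separate words of its English counterpart.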
Datasets
The TermBase eXchange format
History
0.1.0 (2022-01-02)
First release on PyPI.
0.1.1 (2022-05-22)
SKOS added.