UIMA CAS processing library in Python
DKPro cassis (pronunciation: [ka.sis]) provides a pure-Python implementation of the Common Analysis System (CAS) as defined by the UIMA framework. The CAS is a data structure representing an object to be enriched with annotations (the so-called Subject of Analysis, or SofA for short).
This library enables the creation and manipulation of CAS objects and their associated type systems, as well as loading and saving CAS objects in the CAS XMI XML representation, from Python programs. In particular, this can ease the integration of Python-based Natural Language Processing libraries (e.g. spaCy or NLTK) and Machine Learning libraries (e.g. scikit-learn or Keras) into UIMA-based text analysis workflows.
An example of cassis in action is the spaCy recommender for INCEpTION. It wraps the spaCy NLP library as a web service that can be used together with the INCEpTION text annotation platform to automatically generate annotation suggestions.
Features
Currently supported features are:
Text SofAs
Deserializing/serializing UIMA CAS from/to XMI
Deserializing/serializing type systems from/to XML
Selecting annotations, selecting covered annotations, adding annotations
Type inheritance
Multiple SofA support
Type system can be changed after loading
Reference, array and list features
Some features are still under development, e.g.
Proper type checking
XML/XMI schema validation
Type unmarshalling from string to the actual type specified in the type system
Installation
To install the package with pip, just run
pip install dkpro-cassis
Usage
Example CAS XMI and type system files can be found under tests\test_files.
Loading a CAS
A CAS can be deserialized from XMI either by reading from a file or string using load_cas_from_xmi.
from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open('cas.xml', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)
Saving a CAS as XMI
A CAS can be serialized to XMI using cas.to_xmi(), which either writes to a file or returns the XMI as a string.
from cassis import *

with open('cas.xml', 'rb') as f:
    cas = load_cas_from_xmi(f)

# Returned as a string
xmi = cas.to_xmi()

# Written to file
cas.to_xmi("my_cas.xmi")
Adding annotations
Given a type system with a type cassis.Token that has an id and a pos feature, annotations can be added as follows:
from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open('cas.xml', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

Token = typesystem.get_type('cassis.Token')

tokens = [
    Token(begin=0, end=3, id='0', pos='NNP'),
    Token(begin=4, end=10, id='1', pos='VBD'),
    Token(begin=11, end=14, id='2', pos='IN'),
    Token(begin=15, end=18, id='3', pos='DT'),
    Token(begin=19, end=24, id='4', pos='NN'),
    Token(begin=25, end=26, id='5', pos='.'),
]

for token in tokens:
    cas.add_annotation(token)
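The begin and end values above are character offsets into the Sofa text, with end being exclusive. As a minimal sketch, assuming the document text is "Joe waited for the train ." and that tokens are separated by whitespace, such offsets could be computed with the standard library. The helper whitespace_token_offsets is a hypothetical name used for illustration and is not part of cassis.

```python
import re


def whitespace_token_offsets(text):
    """Return (begin, end) character offsets of whitespace-separated tokens.

    Hypothetical helper for illustration; `end` is exclusive.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


# Assumed document text (the first sentence of the Sofa example in this README)
text = "Joe waited for the train ."
offsets = whitespace_token_offsets(text)

for begin, end in offsets:
    print(begin, end, text[begin:end])
```

Each (begin, end) pair could then be passed to the Token constructor as in the example above.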
Selecting annotations
from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open('cas.xml', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

for sentence in cas.select('cassis.Sentence'):
    for token in cas.select_covered('cassis.Token', sentence):
        print(token.get_covered_text())

        # Annotation values can be accessed as properties
        print('Token: begin={0}, end={1}, id={2}, pos={3}'.format(token.begin, token.end, token.id, token.pos))
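The idea behind selecting covered annotations can be sketched in plain Python: an annotation counts as covered when its span lies entirely within the span of the covering annotation. The Span class and the covered function below are hypothetical stand-ins written for illustration. They are not the cassis implementation, which may differ in ordering and edge cases.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for an annotation: a type name plus offsets."""
    type_name: str
    begin: int
    end: int


def covered(spans, type_name, covering):
    """Return spans of `type_name` lying entirely within `covering`."""
    return [
        s for s in spans
        if s.type_name == type_name
        and s is not covering
        and covering.begin <= s.begin
        and s.end <= covering.end
    ]


text = "Joe waited for the train . The train was late ."

# Tokens are derived from whitespace; the sentence spans are set by hand.
tokens = [Span("Token", m.start(), m.end()) for m in re.finditer(r"\S+", text)]
first = Span("Sentence", 0, 26)
second = Span("Sentence", 27, len(text))

first_texts = [text[t.begin:t.end] for t in covered(tokens, "Token", first)]
second_texts = [text[t.begin:t.end] for t in covered(tokens, "Token", second)]
print(first_texts)
print(second_texts)
```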
Selecting nested features
If you have nested feature structures, e.g. a feature structure with feature a that has a feature b that has a feature c, some of which can be None, then you can use the following:
fs.get("a.b.c")
If a, b or c is None, then this returns None instead of throwing an error.
Another example would be a StringList containing ["Foo", "Bar", "Baz"]:
assert lst.get("head") == "Foo"
assert lst.get("tail.head") == "Bar"
assert lst.get("tail.tail.head") == "Baz"
assert lst.get("tail.tail.tail.head") is None
assert lst.get("tail.tail.tail.tail.head") is None
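The None-safe behaviour of such dotted paths can be illustrated with a small standalone sketch. The Node class and the get_path function are hypothetical and only mimic the behaviour described above. In particular, the end of the list is modelled here with None, which is a simplification and not necessarily how cassis represents list ends.

```python
class Node:
    """Hypothetical linked-list node with `head` and `tail` attributes."""

    def __init__(self, head, tail=None):
        self.head = head
        self.tail = tail


def get_path(obj, path):
    """Follow a dotted attribute path, returning None once a step is missing."""
    current = obj
    for name in path.split("."):
        if current is None:
            return None
        current = getattr(current, name, None)
    return current


lst = Node("Foo", Node("Bar", Node("Baz")))

print(get_path(lst, "tail.head"))
print(get_path(lst, "tail.tail.tail.head"))
```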
Creating types and adding features
from cassis import *
typesystem = TypeSystem()
parent_type = typesystem.create_type(name='example.ParentType')
typesystem.add_feature(type_=parent_type, name='parentFeature', rangeTypeName='String')
child_type = typesystem.create_type(name='example.ChildType', supertypeName=parent_type.name)
typesystem.add_feature(type_=child_type, name='childFeature', rangeTypeName='Integer')
annotation = child_type(parentFeature='parent', childFeature='child')
Changes to the type system are propagated: for example, adding a feature to a parent type makes it available on all of its child types. The type system therefore does not need to be frozen to stay consistent and, unlike in UIMAj, can still be changed after loading.
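One way to picture this propagation is a type that resolves its features through its parent at lookup time, so that features added to the parent later are still visible on the child. The SketchType class below is a hypothetical illustration of that idea and makes no claim about how cassis implements it internally.

```python
class SketchType:
    """Hypothetical type that resolves inherited features lazily."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.own_features = {}

    def add_feature(self, name, range_type):
        self.own_features[name] = range_type

    @property
    def all_features(self):
        # Inherited features are looked up on every access, so later
        # changes to the parent are picked up automatically.
        inherited = self.parent.all_features if self.parent else {}
        return {**inherited, **self.own_features}


parent = SketchType("example.ParentType")
child = SketchType("example.ChildType", parent=parent)
child.add_feature("childFeature", "Integer")

before = sorted(child.all_features)
parent.add_feature("parentFeature", "String")  # added after the child exists
after = sorted(child.all_features)

print(before)
print(after)
```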
Sofa support
A Sofa represents some form of an unstructured artifact that is processed in a UIMA pipeline. It contains for instance the document text. Currently, new Sofas can be created. This is automatically done when creating a new view. Basic properties of the Sofa can be read and written:
cas = Cas()
cas.sofa_string = "Joe waited for the train . The train was late ."
cas.sofa_mime = "text/plain"
print(cas.sofa_string)
print(cas.sofa_mime)
Managing views
A view into a CAS contains a subset of feature structures and annotations. One view corresponds to exactly one Sofa. It can also be used to query and alter information about the Sofa, e.g. the document text. Annotations added to one view are not visible in another view. Views can be created and changed. A view has the same methods and attributes as a Cas.
from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

Token = typesystem.get_type('cassis.Token')

# This automatically creates the view `_InitialView`
cas = Cas()
cas.sofa_string = "I like cheese ."

cas.add_annotations([
    Token(begin=0, end=1),
    Token(begin=2, end=6),
    Token(begin=7, end=13),
    Token(begin=14, end=15)
])

print([x.get_covered_text() for x in cas.select_all()])

# Create a new view and work on it.
view = cas.create_view('testView')
view.sofa_string = "I like blackcurrant ."

view.add_annotations([
    Token(begin=0, end=1),
    Token(begin=2, end=6),
    Token(begin=7, end=19),
    Token(begin=20, end=21)
])

print([x.get_covered_text() for x in view.select_all()])
Merging type systems
Sometimes, it is desirable to merge two type systems. With cassis, this can be achieved via the merge_typesystems function. The detailed rules of merging can be found here.
from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

ts = merge_typesystems([typesystem, load_dkpro_core_typesystem()])
Type checking
For simplicity, no type checking is performed when adding annotations. In order to check types, call the cas.typecheck() method. Currently, it only checks whether elements in uima.cas.FSArray or uima.cas.FSList adhere to the specified elementType.
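The kind of check described above can be sketched independently of cassis. The sketch assumes that an element is acceptable when its type equals the declared elementType or is a subtype of it. That reading is an assumption, and the SUPERTYPES table, is_subtype and check_elements are hypothetical names used only for illustration.

```python
# Hypothetical type hierarchy: type name -> supertype name
SUPERTYPES = {
    "example.ChildType": "example.ParentType",
    "example.ParentType": "uima.cas.TOP",
    "example.Other": "uima.cas.TOP",
}


def is_subtype(type_name, expected):
    """Return True if `type_name` equals `expected` or inherits from it."""
    current = type_name
    while current is not None:
        if current == expected:
            return True
        current = SUPERTYPES.get(current)
    return False


def check_elements(element_type_names, expected):
    """Return (index, type name) for every element violating `expected`."""
    return [
        (i, name)
        for i, name in enumerate(element_type_names)
        if not is_subtype(name, expected)
    ]


elements = ["example.ChildType", "example.ParentType", "example.Other"]
errors = check_elements(elements, "example.ParentType")
print(errors)
```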
DKPro Core Integration
A CAS using the DKPro Core Type System can be created via
from cassis import *

cas = Cas(typesystem=load_dkpro_core_typesystem())

for t in cas.typesystem.get_types():
    print(t)
Miscellaneous
If feature names clash with Python magic variables
If your type system defines a feature called self or type, then it will be made available as a member variable self_ or type_ on the respective type:
from cassis import *
typesystem = TypeSystem()
ExampleType = typesystem.create_type(name='example.Type')
typesystem.add_feature(type_=ExampleType, name='self', rangeTypeName='String')
typesystem.add_feature(type_=ExampleType, name='type', rangeTypeName='String')
annotation = ExampleType(self_="Test string1", type_="Test string2")
print(annotation.self_)
print(annotation.type_)
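The renaming rule itself is simple enough to state as a small function. The sketch below only covers the two names mentioned above. The RESERVED set and the attribute_name helper are hypothetical and not part of the cassis API.

```python
# Feature names that this README says are exposed with a trailing underscore
RESERVED = {"self", "type"}


def attribute_name(feature_name):
    """Map a feature name to the attribute name used on the annotation."""
    if feature_name in RESERVED:
        return feature_name + "_"
    return feature_name


for name in ["self", "type", "pos"]:
    print(name, "->", attribute_name(name))
```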
Leniency
If the type for a feature structure is not found in the type system, an exception is raised by default. If you want to ignore these kinds of errors, you can pass lenient=True to the Cas constructor or to load_cas_from_xmi.
Large XMI files
If you try to parse large XMI files and get an error message like XMLSyntaxError: internal error: Huge input lookup, then you can disable this security check by passing trusted=True to your calls to load_cas_from_xmi.
Development
The required dependencies are managed by pip. A virtual environment containing all needed packages for development and production can be created and activated by
virtualenv venv --python=python3 --no-site-packages
source venv/bin/activate
pip install -e ".[test, dev, doc]"
The tests can be run in the current environment by invoking
make test
or in a clean environment via
tox