dkpro-cassis·PyPI

UIMA CAS processing library in Python

Maintainers

jcklie rec

Project description

DKPro cassis (pronunciation: [ka.sis]) provides a pure-Python implementation of the Common Analysis System (CAS) as defined by the UIMA framework. The CAS is a data structure representing an object to be enriched with annotations (the so-called Subject of Analysis, or SofA for short).

This library enables the creation and manipulation of annotated documents (CAS objects) and their associated type systems, as well as loading and saving them in the CAS XMI XML representation or the CAS JSON representation in Python programs. In particular, this can ease the integration of Python-based Natural Language Processing libraries (e.g. spaCy or NLTK) and Machine Learning libraries (e.g. scikit-learn or Keras) into UIMA-based text analysis workflows.

An example of cassis in action is the spaCy recommender for INCEpTION, which wraps the spaCy NLP library as a web service that can be used together with the INCEpTION text annotation platform to generate annotation suggestions automatically.

Features

Currently supported features are:

Text SofAs
Deserializing/serializing UIMA CAS from/to XMI
Deserializing/serializing UIMA CAS from/to JSON
Deserializing/serializing type systems from/to XML
Selecting annotations, selecting covered annotations, adding annotations
Type inheritance
Multiple SofA support
Type system can be changed after loading
Primitive and reference features and arrays of primitives and references

Some features are still under development, e.g.

Proper type checking
XML/XMI schema validation

Installation

To install the package with pip, just run

pip install dkpro-cassis

Usage

Example CAS XMI and type system files can be found under tests/test_files.

Reading a CAS file

From XMI: A CAS can be deserialized from the UIMA CAS XMI (XML 1.0) format by reading from either a file or a string using load_cas_from_xmi.

from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open('cas.xmi', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

From JSON: The UIMA JSON CAS format is also supported and can be loaded using load_cas_from_json. Most UIMA JSON CAS files come with an embedded typesystem, so it is not necessary to specify one.

from cassis import *

with open('cas.json', 'rb') as f:
    cas = load_cas_from_json(f)

Writing a CAS file

To XMI: A CAS can be serialized to XMI using cas.to_xmi(), which either writes to a file or returns the result as a string.

from cassis import *

# Returned as a string
xmi = cas.to_xmi()

# Written to file
cas.to_xmi("my_cas.xmi")

To JSON: A CAS can also be written to JSON using cas.to_json().

from cassis import *

# Returned as a string
json_string = cas.to_json()

# Written to file
cas.to_json("my_cas.json")

Creating a CAS

A CAS (Common Analysis System) object typically represents a (text) document. When using cassis, you will most often read existing CAS files, modify them, and then write them out again. But you can also create CAS objects from scratch, e.g. if you want to convert some data into a CAS object in order to create a pre-annotated text. If you do not have a pre-defined typesystem to work with, you will have to define one.

from cassis import *

typesystem = TypeSystem()

cas = Cas(
    sofa_string = "Joe waited for the train . The train was late .",
    document_language = "en",
    typesystem = typesystem)

print(cas.sofa_string)
print(cas.sofa_mime)
print(cas.document_language)

Adding annotations

Note: type names used below are examples only. The actual CAS files you will be dealing with will use other names! You can get a list of the types using cas.typesystem.get_types().

Given a type system with a type cassis.Token that has an id and a pos feature, annotations can be added as follows:

from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open('cas.xmi', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

Token = typesystem.get_type('cassis.Token')

tokens = [
    Token(begin=0, end=3, id='0', pos='NNP'),
    Token(begin=4, end=10, id='1', pos='VBD'),
    Token(begin=11, end=14, id='2', pos='IN'),
    Token(begin=15, end=18, id='3', pos='DT'),
    Token(begin=19, end=24, id='4', pos='NN'),
    Token(begin=25, end=26, id='5', pos='.'),
]

for token in tokens:
    cas.add(token)
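
The begin and end values in the example above are character offsets into the Sofa string, with end being exclusive. When converting pre-tokenized text, such offsets can be computed instead of typed by hand. The following is a minimal sketch in plain Python that assumes whitespace-separated tokens; the helper name token_offsets is hypothetical and not part of cassis.

```python
def token_offsets(text):
    """Yield (begin, end, token) for each whitespace-separated token in text."""
    position = 0
    for token in text.split():
        # Search from the end of the previous token so repeated words resolve correctly
        begin = text.index(token, position)
        end = begin + len(token)  # the end offset is exclusive
        position = end
        yield begin, end, token


sofa_string = "Joe waited for the train ."
offsets = list(token_offsets(sofa_string))

for begin, end, token in offsets:
    # Each offset pair slices the original token back out of the text
    assert sofa_string[begin:end] == token
    print(begin, end, token)
```

Each resulting pair could then be passed on as the begin and end arguments of an annotation such as Token(begin=..., end=...).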

Selecting annotations

from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open('cas.xmi', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

for sentence in cas.select('cassis.Sentence'):
    for token in cas.select_covered('cassis.Token', sentence):
        print(token.get_covered_text())

        # Annotation values can be accessed as properties
        print('Token: begin={0}, end={1}, id={2}, pos={3}'.format(token.begin, token.end, token.id, token.pos))
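
As a rough mental model of what select_covered does in the loop above, the containment test can be sketched in plain Python. This is not the cassis implementation; the Span tuple and the covered helper are hypothetical names used only for illustration, and the sketch assumes that "covered" means lying fully inside the span of the covering annotation.

```python
from collections import namedtuple

# Hypothetical stand-in for an annotation: character offsets plus a label
Span = namedtuple("Span", ["begin", "end", "label"])


def covered(candidates, covering):
    """Return the candidates lying fully inside covering, in document order."""
    inside = [
        s for s in candidates
        if covering.begin <= s.begin and s.end <= covering.end
    ]
    return sorted(inside, key=lambda s: (s.begin, s.end))


# Offsets refer to "Joe waited for the train . The train was late ."
sentences = [Span(0, 26, "s1"), Span(27, 47, "s2")]
tokens = [
    Span(0, 3, "Joe"),
    Span(4, 10, "waited"),
    Span(27, 30, "The"),
    Span(31, 36, "train"),
]

first = [t.label for t in covered(tokens, sentences[0])]
second = [t.label for t in covered(tokens, sentences[1])]
print(first, second)
```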

Getting and setting (nested) features

If you want to access a feature but only have its name as a string, or if you have nested feature structures (e.g. a feature structure with a feature a that has a feature b that has a feature c, some of which can be None), you can use the following:

fs.get("var_name") # Or
fs["var_name"]

Or in the nested case,

fs.get("a.b.c")
fs["a.b.c"]

If a, b, or c is None, this returns None instead of throwing an error.

Another example would be a StringList containing ["foo", "bar", "baz"]:

assert lst.get("head") == "foo"
assert lst.get("tail.head") == "bar"
assert lst.get("tail.tail.head") == "baz"
assert lst.get("tail.tail.tail.head") is None
assert lst.get("tail.tail.tail.tail.head") is None

The same goes for setting:

# Functional
lst.set("head", "new_foo")
lst.set("tail.head", "new_bar")
lst.set("tail.tail.head", "new_baz")

assert lst.get("head") == "new_foo"
assert lst.get("tail.head") == "new_bar"
assert lst.get("tail.tail.head") == "new_baz"

# Bracket access
lst["head"] = "newer_foo"
lst["tail.head"] = "newer_bar"
lst["tail.tail.head"] = "newer_baz"

assert lst["head"] == "newer_foo"
assert lst["tail.head"] == "newer_bar"
assert lst["tail.tail.head"] == "newer_baz"
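
To illustrate the idea behind dotted feature paths, here is a minimal plain-Python sketch that does not use cassis. The Node class and the get_path and set_path helpers are hypothetical, and the behavior when setting through a missing intermediate feature (raising an error) is an assumption of this sketch, not a statement about cassis.

```python
class Node:
    """Hypothetical stand-in for a linked-list feature structure with head/tail."""

    def __init__(self, head, tail=None):
        self.head = head
        self.tail = tail


def get_path(obj, path):
    """Follow a dotted path; return None as soon as a step is missing or None."""
    for name in path.split("."):
        if obj is None:
            return None
        obj = getattr(obj, name, None)
    return obj


def set_path(obj, path, value):
    """Set the last feature on a dotted path; all parents must already exist."""
    *parents, last = path.split(".")
    target = get_path(obj, ".".join(parents)) if parents else obj
    if target is None:
        raise AttributeError(f"cannot set {path!r}: an intermediate feature is None")
    setattr(target, last, value)


lst = Node("foo", Node("bar", Node("baz")))
print(get_path(lst, "tail.head"))
print(get_path(lst, "tail.tail.tail.head"))

set_path(lst, "tail.tail.head", "new_baz")
print(get_path(lst, "tail.tail.head"))
```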

Creating types and adding features

from cassis import *
from cassis.typesystem import TYPE_NAME_INTEGER, TYPE_NAME_STRING

typesystem = TypeSystem()

parent_type = typesystem.create_type(name='example.ParentType')
typesystem.create_feature(domainType=parent_type, name='parentFeature', rangeType=TYPE_NAME_STRING)

child_type = typesystem.create_type(name='example.ChildType', supertypeName=parent_type.name)
typesystem.create_feature(domainType=child_type, name='childFeature', rangeType=TYPE_NAME_INTEGER)

# childFeature is declared as an integer, so it is given an integer value
annotation = child_type(parentFeature='parent', childFeature=42)

When new features are added, the changes are propagated: for example, adding a feature to a parent type makes it available to its child types. The type system therefore does not need to be frozen for consistency and, unlike in UIMAj, can be changed even after loading.
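
One way to picture this propagation is a type that resolves its inherited features at lookup time instead of copying them when it is created. The sketch below is plain Python and only a conceptual model; SketchType and its methods are hypothetical names, and cassis may implement propagation differently.

```python
class SketchType:
    """Hypothetical type that looks up inherited features at query time."""

    def __init__(self, name, supertype=None):
        self.name = name
        self.supertype = supertype
        self.own_features = {}

    def add_feature(self, name, range_type):
        self.own_features[name] = range_type

    def all_features(self):
        """Merge features along the supertype chain; nearer types win."""
        inherited = self.supertype.all_features() if self.supertype else {}
        return {**inherited, **self.own_features}


parent = SketchType("example.ParentType")
child = SketchType("example.ChildType", supertype=parent)
child.add_feature("childFeature", "uima.cas.Integer")

# A feature added to the parent after the child exists is visible immediately
parent.add_feature("parentFeature", "uima.cas.String")
print(sorted(child.all_features()))
```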

Sofa support

A Sofa represents some form of an unstructured artifact that is processed in a UIMA pipeline. It contains for instance the document text. Currently, new Sofas can be created. This is automatically done when creating a new view. Basic properties of the Sofa can be read and written:

from cassis import *

cas = Cas(
    sofa_string = "Joe waited for the train . The train was late .",
    document_language = "en")

print(cas.sofa_string)
print(cas.sofa_mime)
print(cas.document_language)

Array support

Array feature values are not simply Python arrays, but they are wrapped in a feature structure of a UIMA array type such as uima.cas.FSArray.

from cassis import *
from cassis.typesystem import TYPE_NAME_ANNOTATION, TYPE_NAME_FS_ARRAY

# Setting up an annotation type with an array feature containing
# references to other annotations
typesystem = TypeSystem()
ArrayHolder = typesystem.create_type(name='example.ArrayHolder')
typesystem.create_feature(domainType=ArrayHolder, name='values', rangeType=TYPE_NAME_FS_ARRAY)

cas = Cas(typesystem=typesystem)

# Populating the document with an annotation that contains references to another annotation in its array feature
Annotation = cas.typesystem.get_type(TYPE_NAME_ANNOTATION)
FSArray = cas.typesystem.get_type(TYPE_NAME_FS_ARRAY)
ann = Annotation(begin=0, end=1)
cas.add(ann)
holder = ArrayHolder(values=FSArray(elements=[ann, ann, ann]))
cas.add(holder)

# Reading the elements from the array feature
for e in holder.values.elements:
    print(e)

Managing views

A view into a CAS contains a subset of feature structures and annotations. One view corresponds to exactly one Sofa. It can also be used to query and alter information about the Sofa, e.g. the document text. Annotations added to one view are not visible in another view. Views can be created and changed. A view has the same methods and attributes as a Cas.

from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)
Token = typesystem.get_type('cassis.Token')

# This automatically creates the view `_InitialView`.
# The typesystem is passed in so that the CAS knows the Token type.
cas = Cas(typesystem=typesystem)
cas.sofa_string = "I like cheese ."

cas.add_all([
    Token(begin=0, end=1),
    Token(begin=2, end=6),
    Token(begin=7, end=13),
    Token(begin=14, end=15)
])

print([x.get_covered_text() for x in cas.select_all()])

# Create a new view and work on it.
view = cas.create_view('testView')
view.sofa_string = "I like blackcurrant ."

view.add_all([
    Token(begin=0, end=1),
    Token(begin=2, end=6),
    Token(begin=7, end=19),
    Token(begin=20, end=21)
])

print([x.get_covered_text() for x in view.select_all()])
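
The isolation between views can be modeled in a few lines of plain Python. This is a conceptual sketch only: SketchCas and SketchView are hypothetical classes, not cassis types, and they ignore the type system entirely. Each view keeps its own Sofa string and its own annotation list, so the same offsets yield different text in different views.

```python
class SketchView:
    """Hypothetical view: one Sofa string plus the annotations indexed in it."""

    def __init__(self, name):
        self.name = name
        self.sofa_string = None
        self.annotations = []  # (begin, end) pairs, private to this view

    def add_all(self, spans):
        self.annotations.extend(spans)

    def covered_texts(self):
        return [self.sofa_string[b:e] for b, e in self.annotations]


class SketchCas:
    """Hypothetical CAS holding named views; the initial view always exists."""

    def __init__(self):
        self.views = {"_InitialView": SketchView("_InitialView")}

    def create_view(self, name):
        if name in self.views:
            raise ValueError(f"view {name!r} already exists")
        self.views[name] = SketchView(name)
        return self.views[name]


cas = SketchCas()
initial = cas.views["_InitialView"]
initial.sofa_string = "I like cheese ."
initial.add_all([(0, 1), (2, 6), (7, 13), (14, 15)])

view = cas.create_view("testView")
view.sofa_string = "I like blackcurrant ."
view.add_all([(0, 1), (2, 6), (7, 19), (20, 21)])

print(initial.covered_texts())
print(view.covered_texts())
```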

Merging type systems

Sometimes, it is desirable to merge two type systems. With cassis, this can be achieved via the merge_typesystems function. The detailed rules of merging can be found here.

from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

ts = merge_typesystems([typesystem, load_dkpro_core_typesystem()])

Type checking

For simplicity, no type checking is performed when annotations are added. To check types, call the cas.typecheck() method. Currently, it only checks whether the elements of a uima.cas.FSArray adhere to the specified elementType.
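
The kind of check described here can be sketched in plain Python. The example below does not call cassis; the SUPERTYPES table, the example.* type names, and both helpers are hypothetical, and the sketch assumes that an element whose type is a subtype of the element type is acceptable.

```python
# Hypothetical supertype table: type name -> parent type name
SUPERTYPES = {
    "example.Token": "uima.tcas.Annotation",
    "example.Sentence": "uima.tcas.Annotation",
    "uima.tcas.Annotation": "uima.cas.AnnotationBase",
    "uima.cas.AnnotationBase": "uima.cas.TOP",
    "uima.cas.TOP": None,
}


def is_subtype(type_name, expected):
    """Walk up the supertype chain looking for the expected type."""
    while type_name is not None:
        if type_name == expected:
            return True
        type_name = SUPERTYPES.get(type_name)
    return False


def check_array(element_types, element_type):
    """Return one message per array element that does not fit element_type."""
    return [
        f"element {i}: {t} is not a {element_type}"
        for i, t in enumerate(element_types)
        if not is_subtype(t, element_type)
    ]


errors = check_array(
    ["example.Token", "example.Sentence", "example.Token"],
    "example.Token",
)
print(errors)
```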

DKPro Core Integration

A CAS using the DKPro Core Type System can be created via

from cassis import *

cas = Cas(typesystem=load_dkpro_core_typesystem())

for t in cas.typesystem.get_types():
    print(t)

Miscellaneous

If feature names clash with Python magic variables

If your type system defines a feature called self or type, it is made available as the member variable self_ or type_ on the respective type:

from cassis import *
from cassis.typesystem import *

typesystem = TypeSystem()

ExampleType = typesystem.create_type(name='example.Type')
typesystem.create_feature(domainType=ExampleType, name='self', rangeType=TYPE_NAME_STRING)
typesystem.create_feature(domainType=ExampleType, name='type', rangeType=TYPE_NAME_STRING)

annotation = ExampleType(self_="Test string1", type_="Test string2")

print(annotation.self_)
print(annotation.type_)

Leniency

If the type of a feature structure is not found in the typesystem, an exception is raised by default. To ignore this kind of error, pass lenient=True to the Cas constructor or to load_cas_from_xmi.

Large XMI files

If you try to parse large XMI files and get an error message like XMLSyntaxError: internal error: Huge input lookup, then you can disable this security check by passing trusted=True to your calls to load_cas_from_xmi.

Citing & Authors

If you find this repository helpful, feel free to cite

@software{klie2020_cassis,
  author       = {Jan-Christoph Klie and
                  Richard Eckart de Castilho},
  title        = {DKPro Cassis - Reading and Writing UIMA CAS Files in Python},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.3994108},
  url          = {https://github.com/dkpro/dkpro-cassis}
}

Development

The required dependencies are managed by pip. A virtual environment containing all needed packages for development and production can be created and activated by

virtualenv venv --python=python3
source venv/bin/activate
pip install -e ".[test, dev, doc]"

The tests can be run in the current environment by invoking

make test

or in a clean environment via

tox

Release

Make sure all issues for the milestone are completed; otherwise, move them to the next milestone.
Check out the main branch.
Bump the version in pyproject.toml to a stable one, e.g. __version__ = "0.6.0", commit and push, and wait until the build has completed. An example commit message would be No issue. Release 0.6.0
Create a tag for that version, e.g. via git tag v0.6.0, and push the tags via git push --tags. Pushing a tag triggers the release to PyPI.
Bump the version in pyproject.toml to the next development version, e.g. 0.7.0-dev, then commit and push that. An example commit message would be No issue. Bump version after release
Once the build has completed and PyPI has accepted the new version, go to the GitHub release and write the changelog based on the issues in the respective milestone.
Create a new milestone for the next version.

Release history

Version	Release date
0.10.1 (this version)	Mar 21, 2025
0.10.0	Mar 18, 2025
0.9.1	Feb 29, 2024
0.9.1.dev0 (pre-release)	Feb 29, 2024
0.9.0	Feb 4, 2024
0.8.0	Oct 5, 2023
0.7.6	Apr 30, 2023
0.7.5	Feb 9, 2023
0.7.4	Jan 31, 2023
0.7.3	Oct 25, 2022
0.7.2	Jul 8, 2022
0.7.1	Mar 25, 2022
0.7.0	Dec 17, 2021
0.6.1	Sep 29, 2021
0.6.0	Sep 29, 2021
0.5.3	Jun 30, 2021
0.5.2	Apr 12, 2021
0.5.1	Feb 3, 2021
0.5.0	Nov 29, 2020
0.4.0	Oct 11, 2020
0.3.0	Aug 21, 2020
0.2.9	Feb 21, 2020
0.2.8	Feb 18, 2020
0.2.7	Dec 15, 2019
0.2.6	Dec 8, 2019
0.2.5	Nov 27, 2019
0.2.4	Nov 21, 2019
0.2.3	Nov 11, 2019
0.2.3.dev0 (pre-release)	Nov 11, 2019
0.2.2	Nov 2, 2019
0.2.1	Sep 9, 2019
0.2.0rc2 (pre-release)	Jul 25, 2019
0.2.0rc1 (pre-release)	Jul 17, 2019
0.1.1	Dec 6, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dkpro_cassis-0.10.1.tar.gz (124.3 kB)

Uploaded Mar 21, 2025 Source

Built Distribution

dkpro_cassis-0.10.1-py3-none-any.whl (62.3 kB)

Uploaded Mar 21, 2025 Python 3

File details

Details for the file dkpro_cassis-0.10.1.tar.gz.

File metadata

Download URL: dkpro_cassis-0.10.1.tar.gz
Upload date: Mar 21, 2025
Size: 124.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for dkpro_cassis-0.10.1.tar.gz
Algorithm	Hash digest
SHA256	`bf594bc4f65997e41206d71d0706456ca3878f8bf5adba585d883d9172d1d829`
MD5	`8e2c4e8c7008b88b5fd1c0da5b5e21f9`
BLAKE2b-256	`692c2a105544d9418098f7d45b8876935578c5e1a8b3f28e0288d8200ffea3e6`

See more details on using hashes here.
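
One way to use the digests listed above is to compare them with a locally computed hash of the downloaded file. The following is a minimal sketch using only the Python standard library; the helper name sha256_of is illustrative, and the path assumes the archive was downloaded to the current directory.

```python
import hashlib
import os

# SHA256 digest listed above for dkpro_cassis-0.10.1.tar.gz
EXPECTED_SHA256 = "bf594bc4f65997e41206d71d0706456ca3878f8bf5adba585d883d9172d1d829"


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


path = "dkpro_cassis-0.10.1.tar.gz"  # assumed download location
if os.path.exists(path):
    actual = sha256_of(path)
    print("OK" if actual == EXPECTED_SHA256 else f"MISMATCH: {actual}")
else:
    print(f"{path} not found; download it first")
```

Reading in chunks keeps memory use constant regardless of the file size.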

Provenance

The following attestation bundles were made for dkpro_cassis-0.10.1.tar.gz:

Publisher: publish_to_pypi.yml on dkpro/dkpro-cassis

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dkpro_cassis-0.10.1.tar.gz
- Subject digest: bf594bc4f65997e41206d71d0706456ca3878f8bf5adba585d883d9172d1d829
- Sigstore transparency entry: 185974767
- Sigstore integration time: Mar 21, 2025
Source repository:
- Permalink: dkpro/dkpro-cassis@b8051986e3744a5bb2d067f69d03112382a1b291
- Branch / Tag: refs/tags/v.0.10.1
- Owner: https://github.com/dkpro
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_to_pypi.yml@b8051986e3744a5bb2d067f69d03112382a1b291
- Trigger Event: push

File details

Details for the file dkpro_cassis-0.10.1-py3-none-any.whl.

File metadata

Download URL: dkpro_cassis-0.10.1-py3-none-any.whl
Upload date: Mar 21, 2025
Size: 62.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for dkpro_cassis-0.10.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`544acda1f948ceba6f0488371585fb74cf8e5541ae9edec90cf4a904a23eb3e5`
MD5	`bda6aee8ef6cb6b10c46cc46e7d1e0be`
BLAKE2b-256	`11cb669877010a958fad494b48490bc2bdcfd28840fa5db00ef1cd1c1cafc577`

See more details on using hashes here.

Provenance

The following attestation bundles were made for dkpro_cassis-0.10.1-py3-none-any.whl:

Publisher: publish_to_pypi.yml on dkpro/dkpro-cassis

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dkpro_cassis-0.10.1-py3-none-any.whl
- Subject digest: 544acda1f948ceba6f0488371585fb74cf8e5541ae9edec90cf4a904a23eb3e5
- Sigstore transparency entry: 185974768
- Sigstore integration time: Mar 21, 2025
Source repository:
- Permalink: dkpro/dkpro-cassis@b8051986e3744a5bb2d067f69d03112382a1b291
- Branch / Tag: refs/tags/v.0.10.1
- Owner: https://github.com/dkpro
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_to_pypi.yml@b8051986e3744a5bb2d067f69d03112382a1b291
- Trigger Event: push
