UIMA CAS processing library in Python

DKPro cassis (pronunciation: [ka.sis]) is a UIMA CAS utility library in Python. Currently supported features are:

  • Deserializing/serializing UIMA CAS from/to XMI

  • Deserializing/serializing type systems from/to XML

  • Selecting annotations, selecting covered annotations, adding annotations

  • Type inheritance

  • Sofa support

Some features are still under development, e.g.

  • feature encoding as XML elements (right now only XML attributes work)

  • proper type checking

  • XML/XMI schema validation

  • type unmarshalling from string to the actual type specified in the type system

  • reference, array and list features
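
The first limitation above concerns how a feature value is written in XMI: either as an XML attribute on the annotation element or as a child element. The sketch below shows the difference using only the standard library's `xml.etree.ElementTree`. The XML snippet and the `read_features` helper are made up for illustration; they are not taken from cassis or its test files.

```python
import xml.etree.ElementTree as ET

# Illustrative snippet: the first Token stores `pos` as an attribute,
# the second stores it as a child element.
XMI_SNIPPET = """
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cassis="http:///cassis.ecore">
  <cassis:Token xmi:id="1" begin="0" end="3" pos="NNP"/>
  <cassis:Token xmi:id="2" begin="4" end="10"><pos>VBD</pos></cassis:Token>
</xmi:XMI>
"""


def read_features(element):
    """Collect features from plain attributes and from child elements."""
    # Skip namespaced attributes such as xmi:id, which ElementTree
    # exposes as '{namespace}name'.
    features = {k: v for k, v in element.attrib.items() if not k.startswith("{")}
    for child in element:
        features[child.tag] = child.text
    return features


root = ET.fromstring(XMI_SNIPPET.strip())
for token in root:
    print(read_features(token))
```

A reader that only looks at attributes would miss `pos` on the second token, which is the gap the list item describes.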

Installation

To install the package from the master branch using pip, just run

pip install git+https://github.com/dkpro/dkpro-cassis

Usage

Example CAS XMI and type system files can be found under tests/test_files.

Loading a CAS

A CAS can be deserialized from XMI, either from a file or from a string, using load_cas_from_xmi.

from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open('cas.xml', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

Adding annotations

Given a type system with a type cassis.Token that has an id and a pos feature, annotations can be added as follows:

from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open('cas.xml', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

Token = typesystem.get_type('cassis.Token')

tokens = [
    Token(begin=0, end=3, id='0', pos='NNP'),
    Token(begin=4, end=10, id='1', pos='VBD'),
    Token(begin=11, end=14, id='2', pos='IN'),
    Token(begin=15, end=18, id='3', pos='DT'),
    Token(begin=19, end=24, id='4', pos='NN'),
    Token(begin=25, end=26, id='5', pos='.'),
]

for token in tokens:
    cas.add_annotation(token)
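
The begin and end values above are character offsets into the document text, with the end offset exclusive. Typing them by hand is error-prone, so here is a minimal sketch of deriving them, assuming whitespace-separated tokens. The helper `token_offsets` is hypothetical and not part of cassis.

```python
def token_offsets(text):
    """Yield (begin, end, token) for each whitespace-separated token in text.

    Hypothetical helper for illustration; not part of the cassis API.
    """
    position = 0
    for token in text.split():
        # Search from the end of the previous token so repeated words
        # are matched at the right place.
        begin = text.index(token, position)
        end = begin + len(token)  # end offset is exclusive
        yield begin, end, token
        position = end


offsets = list(token_offsets("Joe waited for the train ."))
for begin, end, token in offsets:
    print(begin, end, token)
```

Each resulting pair could then be passed on as `Token(begin=begin, end=end, ...)` instead of hard-coding the numbers.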

Selecting annotations

from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open('cas.xml', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

for sentence in cas.select('cassis.Sentence'):
    for token in cas.select_covered('cassis.Token', sentence):
        print(cas.get_covered_text(token))

        # Annotation values can be accessed as properties
        print('Token: begin={0}, end={1}, id={2}, pos={3}'.format(token.begin, token.end, token.id, token.pos))
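
Conceptually, a covered selection keeps the annotations whose span lies inside the span of the covering annotation. The following is a minimal sketch of that idea in plain Python, assuming containment means `begin >= covering.begin and end <= covering.end`. The `Span` tuple and the `covered_by` function are hypothetical and do not reflect how cassis implements `select_covered` internally.

```python
from collections import namedtuple

# Hypothetical stand-in for an annotation; only the offsets matter here.
Span = namedtuple("Span", ["begin", "end"])


def covered_by(spans, covering):
    """Return the spans that lie completely inside the covering span."""
    return [s for s in spans if s.begin >= covering.begin and s.end <= covering.end]


text = "Joe waited for the train . The train was late ."
sentences = [Span(0, 26), Span(27, 47)]
tokens = [Span(0, 3), Span(4, 10), Span(27, 30), Span(31, 36)]

for sentence in sentences:
    print([text[t.begin:t.end] for t in covered_by(tokens, sentence)])
```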

Creating types and adding features

from cassis import *

typesystem = TypeSystem()

parent_type = typesystem.create_type(name='example.ParentType')
typesystem.add_feature(type_=parent_type, name='parentFeature', rangeTypeName='String')

child_type = typesystem.create_type(name='example.ChildType', supertypeName=parent_type.name)
typesystem.add_feature(type_=child_type, name='childFeature', rangeTypeName='Integer')

annotation = child_type(parentFeature='parent', childFeature='child')

Newly added features are propagated through the type hierarchy: adding a feature to a parent type makes it available on its child types as well. The type system therefore does not need to be frozen to stay consistent.
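
One way to picture this propagation is a type that resolves its features by walking up the supertype chain at lookup time, rather than copying them when the child is created. The sketch below is a simplified model under that assumption. The `SketchType` class is hypothetical and is not the cassis implementation.

```python
class SketchType:
    """Hypothetical, simplified type: features are resolved via the supertype chain."""

    def __init__(self, name, supertype=None):
        self.name = name
        self.supertype = supertype
        self.own_features = {}  # feature name -> range type name

    def add_feature(self, name, range_type_name):
        self.own_features[name] = range_type_name

    @property
    def all_features(self):
        # Inherited features are collected on every access, so a feature
        # added to the parent later is still visible from the child.
        inherited = self.supertype.all_features if self.supertype else {}
        return {**inherited, **self.own_features}


parent = SketchType("example.ParentType")
child = SketchType("example.ChildType", supertype=parent)
child.add_feature("childFeature", "Integer")

# Added after the child was created, yet visible from the child.
parent.add_feature("parentFeature", "String")

print(sorted(child.all_features))
```

Because nothing is copied, there is no stale state to reconcile, which is why a model like this needs no explicit "freeze" step.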

Sofa support

A Sofa represents some form of unstructured artifact that is processed in a UIMA pipeline. It contains, for instance, the document text. Currently, new Sofas can be created; this happens automatically when a new view is created. Basic properties of the Sofa can be read and written:

from cassis import *

cas = Cas()
cas.sofa_string = "Joe waited for the train . The train was late ."
cas.sofa_mime = "text/plain"

print(cas.sofa_string)
print(cas.sofa_mime)

Managing views

A view into a CAS contains a subset of feature structures and annotations. One view corresponds to exactly one Sofa. It can also be used to query and alter information about the Sofa, e.g. the document text. Annotations added to one view are not visible in another view. Views can be created and changed. A view has the same methods and attributes as a Cas.

from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)
Token = typesystem.get_type('cassis.Token')

# This automatically creates the view `_InitialView`
cas = Cas()
cas.sofa_string = "I like cheese ."

cas.add_annotations([
    Token(begin=0, end=1),
    Token(begin=2, end=6),
    Token(begin=7, end=13),
    Token(begin=14, end=15)
])

print([cas.get_covered_text(x) for x in cas.select_all()])

# Create a new view and work on it.
view = cas.create_view('testView')
view.sofa_string = "I like blackcurrant ."

view.add_annotations([
    Token(begin=0, end=1),
    Token(begin=2, end=6),
    Token(begin=7, end=19),
    Token(begin=20, end=21)
])

print([view.get_covered_text(x) for x in view.select_all()])
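
The two `print` calls rely on each view resolving offsets against its own Sofa text. Below is a minimal sketch of that behaviour in plain Python, assuming covered text is simply the slice `sofa_string[begin:end]`. The `SketchView` class is a hypothetical stand-in, not the cassis `Cas`.

```python
class SketchView:
    """Hypothetical stand-in for a view: one text plus its own annotations."""

    def __init__(self, sofa_string):
        self.sofa_string = sofa_string
        self.annotations = []  # (begin, end) pairs local to this view

    def add_annotations(self, spans):
        self.annotations.extend(spans)

    def covered_texts(self):
        # Offsets are resolved against this view's text only.
        return [self.sofa_string[begin:end] for begin, end in self.annotations]


initial = SketchView("I like cheese .")
initial.add_annotations([(0, 1), (2, 6), (7, 13), (14, 15)])

test_view = SketchView("I like blackcurrant .")
test_view.add_annotations([(0, 1), (2, 6), (7, 19), (20, 21)])

print(initial.covered_texts())
print(test_view.covered_texts())
```

The same offsets would yield different text in the other view, which is why annotations are kept per view.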

Development

The required dependencies are managed by pip. A virtual environment containing all needed packages for development and production can be created and activated by

virtualenv venv --python=python3 --no-site-packages
source venv/bin/activate
pip install -e ".[test, dev, doc]"

The tests can be run in the current environment by invoking

make test

or in a clean environment via

tox
