Simple Annotation Framework

Simple Annotation Framework (SAF)

The Simple Annotation Framework (SAF) is a lightweight Python library for annotating text data. It provides a simple and flexible way to create, manipulate, and export annotations in various formats.

SAF is built on a minimalistic data model, accessible through its API. The model is flexible enough to cover most types of linguistic annotation and can also store other data associated with language items (e.g., statistics, data sources, schemas).
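To make the idea of a minimal annotation data model concrete, here is a small standard-library sketch. The class names mirror the attributes used in the examples on this page (`sentences`, `tokens`, `surface`, `annotations`), but the classes themselves are hypothetical stand-ins written for illustration, not SAF's own implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Token:
    surface: str
    # Free-form annotations keyed by name, e.g. {"UPOS": "NOUN"}.
    annotations: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)
    annotations: Dict[str, Any] = field(default_factory=dict)

    @property
    def surface(self) -> str:
        # Naive reconstruction: join token surfaces with single spaces.
        return " ".join(tok.surface for tok in self.tokens)


@dataclass
class Document:
    sentences: List[Sentence] = field(default_factory=list)
    # Document-level data (sources, statistics, schemas) fits here too.
    annotations: Dict[str, Any] = field(default_factory=dict)


sentence = Sentence(tokens=[Token(w) for w in ["I", "have", "no", "clue", "."]])
sentence.tokens[3].annotations["UPOS"] = "NOUN"  # token-level annotation
doc = Document(sentences=[sentence])
doc.annotations["source"] = "example"  # document-level metadata

print(doc.sentences[0].surface)
print(doc.sentences[0].tokens[3].annotations)
```

Every level carries its own annotation dictionary, so the same structure can hold linguistic labels on tokens and bookkeeping data on sentences or documents.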

Installation

To install SAF, you can use pip:

pip install saf-nlp

Usage

Importing Text Data

SAF provides importers for several annotated text data formats, including plain text, CoNLL, and WebAnno.

Plain text

from saf.importers.plain import PlainTextImporter
# NLTK's tokenizers need the Punkt data, e.g. nltk.download("punkt") (or "punkt_tab" on recent NLTK releases).
from nltk.tokenize import sent_tokenize, word_tokenize

plain_doc = """
They buy and sell books.
I have no clue.
"""

# Import document
plain_importer = PlainTextImporter(sent_tokenize, word_tokenize)
doc = plain_importer.import_document(plain_doc)

print(len(doc.sentences))  # Number of sentences in the document.
print([tok.surface for tok in doc.sentences[1].tokens])  # Listing tokens for the second sentence in the document.
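The importer above is constructed from two callables: a sentence splitter and a word tokenizer. As a sketch of what such callables look like without NLTK, the two regex-based functions below follow the same call shape (string in, list of strings out). The function names are hypothetical, and whether PlainTextImporter accepts arbitrary callables like these is an assumption this page does not confirm.

```python
import re
from typing import List


def simple_sent_tokenize(text: str) -> List[str]:
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def simple_word_tokenize(sentence: str) -> List[str]:
    """Split a sentence into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


plain_doc = """
They buy and sell books.
I have no clue.
"""

sentences = simple_sent_tokenize(plain_doc)
tokens = [simple_word_tokenize(s) for s in sentences]

print(sentences)
print(tokens)
```

These rules are far cruder than NLTK's Punkt model (abbreviations such as "e.g." would be split incorrectly), so they are only suitable for quick experiments.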

CoNLL

from saf import Document
from saf.constants import annotation
from saf.importers.conll import CoNLLImporter

conll_doc = """
# sent_id = 1
# text = They buy and sell books.
1   They     they    PRON    PRP    Case=Nom|Number=Plur               2   nsubj   2:nsubj|4:nsubj   _
2   buy      buy     VERB    VBP    Number=Plur|Person=3|Tense=Pres    0   root    0:root            _
3   and      and     CCONJ   CC     _                                  4   cc      4:cc              _
4   sell     sell    VERB    VBP    Number=Plur|Person=3|Tense=Pres    2   conj    0:root|2:conj     _
5   books    book    NOUN    NNS    Number=Plur                        2   obj     2:obj|4:obj       SpaceAfter=No
6   .        .       PUNCT   .      _                                  2   punct   2:punct           _

# sent_id = 2
# text = I have no clue.
1   I       I       PRON    PRP   Case=Nom|Number=Sing|Person=1     2   nsubj   _   _
2   have    have    VERB    VBP   Number=Sing|Person=1|Tense=Pres   0   root    _   _
3   no      no      DET     DT    PronType=Neg                      4   det     _   _
4   clue    clue    NOUN    NN    Number=Sing                       2   obj     _   SpaceAfter=No
5   .       .       PUNCT   .     _                                 2   punct   _   _
"""

conll_importer = CoNLLImporter(field_list=[annotation.LEMMA, annotation.UPOS, annotation.POS])
doc = conll_importer.import_document(conll_doc)

print(len(doc.sentences))  # Number of sentences in the document.
print(doc.sentences[0].surface)  # Surface form of the first sentence in the document.
print([tok.annotations[annotation.UPOS] for tok in doc.sentences[1].tokens]) # All universal POS tags from the second sentence.
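To show how the columns in the sample map to named fields, here is a small standalone parser sketch. It uses the ten column names from the CoNLL-U format and is not SAF's CoNLLImporter; the function name `parse_conllu` is made up for this illustration.

```python
from typing import Dict, List

# The ten CoNLL-U column names, in order.
CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U text into sentences, each a list of {column: value} dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            # Splitting on any whitespace run assumes no value contains a space;
            # the CoNLL-U format itself separates columns with tabs.
            current.append(dict(zip(CONLLU_FIELDS, line.split())))
    if current:
        sentences.append(current)
    return sentences


sample = """
# sent_id = 2
# text = I have no clue.
1   I       I       PRON    PRP   Case=Nom|Number=Sing|Person=1     2   nsubj   _   _
2   have    have    VERB    VBP   Number=Sing|Person=1|Tense=Pres   0   root    _   _
3   no      no      DET     DT    PronType=Neg                      4   det     _   _
4   clue    clue    NOUN    NN    Number=Sing                       2   obj     _   SpaceAfter=No
5   .       .       PUNCT   .     _                                 2   punct   _   _
"""

parsed = parse_conllu(sample)
print([tok["UPOS"] for tok in parsed[0]])
```

Comment lines starting with `#` (such as `sent_id` and `text`) are skipped here; a fuller parser would keep them as sentence-level metadata.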

Annotating Text Data

The saf_datasets library provides various annotated NLP datasets and facilities for automated annotation of your own data.

Exporting Annotated Text Data

SAF provides formatters for different annotation formats:

CoNLL

from saf.importers.plain import PlainTextImporter
from saf.constants import annotation
from nltk.tokenize import sent_tokenize, word_tokenize
from saf.formatters.conll import CoNLLFormatter

plain_doc = """
They buy and sell books.
I have no clue.
"""

# Import document
plain_importer = PlainTextImporter(sent_tokenize, word_tokenize)
doc = plain_importer.import_document(plain_doc)

# Annotate tokens with their position in the sentence (CoNLL token IDs start at 1)
for sent in doc.sentences:
    for i, token in enumerate(sent.tokens, start=1):
        token.annotations[annotation.ID] = str(i)

conll_formatter = CoNLLFormatter(field_list=[annotation.ID])
conll_formatted_doc = conll_formatter.dumps(doc)

print(conll_formatted_doc)
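The exact layout that CoNLLFormatter produces is not shown on this page. As a rough, standalone sketch of column-based export in general, the hypothetical helper below writes one tab-separated row per token, using `_` for missing values and a blank line between sentences, which are the conventions of CoNLL-style files.

```python
from typing import Dict, List


def dump_rows(sentences: List[List[Dict[str, str]]], field_list: List[str]) -> str:
    """Serialise sentences as tab-separated token rows, one blank line between sentences."""
    blocks = []
    for tokens in sentences:
        # Missing fields are written as "_", the CoNLL placeholder for "no value".
        rows = ["\t".join(tok.get(name, "_") for name in field_list) for tok in tokens]
        blocks.append("\n".join(rows))
    return "\n\n".join(blocks) + "\n"


sentences = [
    [{"ID": "1", "FORM": "I"}, {"ID": "2", "FORM": "have"}],
    [{"ID": "1", "FORM": "Yes", "UPOS": "INTJ"}],
]

output = dump_rows(sentences, ["ID", "FORM", "UPOS"])
print(output)
```

Passing the field list explicitly, as the formatter above also does, keeps the column order under the caller's control.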

Working with vocabularies

Vocabulary objects can be used to quickly index and manage symbols in documents or sentence collections. They facilitate vectorization for language model training, especially with label supervision.

from saf import Document
from saf.constants import annotation
from saf.importers.conll import CoNLLImporter
from saf import Vocabulary

conll_doc = """
# sent_id = 1
# text = They buy and sell books.
1   They     they    PRON    PRP    Case=Nom|Number=Plur               2   nsubj   2:nsubj|4:nsubj   _
2   buy      buy     VERB    VBP    Number=Plur|Person=3|Tense=Pres    0   root    0:root            _
3   and      and     CCONJ   CC     _                                  4   cc      4:cc              _
4   sell     sell    VERB    VBP    Number=Plur|Person=3|Tense=Pres    2   conj    0:root|2:conj     _
5   books    book    NOUN    NNS    Number=Plur                        2   obj     2:obj|4:obj       SpaceAfter=No
6   .        .       PUNCT   .      _                                  2   punct   2:punct           _

# sent_id = 2
# text = I have no clue.
1   I       I       PRON    PRP   Case=Nom|Number=Sing|Person=1     2   nsubj   _   _
2   have    have    VERB    VBP   Number=Sing|Person=1|Tense=Pres   0   root    _   _
3   no      no      DET     DT    PronType=Neg                      4   det     _   _
4   clue    clue    NOUN    NN    Number=Sing                       2   obj     _   SpaceAfter=No
5   .       .       PUNCT   .     _                                 2   punct   _   _
"""

conll_importer = CoNLLImporter(field_list=[annotation.LEMMA, annotation.UPOS, annotation.POS])
doc = conll_importer.import_document(conll_doc)

token_vocab = Vocabulary(doc.sentences, lowercase=False)
upos_vocab = Vocabulary(doc.sentences, source="UPOS", lowercase=False)

# Converting sentences to indices for both tokens and annotations
print(token_vocab.to_indices(doc.sentences))
# [[2, 5, 3, 9, 4, 0], [1, 7, 8, 6, 0]]

print(upos_vocab.to_indices(doc.sentences))
# [[3, 5, 0, 5, 2, 4], [3, 5, 1, 2, 4]]

# Retrieving tokens and annotations from indices
print(token_vocab.get_symbol(4))
# books

print(upos_vocab.get_symbol(2))
# NOUN
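The indices printed above are consistent with assigning each distinct symbol its position in sorted (code-point) order. The sketch below reproduces them under that assumption. `SimpleVocabulary` is a hypothetical stand-in that works on plain lists of strings, not SAF's Vocabulary class, and it says nothing about how SAF orders symbols internally.

```python
from typing import Dict, List


class SimpleVocabulary:
    """Bidirectional symbol <-> index mapping over a collection of sequences."""

    def __init__(self, sequences: List[List[str]]):
        # Assumption: indices follow the sorted order of the distinct symbols.
        self._symbols: List[str] = sorted({sym for seq in sequences for sym in seq})
        self._index: Dict[str, int] = {sym: i for i, sym in enumerate(self._symbols)}

    def to_indices(self, sequences: List[List[str]]) -> List[List[int]]:
        return [[self._index[sym] for sym in seq] for seq in sequences]

    def get_symbol(self, index: int) -> str:
        return self._symbols[index]


tokens = [["They", "buy", "and", "sell", "books", "."],
          ["I", "have", "no", "clue", "."]]
upos = [["PRON", "VERB", "CCONJ", "VERB", "NOUN", "PUNCT"],
        ["PRON", "VERB", "DET", "NOUN", "PUNCT"]]

token_vocab = SimpleVocabulary(tokens)
upos_vocab = SimpleVocabulary(upos)

print(token_vocab.to_indices(tokens))
print(upos_vocab.to_indices(upos))
print(token_vocab.get_symbol(4), upos_vocab.get_symbol(2))
```

Because uppercase letters sort before lowercase ones, "I" and "They" receive lower indices than "and", which matches the `lowercase=False` example above.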

License

This project is licensed under the GNU General Public License Version 3 - see the LICENSE file for details.

Release history

0.6.1: May 14, 2025
0.6.0: May 14, 2025
0.5.5: Feb 15, 2025
0.5.2: Feb 3, 2025 (this version)
0.5.1: May 9, 2024
0.5.0: May 8, 2024
