Skip to main content

Python library for managing and annotating textual corpus using TextTagLib (TTL) format

Project description

Python library for managing and annotating text corpuses using TextTagLib (TTL) format. Main functions are:

  • Multiple storage formats (text files, JSON files, SQLite databases)
  • TTLIG - A human-friendly intelinear gloss format for linguistic documentation
  • Manipuling transcription files directly in ELAN Annotation Format (eaf)

Installation

texttaglib is availble on PyPI.

pip install texttaglib
# or more explicit
python3 -m pip install texttaglib

Basic usage

>>> from texttaglib import ttl
>>> doc = ttl.Document('mydoc')
>>> sent = doc.new_sent("I am a sentence.")
>>> sent
#1: I am a sentence.
>>> sent.ID
1
>>> sent.text
'I am a sentence.'
>>> sent.import_tokens(["I", "am", "a", "sentence", "."])
>>> >>> sent.tokens
[`I`<0:1>, `am`<2:4>, `a`<5:6>, `sentence`<7:15>, `.`<15:16>]
>>> doc.write_ttl()

The script above will generate this corpus

-rw-rw-r--.  1 tuananh tuananh       0  3月 29 13:10 mydoc_concepts.txt
-rw-rw-r--.  1 tuananh tuananh       0  3月 29 13:10 mydoc_links.txt
-rw-rw-r--.  1 tuananh tuananh      20  3月 29 13:10 mydoc_sents.txt
-rw-rw-r--.  1 tuananh tuananh       0  3月 29 13:10 mydoc_tags.txt
-rw-rw-r--.  1 tuananh tuananh      58  3月 29 13:10 mydoc_tokens.txt

ELAN support

texttaglib library contains a command line tool for converting EAF files into CSV.

python -m texttaglib eaf2csv input_elan_file.eaf -o output_file_name.csv

For more complex analyses, texttaglib Python scripts can be used to extract metadata and annotations from ELAN transcripts, for example:

from texttaglib.elan import parse_eaf_stream

# Test ELAN reader function in texttaglib
with open('./data/test.eaf') as eaf_stream:
    elan = parse_eaf_stream(eaf_stream)

# accessing metadata
print(f"Author: {elan.author} | Date: {elan.date} | Format: {elan.fileformat} | Version: {elan.version}")
print(f"Media file: {elan.media_file}")
print(f"Time units: {elan.time_units}")
print(f"Media URL: {elan.media_url} | MIME type: {elan.mime_type}")
print(f"Media relative URL: {elan.relative_media_url}")

# accessing tiers & annotations
for tier in elan.tiers():
    print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
    for ann in tier.annotations:
        print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts.ts} -- {ann.to_ts.ts}] {ann.value}")

SQLite support

TTL data can be stored in a SQLite database for better corpus analysis. Sample code will be added soon.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

texttaglib-0.1a7.tar.gz (26.4 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page