spaCy wrapper for Trankit, a Transformer-based multilingual neural dependency parser with tokenization and NER

spaCy + Trankit

This package wraps the Trankit library so that you can use Trankit models in a spaCy pipeline.

With this wrapper, the following annotations, computed by your pretrained Trankit pipeline/model, become available on the spaCy Doc:

  • Statistical tokenization (reflected in the Doc and its tokens)
  • Lemmatization (token.lemma and token.lemma_)
  • Part-of-speech tagging (token.tag, token.tag_, token.pos, token.pos_)
  • Morphological analysis (token.morph)
  • Dependency parsing (token.dep, token.dep_, token.head)
  • Named entity recognition (doc.ents, token.ent_type, token.ent_type_, token.ent_iob, token.ent_iob_)
  • Sentence segmentation (doc.sents)
  • Multiword token preservation for languages such as Arabic and Hebrew via token._.trankit_expanded

⌛️ Installation

As of v0.2.1, spacy-trankit is compatible only with spaCy v3.x. On Python 3.12, spacy-trankit applies a runtime compatibility patch for the current Trankit dataclass issue in adapter_transformers before it creates the pipeline. To install the most recent version from GitHub:

pip install git+https://github.com/imvladikon/spacy-trankit

or from PyPI:

pip install spacy-trankit

📖 Usage & Examples

Load a pretrained Trankit model into a spaCy pipeline:

import spacy_trankit

# Initialize the pipeline
nlp = spacy_trankit.load("en")

doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
print(doc.ents)

By default, mwt_strategy="auto" expands a multiword token only when the expanded tokens can be aligned back to the original text without changing doc.text. Expansions that cannot be represented as substrings of the original text are handled non-destructively. For example, Arabic and Hebrew clitic expansions can differ from the surface token, so the spaCy token keeps the original surface form and Trankit's expansion is stored under token._.trankit_expanded.

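# Load an Arabic pipeline (the English pipeline above does not fit this text)
nlp = spacy_trankit.load("ar")
doc = nlp("ذهبت للبيت اليوم")
for token in doc:
    print(token.text, token._.trankit_expanded)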
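
To make the alignment condition concrete, here is a minimal sketch of one way to test whether an expansion can be mapped back onto the surface token. It assumes that "alignable" means the expanded parts appear consecutively in the surface string and cover it completely. That criterion, the function name align_expansion, and the sample expansions are illustrative assumptions, not spacy-trankit's actual implementation or real Trankit output.

```python
from typing import List, Optional, Tuple


def align_expansion(surface: str, parts: List[str]) -> Optional[List[Tuple[int, int]]]:
    """Return (start, end) offsets of each part inside `surface`.

    Returns None when the parts do not tile the surface string exactly,
    i.e. when the expansion cannot be expressed as substrings of the text.
    Hypothetical helper, not part of spacy-trankit.
    """
    spans = []
    cursor = 0
    for part in parts:
        # Each part must start exactly where the previous one ended.
        if not surface.startswith(part, cursor):
            return None
        spans.append((cursor, cursor + len(part)))
        cursor += len(part)
    # Leftover characters mean the parts do not cover the whole token.
    return spans if cursor == len(surface) else None


# The parts are substrings of the surface form, so offsets exist.
print(align_expansion("cannot", ["can", "not"]))

# French "du" is analysed as "de" + "le" in Universal Dependencies;
# neither part is a substring of the surface form, so there is no alignment.
print(align_expansion("du", ["de", "le"]))

# Illustrative Arabic clitic split (not actual Trankit output).
print(align_expansion("للبيت", ["ل", "البيت"]))
```

Under this assumption, only the first case could be split into several spaCy tokens without touching doc.text; the other two would keep the surface token and expose the expansion through token._.trankit_expanded.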

If you always want surface tokens, pass mwt_strategy="preserve". If you need the previous expanded-token behavior and accept that spaCy may have to replace the original text with space-separated expanded tokens for unalignable cases, pass mwt_strategy="expand":

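# Always keep the surface tokens
nlp = spacy_trankit.load("ar", mwt_strategy="preserve")
# Always use the expanded tokens, even if doc.text has to be rewritten
nlp = spacy_trankit.load("ar", mwt_strategy="expand")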
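
The three strategies can be summarised as a small decision function. The sketch below only models the behavior described above; resolve_tokens and its alignable flag are hypothetical names introduced for illustration, and the real library may organise this logic differently.

```python
from typing import List, Tuple

VALID_STRATEGIES = {"auto", "preserve", "expand"}


def resolve_tokens(
    surface: str,
    expanded: List[str],
    alignable: bool,
    strategy: str = "auto",
) -> Tuple[List[str], str]:
    """Return (tokens, text) for one multiword token under a given strategy.

    `alignable` states whether the expansion maps back onto `surface`
    without changing the text. Hypothetical helper, not spacy-trankit API.
    """
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown mwt_strategy: {strategy!r}")
    # "preserve" always keeps the surface token; "auto" does so only
    # when the expansion cannot be aligned.
    if strategy == "preserve" or (strategy == "auto" and not alignable):
        return [surface], surface
    # Alignable expansions leave the original text untouched.
    if alignable:
        return list(expanded), surface
    # "expand" with an unalignable expansion: the text is replaced by
    # the space-separated expanded tokens.
    return list(expanded), " ".join(expanded)


print(resolve_tokens("du", ["de", "le"], alignable=False, strategy="auto"))
print(resolve_tokens("du", ["de", "le"], alignable=False, strategy="expand"))
```

The second call shows the trade-off of "expand": the token list matches Trankit's analysis, but the returned text no longer equals the input.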

Load a model from a local path:

import spacy_trankit

# Initialize the pipeline
nlp = spacy_trankit.load_from_path(name="en", path="./cache")

doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
print(doc.ents)

📦 Model downloads

The Trankit release on PyPI fetches its pretrained models from nlp.uoregon.edu, which is currently unavailable. spacy-trankit bypasses that broken download path and pulls the same artifacts from Trankit's HuggingFace mirror (https://huggingface.co/uonlp/trankit) into the local cache before instantiating the Trankit pipeline. The behaviour is automatic; no extra setup is needed.

If you mirror the artifacts elsewhere (e.g. for offline / air-gapped use), point spacy-trankit at it via the SPACY_TRANKIT_MODEL_URL environment variable. The template understands {version}, {embedding} and {lang}:

export SPACY_TRANKIT_MODEL_URL="https://my-mirror.example.com/trankit/{version}/{embedding}/{lang}.zip"
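
As a rough illustration of how such a template could be expanded, the sketch below fills the three placeholders with Python's str.format. The function resolve_model_url and the sample values for version, embedding and lang are assumptions made for this example; the values spacy-trankit actually substitutes may differ.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "SPACY_TRANKIT_MODEL_URL"


def resolve_model_url(
    lang: str,
    embedding: str,
    version: str,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Expand the mirror template from the environment, if one is set.

    Returns None when the variable is unset or empty, meaning the
    default download location would be used instead.
    Hypothetical helper, not part of spacy-trankit.
    """
    template = env.get(ENV_VAR)
    if not template:
        return None
    return template.format(version=version, embedding=embedding, lang=lang)


# Sample values below are placeholders chosen for illustration only.
fake_env = {
    ENV_VAR: "https://my-mirror.example.com/trankit/{version}/{embedding}/{lang}.zip"
}
print(resolve_model_url("en", "xlm-roberta-base", "v1.0.0", env=fake_env))
```

Passing the environment as a mapping keeps the function easy to exercise without modifying the real process environment.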
