spaCy wrapper for Trankit, a Transformer-based multilingual neural dependency parser with tokenization and NER
spaCy + Trankit
This package wraps the Trankit library so that you can use Trankit models in a spaCy pipeline.
With this wrapper, the following annotations, computed by your pretrained
Trankit pipeline/model, become available:
- Statistical tokenization (reflected in the Doc and its tokens)
- Lemmatization (token.lemma and token.lemma_)
- Part-of-speech tagging (token.tag, token.tag_, token.pos, token.pos_)
- Morphological analysis (token.morph)
- Dependency parsing (token.dep, token.dep_, token.head)
- Named entity recognition (doc.ents, token.ent_type, token.ent_type_, token.ent_iob, token.ent_iob_)
- Sentence segmentation (doc.sents)
- Multiword token preservation for languages such as Arabic and Hebrew via token._.trankit_expanded
⌛️ Installation
As of v0.2.1 spacy-trankit is only compatible with spaCy v3.x.
On Python 3.12, spacy-trankit applies a runtime compatibility patch for the
current trankit dataclass issue in adapter_transformers before creating the
pipeline.
To install the most recent version:
pip install git+https://github.com/imvladikon/spacy-trankit
or from PyPI:
pip install spacy-trankit
📖 Usage & Examples
Load a pretrained Trankit model into a spaCy pipeline:
import spacy_trankit
# Initialize the pipeline
nlp = spacy_trankit.load("en")
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
print(doc.ents)
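If you would rather collect the annotations than print them, the attributes used in the loop above can be gathered into plain dictionaries. The following is a minimal sketch: the helper name token_rows and the FIELDS tuple are illustrative and not part of spacy-trankit, and the stand-in tokens with made-up values are there only so the snippet runs without downloading a model.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List

# String attributes read from each token (the same ones printed above).
FIELDS = ("text", "lemma_", "pos_", "dep_", "ent_type_")


def token_rows(doc: Iterable[Any]) -> List[Dict[str, str]]:
    """Collect the string annotations of each token into a dict (hypothetical helper)."""
    return [{name: getattr(token, name) for name in FIELDS} for token in doc]


# Stand-in tokens so the sketch runs without a model; the values are made up.
fake_doc = [
    SimpleNamespace(text="Hawaii", lemma_="Hawaii", pos_="PROPN", dep_="obl", ent_type_="LOC"),
    SimpleNamespace(text=".", lemma_=".", pos_="PUNCT", dep_="punct", ent_type_=""),
]

rows = token_rows(fake_doc)
print(rows[0]["pos_"])  # PROPN
```

In real use, pass the doc returned by nlp(...) instead of fake_doc; the resulting list of dicts can then be written to JSON or loaded into a table.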
By default, mwt_strategy="auto" expands multiword tokens when the expanded
tokens can be aligned back to the original text without changing doc.text.
Expansions that cannot be represented as substrings of the original text are
kept non-destructive. For example, Arabic and Hebrew clitic expansions can
differ from the surface token, so the spaCy token keeps the original surface
form and stores Trankit's expansion under token._.trankit_expanded.
doc = nlp("ذهبت للبيت اليوم")
for token in doc:
    print(token.text, token._.trankit_expanded)
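The alignment condition described above can be illustrated in plain Python. This is only a sketch of the idea, not spacy-trankit's actual implementation: the helper name align_expansion and the strict left-to-right matching rule are assumptions, and the example inputs are illustrative rather than real Trankit output.

```python
from typing import List, Optional, Tuple


def align_expansion(surface: str, parts: List[str]) -> Optional[List[Tuple[int, int]]]:
    """Return character spans of `parts` inside `surface`, or None.

    Hypothetical helper: an expansion counts as alignable only if its parts,
    concatenated in order, reproduce the surface token exactly.
    """
    spans = []
    cursor = 0
    for part in parts:
        # Each part must start exactly where the previous one ended.
        if not part or not surface.startswith(part, cursor):
            return None
        spans.append((cursor, cursor + len(part)))
        cursor += len(part)
    # Leftover characters mean the expansion does not cover the whole token.
    return spans if cursor == len(surface) else None


# "don't" splits cleanly into consecutive substrings of the surface form.
print(align_expansion("don't", ["do", "n't"]))  # [(0, 2), (2, 5)]
# French "du" expands to "de" + "le", which are not substrings of "du".
print(align_expansion("du", ["de", "le"]))  # None
```

When a check like this returns None, splitting the token would force a change to doc.text, which is the case where the default strategy keeps the surface token and records the expansion on token._.trankit_expanded instead.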
If you always want surface tokens, pass mwt_strategy="preserve". If you need
the previous expanded-token behavior and accept that spaCy may have to replace
the original text with space-separated expanded tokens for unalignable cases,
pass mwt_strategy="expand":
nlp = spacy_trankit.load("ar", mwt_strategy="preserve")
nlp = spacy_trankit.load("ar", mwt_strategy="expand")
Load a model from a local path:
import spacy_trankit
# Initialize the pipeline
nlp = spacy_trankit.load_from_path(name="en", path="./cache")
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
print(doc.ents)
📦 Model downloads
The Trankit release on PyPI fetches its pretrained models from
nlp.uoregon.edu, which is currently unavailable. spacy-trankit bypasses
that broken download path and pulls the same artifacts from Trankit's
HuggingFace mirror (https://huggingface.co/uonlp/trankit) into the local
cache before instantiating the Trankit pipeline. The behaviour is automatic;
no extra setup is needed.
If you mirror the artifacts elsewhere (e.g. for offline / air-gapped use),
point spacy-trankit at it via the SPACY_TRANKIT_MODEL_URL environment
variable. The template understands {version}, {embedding} and {lang}:
export SPACY_TRANKIT_MODEL_URL="https://my-mirror.example.com/trankit/{version}/{embedding}/{lang}.zip"
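The same override can be set from Python before a pipeline is loaded. The sketch below also shows what a filled-in template might look like. How the library substitutes the placeholders internally is not documented here, so the str.format-style expansion, the helper name render_model_url, and the version, embedding and language values are all assumptions for illustration.

```python
import os

# The template from the example above; the mirror host is a placeholder.
TEMPLATE = "https://my-mirror.example.com/trankit/{version}/{embedding}/{lang}.zip"


def render_model_url(template: str, version: str, embedding: str, lang: str) -> str:
    """Fill the three documented placeholders (hypothetical helper)."""
    return template.format(version=version, embedding=embedding, lang=lang)


# Set the variable before loading a pipeline so the override can take effect.
os.environ["SPACY_TRANKIT_MODEL_URL"] = TEMPLATE

# Example values are assumptions; check your mirror's layout for the real ones.
url = render_model_url(
    os.environ["SPACY_TRANKIT_MODEL_URL"],
    version="v1.0.0",
    embedding="xlm-roberta-base",
    lang="english",
)
print(url)
```

Rendering the template by hand like this is a quick way to confirm that the files on your mirror sit at the paths the template will produce.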