Just Automatic Term Extraction — the definitive Python library for automatic term extraction

JATE — Just Automatic Term Extraction

A Python library for automatic term extraction (ATE) from text corpora. JATE provides 14 classical ATE algorithms, corpus-level statistics, built-in evaluation, and a CLI — all pip-installable with no external services required.

JATE v3 is a complete rewrite of the original Java JATE library (84+ GitHub stars), which was built on Apache Solr and used in academic and industry settings for over a decade. The Python version preserves all 13 original classical algorithms from the Java codebase — with every formula verified line-by-line against the original source — while removing the Solr dependency in favour of a self-contained, pip-installable package. It also adds ensemble voting via reciprocal rank fusion when comparing multiple algorithms. The original Java library is preserved on the legacy/java branch.

Sneak Peek

Try it now — no installation needed

Launch the live demo on Hugging Face Spaces — paste any text, pick from 14 algorithms, and see extracted terms instantly in your browser.

[Image: JATE online demo on Hugging Face Spaces]

Clone the repo for full features locally

The local UI gives you everything the online demo has and more — corpus-level extraction across entire directories, multi-algorithm comparison with a shared NLP pipeline, real-time progress streaming, and full CSV/JSON export. All processing happens on your machine, so there are no size limits and your data stays private.

pip install "jate[server]"
jate ui

[Image: JATE local UI, side-by-side multi-algorithm corpus comparison]

Installation

pip install jate

Or from source:

git clone https://github.com/ziqizhang/jate.git
cd jate
pip install .

Requires Python 3.11+ and a spaCy model:

python -m spacy download en_core_web_sm

Quick start

Single document

import jate

# Extract terms from text (default: C-Value + POS pattern extraction)
result = jate.extract("Your document text here...")

for term in result:
    print(f"{term.string:30s}  score={term.score:.4f}  surfaces={term.surface_forms}")

Corpus-level extraction

import jate

# From a list of texts
result = jate.extract_corpus(
    ["First document...", "Second document..."],
    algorithm="tfidf",
)

# From a directory of text files
result = jate.extract_corpus("path/to/corpus/", algorithm="cvalue")

# Export results
df = result.to_dataframe()
print(result.to_csv())

Compare algorithms

import jate

results = jate.compare(
    ["Doc one...", "Doc two..."],
    algorithms=["cvalue", "tfidf", "rake", "weirdness"],
)

for algo_name, result in results.items():
    print(f"\n{algo_name}: {len(result)} terms")
    for term in list(result)[:5]:
        print(f"  {term.string:30s}  {term.score:.4f}")

For large corpora, NLP processing (spaCy) uses multi-threaded C-level batching, and feature building (adjacent word computation) uses multi-process parallelism automatically.

Evaluation against a gold standard

import jate

result = jate.extract_corpus(docs, algorithm="cvalue")

evaluator = jate.Evaluator({"machine learning", "neural network", ...})
eval_result = evaluator.evaluate(result)
print(eval_result.summary())
# P=0.2800  R=0.0644  F1=0.1047  TP=28  FP=72  FN=407  predicted=100  gold=435

# Evaluate top-k
eval_at_50 = evaluator.evaluate_at_k(result, k=50)
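
The figures in the summary line follow from the standard set-based definitions of precision, recall and F1. The sketch below recomputes them from the TP/FP/FN counts shown above; `prf_from_counts` is a hypothetical helper written for illustration and is not part of the JATE API.

```python
def prf_from_counts(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    predicted = tp + fp  # number of terms the extractor returned
    gold = tp + fn       # number of terms in the gold standard
    precision = tp / predicted if predicted else 0.0
    recall = tp / gold if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts taken from the example summary line: TP=28  FP=72  FN=407
p, r, f1 = prf_from_counts(tp=28, fp=72, fn=407)
print(f"P={p:.4f}  R={r:.4f}  F1={f1:.4f}")
```

With 100 predicted terms and 435 gold terms, low recall is expected: at most 100 of the 435 gold terms could ever be matched.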

CLI

# Extract terms from text
jate extract "Your text here" --algorithm cvalue --top 20

# Extract from a corpus directory
jate corpus path/to/docs/ --algorithm tfidf --output csv

# Compare algorithms on a corpus
jate compare path/to/docs/ --algorithms cvalue tfidf rake

# Run benchmark on built-in dataset (use --list-datasets to see all options)
jate benchmark --dataset acl_rdtec_mini --top 100

REST API (thin server)

JATE now ships a thin JSON API server on top of the core extraction API.

Install server dependencies:

pip install "jate[server]"

Start the server:

jate-api

Or with Python module execution:

python -m uvicorn jate.server:app --host 0.0.0.0 --port 8000

Extract terms over HTTP:

curl --header "Content-Type: application/json" \
    --request POST \
    --data '{"text":"text to process","algorithm":"cvalue"}' \
    http://localhost:8000/jate/api/v1/extract
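
The same request can be issued from Python with the standard library alone. This is a minimal sketch that mirrors the curl call above: the endpoint path and the `text`, `algorithm` and `top` fields come from the examples in this README, while the helper names `build_extract_request` and `extract_terms` are hypothetical.

```python
import json
import urllib.request


def build_extract_request(text, algorithm="cvalue", top=None,
                          base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the extract endpoint."""
    payload = {"text": text, "algorithm": algorithm}
    if top is not None:
        payload["top"] = top
    return urllib.request.Request(
        url=f"{base_url}/jate/api/v1/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_terms(text, **kwargs):
    """Send the request and return the decoded JSON body (needs a running server)."""
    request = build_extract_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Assumes the server is listening on localhost:8000.
    print(extract_terms("text to process", algorithm="cvalue"))
```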

Health checks:

curl http://localhost:8000/health/live
curl http://localhost:8000/health/ready

Docker / Containerization

Build the image from repo root:

docker build -t jate:latest .

Run modes:

# 1) CLI mode (default)
docker run --rm jate:latest jate extract "local post office" --algorithm cvalue --top 20

# Corpus mode with local volume mount (recommended for local files)
docker run --rm -v "/path/to/local/folder:/data" jate:latest \
    jate corpus /data --algorithm cvalue --top 20

# 2) API mode (explicit)
docker run --rm -d -p 8000:8000 --name jate-api-test jate:latest jate-api

# 3) Interactive mode with local corpus volume
docker run -it --rm -v "$(pwd)/path/to/docs:/data" jate:latest sh
# inside container:
# jate corpus /data --algorithm tfidf --output csv

Test API endpoints (when running API mode):

# Liveness
curl -s http://localhost:8000/health/live

# Readiness (validates spaCy model availability)
curl -s http://localhost:8000/health/ready

# Capabilities
curl -s http://localhost:8000/jate/api/v1/capabilities

# Extract terms
curl -s -X POST http://localhost:8000/jate/api/v1/extract \
    -H "Content-Type: application/json" \
    -d '{"text":"Russia says its consulate in Isfahan, Iran was damaged over the weekend as a result of strikes on the local governor'\''s office.","algorithm":"cvalue","top":6}'

Stop API mode container:

docker stop jate-api-test

Run dual-mode Docker smoke checks (CLI + API) with one build:

bash scripts/docker_smoke_test.sh

Expected extract response shape:

{
    "algorithm": "cvalue",
    "extractor": "pos_pattern",
    "model": "en_core_web_sm",
    "top": 6,
    "terms": [
        {
            "rank": 1,
            "term": "local governors office",
            "score": 1.6323,
            "frequency": 1,
            "surface_forms": ["local governors office"],
            "metadata": {}
        }
    ]
}

Algorithms

Algorithm   Description                                             Reference
----------  ------------------------------------------------------  ----------------------
tfidf       TF-IDF at corpus level
cvalue      Multi-word term extraction via nested term frequency    Frantzi et al. 2000
ncvalue     C-Value extended with context word information          Frantzi et al. 2000
basic       Frequency + containment scoring                         Bordea et al. 2013
combobasic  Basic with parent and child containment                 Bordea et al. 2013
attf        Average total term frequency (TTF / DF)
ttf         Raw total term frequency
ridf        Residual IDF (deviation from Poisson)                   Church & Gale 1995
rake        Rapid Automatic Keyword Extraction                      Rose et al. 2010
chi_square  Chi-square test for term independence                   Matsuo & Ishizuka 2003
weirdness   Target vs reference corpus frequency ratio              Ahmad et al. 1999
termex      Domain pertinence + context + lexical cohesion          Sclano et al. 2007
glossex     Domain specificity via glossary comparison              Park et al. 2002
nmf         Topic modelling via Non-negative Matrix Factorisation
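
To make "nested term frequency" concrete, here is a minimal sketch of the textbook C-Value formula from Frantzi et al. 2000: a candidate's frequency is reduced by the average frequency of the longer candidates that contain it, then weighted by log2 of its length in words. This illustrates the published formula only; JATE's own implementation may differ (implementations often add a smoothing constant to the length so that single-word terms do not score zero), and `c_value` is a hypothetical helper.

```python
import math


def c_value(freq: dict[tuple[str, ...], int]) -> dict[tuple[str, ...], float]:
    """Score candidates (token tuples mapped to frequency) with the textbook C-Value."""

    def contains(longer: tuple[str, ...], shorter: tuple[str, ...]) -> bool:
        # True if `shorter` occurs as a contiguous sub-sequence of a strictly longer term.
        n = len(shorter)
        return len(longer) > n and any(
            longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
        )

    scores = {}
    for term, f_term in freq.items():
        parents = [other for other in freq if contains(other, term)]
        if parents:
            # Nested term: subtract the mean frequency of the containing terms.
            adjusted = f_term - sum(freq[p] for p in parents) / len(parents)
        else:
            adjusted = f_term
        scores[term] = math.log2(len(term)) * adjusted
    return scores


freq = {
    ("deep", "neural", "network"): 2,
    ("neural", "network"): 5,
    ("network",): 7,
}
for term, score in c_value(freq).items():
    print(" ".join(term), round(score, 4))
```

In this toy input, "neural network" occurs 5 times but 2 of those are inside "deep neural network", so its adjusted frequency is 3.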

Multi-algorithm comparison is available via jate.compare(), which also supports ensemble voting via reciprocal rank fusion (voting=True).
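
Reciprocal rank fusion combines several ranked lists by summing 1 / (k + rank) for each term across the lists. The sketch below shows the general technique; the constant k = 60 is the value commonly used in the literature and is an assumption here, as is the helper name `reciprocal_rank_fusion`. It does not describe how jate.compare() implements voting internally.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: dict[str, list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked term lists (best first) into one list sorted by RRF score."""
    fused: dict[str, float] = defaultdict(float)
    for ranked_terms in rankings.values():
        for rank, term in enumerate(ranked_terms, start=1):
            fused[term] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


rankings = {
    "cvalue": ["neural network", "gradient descent", "loss function"],
    "tfidf": ["neural network", "loss function", "learning rate"],
}
for term, score in reciprocal_rank_fusion(rankings):
    print(f"{term:20s} {score:.5f}")
```

Terms ranked well by several algorithms rise to the top, while a term found by only one algorithm still contributes its single reciprocal rank.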

Neural taggers (optional)

JATE also supports transformer-based term taggers that extract terms per-document using BIO sequence labelling. Install with pip install "jate[neural]".

Tagger          Description                                                 Reference
--------------  ----------------------------------------------------------  ----------------
xlmr-tagger     XLM-RoBERTa token classifier, multilingual (100 languages)  Lang et al. 2021
roberta-tagger  RoBERTa token classifier, English only, faster

from jate.algorithms.bert_tagger import XLMRTagger

tagger = XLMRTagger()  # auto-downloads from HuggingFace on first use
result = tagger.tag("Corruption in public procurement is a major challenge.")

for term in result:
    print(f"{term.string:30s}  confidence={term.score:.4f}")

Pre-trained model: ziqizhang2026/jate-ate-xlmr (trained on ACTER). To train your own, open examples/train_bert_tagger.ipynb in Google Colab.

Try the demo: python examples/tagger_demo.py

Candidate extractors

Extractor              Description
---------------------  ------------------------------------------------------------
pos_pattern (default)  Regex over Universal POS tags (default: (ADJ|NOUN|PROPN)*(NOUN|PROPN), configurable via pattern presets)
ngram                  Contiguous token n-grams (configurable min/max n)
noun_phrase            spaCy noun chunk detection
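
One common way to run a regex over POS tags is to encode each tag as a single character and match against the encoded string. The sketch below applies the default pattern (ADJ|NOUN|PROPN)*(NOUN|PROPN) that way and returns the longest non-overlapping matches. It works on pre-tagged (token, tag) pairs so it runs without spaCy; the helper name, the one-letter encoding and the choice of maximal matches are assumptions, not JATE's actual implementation.

```python
import re

# One character per Universal POS tag of interest; everything else becomes "x".
TAG_CODES = {"ADJ": "A", "NOUN": "N", "PROPN": "P"}

# (ADJ|NOUN|PROPN)*(NOUN|PROPN) expressed over the one-character encoding.
DEFAULT_PATTERN = re.compile(r"[ANP]*[NP]")


def pos_pattern_candidates(tagged: list[tuple[str, str]]) -> list[str]:
    """Return candidate terms whose tag sequence matches the default pattern."""
    encoded = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    candidates = []
    for match in DEFAULT_PATTERN.finditer(encoded):
        # Character offsets in `encoded` are token offsets in `tagged`.
        tokens = [token for token, _ in tagged[match.start():match.end()]]
        candidates.append(" ".join(tokens))
    return candidates


sentence = [
    ("Deep", "ADJ"), ("neural", "ADJ"), ("networks", "NOUN"),
    ("improve", "VERB"),
    ("machine", "NOUN"), ("translation", "NOUN"),
    ("quickly", "ADV"),
]
print(pos_pattern_candidates(sentence))
```

Because the pattern must end in a noun or proper noun, a trailing adjective is never included in a candidate.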

How it works

  1. Candidate extraction — identifies potential terms using POS patterns, n-grams, or noun phrases
  2. Lemmatisation — normalises candidates to their lemmatised form (e.g. "neural networks" and "neural network" become one entry)
  3. Sentence context (automatic) — builds sentence co-occurrence and adjacency features for algorithms that use them (Chi-Square, NC-Value)
  4. Corpus statistics — builds frequency and co-occurrence counts (in-memory or SQLite-backed)
  5. Scoring — applies the chosen algorithm to rank candidates
  6. Output — returns TermExtractionResult with the normalised term, score, and all observed surface forms

Each Term in the result contains:

  • string — the canonical (lemmatised) form, used for scoring and evaluation
  • score — algorithm-assigned score
  • frequency — total corpus frequency
  • surface_forms — all surface variants observed (e.g. {"neural network", "neural networks", "Neural Networks"})
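
The relationship between the canonical string, frequency and surface_forms can be sketched as a grouping step: every observed mention is normalised to a key, and mentions sharing a key are merged. The example uses a deliberately crude normaliser (lowercasing and stripping a plural "s") as a stand-in for spaCy lemmatisation; `ToyTerm`, `toy_normalise` and `group_surface_forms` are hypothetical names and do not reflect JATE's classes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTerm:
    string: str                      # canonical (normalised) form
    frequency: int = 0               # total number of mentions
    surface_forms: set[str] = field(default_factory=set)


def toy_normalise(surface: str) -> str:
    """Crude stand-in for lemmatisation: lowercase, strip a plural 's' from the last word."""
    words = surface.lower().split()
    if words and words[-1].endswith("s") and not words[-1].endswith("ss"):
        words[-1] = words[-1][:-1]
    return " ".join(words)


def group_surface_forms(mentions: list[str]) -> dict[str, ToyTerm]:
    """Merge mentions that share a normalised form into a single term entry."""
    terms: dict[str, ToyTerm] = {}
    for surface in mentions:
        key = toy_normalise(surface)
        term = terms.setdefault(key, ToyTerm(string=key))
        term.frequency += 1
        term.surface_forms.add(surface)
    return terms


mentions = ["neural network", "neural networks", "Neural Networks", "neural networks"]
for term in group_surface_forms(mentions).values():
    print(term.string, term.frequency, sorted(term.surface_forms))
```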

spaCy Integration

JATE can be used as a native spaCy pipeline component, reusing the NLP processing already done by spaCy (no double computation):

import spacy
import jate

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("jate", config={"algorithm": "cvalue"})

doc = nlp("Machine learning and neural networks improve deep learning models.")

for term in doc._.terms:
    surface = doc.text[term.spans[0].start:term.spans[0].end] if term.spans else ""
    print(f"{term.string:30s}  score={term.score:.4f}  at {surface!r}")

Configuration options:

Option                    Default    Description
------------------------  ---------  ---------------------------------------------------------
algorithm                 "cvalue"   Any of the 13 algorithms
pattern                   "default"  POS pattern preset (default, genia, acl_rdtec)
min_frequency             1          Minimum term frequency
min_words                 1          Minimum words per term
max_words                 None       Maximum words per term
reference_frequency_file  None       Path to reference corpus (for weirdness, glossex, termex)

Important notes:

  • One algorithm per pipeline (for multi-algorithm comparison, use jate.compare())
  • All algorithms will warn about single-document mode — they are corpus-level methods designed for multi-document extraction. Results on single documents are functional but weaker.
  • TF-IDF will return empty results on single documents (IDF = 0).
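
The IDF = 0 remark follows directly from the unsmoothed definition idf = log(N / df): with a single document, N = 1 and every observed term has df = 1, so log(1) = 0 and every TF-IDF score collapses to zero. The sketch below uses that textbook formula; whether JATE applies exactly this variant is an assumption.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Unsmoothed inverse document frequency: log(N / df)."""
    return math.log(n_docs / doc_freq)


def tf_idf(term_freq: int, n_docs: int, doc_freq: int) -> float:
    """Term frequency weighted by unsmoothed IDF."""
    return term_freq * idf(n_docs, doc_freq)


# Single document: every term that occurs has df == N == 1, so the score is 0.
print(tf_idf(term_freq=12, n_docs=1, doc_freq=1))

# In a 10-document corpus, a term found in 2 documents gets a positive weight.
print(round(tf_idf(term_freq=12, n_docs=10, doc_freq=2), 4))
```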

Try the demo: python examples/spacy_demo.py

Local UI

The local UI (jate ui) opens at http://localhost:8080 with two modes:

Extract — paste text or upload a file, select from 14 algorithms with per-algorithm tuning parameters, view results as a ranked table or as highlighted text, and export to CSV/JSON.

[Image: JATE Extract results table with 14 algorithms]

Corpus — point to a directory of .txt files, select multiple algorithms at once (the shared NLP pipeline makes multi-algorithm runs significantly faster than running them separately), watch real-time progress via Server-Sent Events, and compare results side-by-side.

Or run via Docker:

docker run --rm -p 8080:8080 jate:latest jate ui

Benchmarks

JATE is evaluated on 4 standard ATE datasets using P@K (precision at top-K ranked terms). Best algorithm per dataset at P@100:

Dataset         Domain             Docs   Gold terms  Best P@100  Algorithm
--------------  -----------------  -----  ----------  ----------  ----------
GENIA           Biomedical         2,000  35,298      0.79        attf
ACL RD-TEC 2.0  Comp. linguistics  1,758  5,031       0.73        ttf
ACTER v1.5      Multi-domain       241    5,329       0.61        basic
CoastTerm       Coastal science    2,004  4,316       0.59        combobasic

Full results with P@100 through P@10,000 for all 13 algorithms, methodology notes, and comparison with published baselines: benchmark results.
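
For reference, P@K is the fraction of the top K ranked terms that appear in the gold standard. The sketch below follows the usual convention of dividing by K even when fewer than K terms were returned; `precision_at_k` is a hypothetical helper, and JATE's benchmark code may handle matching and edge cases differently.

```python
def precision_at_k(ranked_terms: list[str], gold: set[str], k: int) -> float:
    """Fraction of the top-k ranked terms (best first) that are gold terms."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for term in ranked_terms[:k] if term in gold)
    return hits / k


ranked = ["neural network", "paper", "loss function", "section"]
gold = {"neural network", "loss function", "learning rate"}

print(precision_at_k(ranked, gold, k=1))
print(precision_at_k(ranked, gold, k=2))
print(precision_at_k(ranked, gold, k=4))
```

Read this way, a P@100 of 0.79 means that 79 of the 100 highest-scoring candidates are gold terms.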

Contributing

Please read the contributing guide first for development setup, branch workflow, and agentic coding harness details. JATE is in active development and we welcome contributions. Here's how you can get involved:

  • Browse open issues — check the feature roadmap for planned enhancements
  • Good first issues — look for issues labelled good first issue if you're new to the project
  • Feature requests: open an issue to suggest new features
  • Bug reports: report here
  • Star the repo to follow progress

Background

JATE was originally developed as part of research at the University of Sheffield, with publications in venues including ACM TKDD and the Semantic Web Journal. The library has been used in academic and industry settings for terminology extraction, ontology learning, and knowledge graph construction.

Key publications:

  • Zhang, Z., Gao, J., Ciravegna, F. (2018). SemRe-Rank: Improving Automatic Term Extraction By Incorporating Semantic Relatedness With Personalised PageRank. ACM TKDD.
  • Zhang, Z., Iria, J., Brewster, C., Ciravegna, F. (2008). A Comparative Evaluation of Term Recognition Algorithms. LREC.

License

Apache 2.0 — see LICENSE for details.

