Just Automatic Term Extraction — the definitive Python library for automatic term extraction
JATE — Just Automatic Term Extraction
A Python library for automatic term extraction (ATE) from text corpora. JATE provides 14 ATE algorithms (13 classical + ensemble voting), corpus-level statistics, built-in evaluation, and a CLI — all pip-installable with no external services required.
Previously known as "Java Automatic Term Extraction" (84+ stars). The original Java/Solr library is preserved on the `legacy/java` branch.
Installation
```bash
pip install jate
```
Or from source:
```bash
git clone https://github.com/ziqizhang/jate.git
cd jate
pip install .
```
Requires Python 3.11+ and a spaCy model:
```bash
python -m spacy download en_core_web_sm
```
Quick start
Single document
```python
import jate

# Extract terms from text (default: C-Value + POS pattern extraction)
result = jate.extract("Your document text here...")
for term in result:
    print(f"{term.string:30s} score={term.score:.4f} surfaces={term.surface_forms}")
```
Corpus-level extraction
```python
import jate

# From a list of texts
result = jate.extract_corpus(
    ["First document...", "Second document..."],
    algorithm="tfidf",
)

# From a directory of text files
result = jate.extract_corpus("path/to/corpus/", algorithm="cvalue")

# Export results
df = result.to_dataframe()
print(result.to_csv())
```
Compare algorithms
```python
import jate

results = jate.compare(
    ["Doc one...", "Doc two..."],
    algorithms=["cvalue", "tfidf", "rake", "weirdness"],
)
for algo_name, result in results.items():
    print(f"\n{algo_name}: {len(result)} terms")
    for term in list(result)[:5]:
        print(f"  {term.string:30s} {term.score:.4f}")
```
For large corpora, speed up with parallel processing:
```python
config = jate.JATEConfig(max_workers=4)
results = jate.compare(docs, algorithms=["cvalue", "tfidf", "rake"], config=config)
```
Evaluation against a gold standard
```python
import jate

result = jate.extract_corpus(docs, algorithm="cvalue")
evaluator = jate.Evaluator(gold_terms={"machine learning", "neural network", ...})
eval_result = evaluator.evaluate(result)
print(eval_result.summary())
# P=0.2800 R=0.0644 F1=0.1047 TP=28 FP=72 FN=407 predicted=100 gold=435

# Evaluate top-k
eval_at_50 = evaluator.evaluate_at_k(result, k=50)
```
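The figures in the sample summary line can be checked by hand from its TP, FP and FN counts. The sketch below assumes exact string matching between predicted and gold terms, which may differ from how `jate.Evaluator` normalises and matches terms. The helper names `prf_from_counts` and `prf_from_sets` are illustrative and are not part of JATE.

```python
def prf_from_counts(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from TP, FP and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def prf_from_sets(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Exact-match scoring (an assumption): a prediction counts only if
    the identical string appears in the gold set."""
    tp = len(predicted & gold)
    return prf_from_counts(tp, len(predicted) - tp, len(gold) - tp)


# Counts taken from the sample summary line: TP=28 FP=72 FN=407
p, r, f1 = prf_from_counts(28, 72, 407)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")  # P=0.2800 R=0.0644 F1=0.1047
```

Recall is low in that sample because only 100 terms were predicted against a gold list of 435, so at most 100/435 of the gold terms could have been found.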
CLI
```bash
# Extract terms from text
jate extract "Your text here" --algorithm cvalue --top 20

# Extract from a corpus directory
jate corpus path/to/docs/ --algorithm tfidf --output csv

# Compare algorithms on a corpus
jate compare path/to/docs/ --algorithms cvalue tfidf rake

# Run benchmark on built-in dataset
jate benchmark --top 100
```
Algorithms
| Algorithm | Description | Reference |
|---|---|---|
| `tfidf` | TF-IDF at corpus level | - |
| `cvalue` | Multi-word term extraction via nested term frequency | Frantzi et al. 2000 |
| `ncvalue` | C-Value extended with context word information | Frantzi et al. 2000 |
| `basic` | Frequency + containment scoring | Bordea et al. 2013 |
| `combobasic` | Basic with parent and child containment | Bordea et al. 2013 |
| `attf` | Average total term frequency (TTF / DF) | - |
| `ttf` | Raw total term frequency | - |
| `ridf` | Residual IDF (deviation from Poisson) | Church & Gale 1995 |
| `rake` | Rapid Automatic Keyword Extraction | Rose et al. 2010 |
| `chi_square` | Chi-square test for term independence | Matsuo & Ishizuka 2003 |
| `weirdness` | Target vs reference corpus frequency ratio | Ahmad et al. 1999 |
| `termex` | Domain pertinence + context + lexical cohesion | Sclano et al. 2007 |
| `glossex` | Domain specificity via glossary comparison | Park et al. 2002 |
| `voting` | Ensemble via reciprocal rank fusion | - |
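To illustrate the idea behind the `voting` row, here is a minimal sketch of reciprocal rank fusion over several ranked term lists. It is not JATE's implementation: the function name `reciprocal_rank_fusion` is hypothetical, and the smoothing constant `k=60` is the value commonly used in the literature, assumed here rather than taken from the library.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) to a term's score.

    k=60 is an assumed default, not a value confirmed for JATE.
    """
    fused: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, term in enumerate(ranking, start=1):  # ranks start at 1
            fused[term] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Three hypothetical algorithm outputs, best term first
rankings = [
    ["neural network", "machine learning", "data"],
    ["machine learning", "neural network", "gradient descent"],
    ["machine learning", "gradient descent", "neural network"],
]
for term, score in reciprocal_rank_fusion(rankings):
    print(f"{term:20s} {score:.5f}")
```

Because only ranks are used, the fusion does not depend on the score scales of the individual algorithms, which is why it suits combining scorers as different as TF-IDF and C-Value.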
Candidate extractors
| Extractor | Description |
|---|---|
| `pos_pattern` (default) | Regex over Universal POS tags (e.g. `(ADJ )*(NOUN )+`) |
| `ngram` | Contiguous token n-grams (configurable min/max n) |
| `noun_phrase` | spaCy noun chunk detection |
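The following sketch shows how a regex such as `(ADJ )*(NOUN )+` can be applied to a sequence of Universal POS tags. It uses hand-tagged tokens so that it runs without a spaCy model; in practice the tags would come from spaCy's `token.pos_`. The function name `pos_pattern_candidates` is hypothetical, and JATE's own extractor may differ, for example in how it handles nested candidates.

```python
import re

PATTERN = re.compile(r"(ADJ )*(NOUN )+")


def pos_pattern_candidates(tagged: list[tuple[str, str]]) -> list[str]:
    """Return token spans whose POS tag sequence matches PATTERN."""
    tokens = [token for token, _ in tagged]
    # Join tags with a trailing space each, e.g. "DET ADJ NOUN "
    tag_string = "".join(f"{tag} " for _, tag in tagged)

    # Map the character offset of each tag to its token index
    offset_to_index: dict[int, int] = {}
    offset = 0
    for index, (_, tag) in enumerate(tagged):
        offset_to_index[offset] = index
        offset += len(tag) + 1
    offset_to_index[offset] = len(tagged)  # end-of-sequence sentinel

    candidates = []
    for match in PATTERN.finditer(tag_string):
        if match.start() not in offset_to_index:
            continue  # ignore matches that do not start on a tag boundary
        first = offset_to_index[match.start()]
        last = offset_to_index[match.end()]
        candidates.append(" ".join(tokens[first:last]))
    return candidates


tagged = [
    ("The", "DET"), ("deep", "ADJ"), ("neural", "ADJ"), ("network", "NOUN"),
    ("learns", "VERB"), ("feature", "NOUN"), ("representations", "NOUN"),
    (".", "PUNCT"),
]
print(pos_pattern_candidates(tagged))
# ['deep neural network', 'feature representations']
```

Note that `finditer` returns only the longest non-overlapping matches, so this sketch does not emit nested candidates such as "neural network" inside "deep neural network".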
How it works
- Candidate extraction: identifies potential terms using POS patterns, n-grams, or noun phrases
- Lemmatisation: normalises candidates to their lemmatised form (e.g. "neural networks" and "neural network" become one entry)
- Sentence context (automatic): builds sentence co-occurrence and adjacency features for algorithms that use them (Chi-Square, NC-Value)
- Corpus statistics: builds frequency and co-occurrence counts (in-memory or SQLite-backed)
- Scoring: applies the chosen algorithm to rank candidates
- Output: returns `TermExtractionResult` with the normalised term, score, and all observed surface forms
Each Term in the result contains:
- `string`: the canonical (lemmatised) form, used for scoring and evaluation
- `score`: algorithm-assigned score
- `frequency`: total corpus frequency
- `surface_forms`: all surface variants observed (e.g. `{"neural network", "neural networks", "Neural Networks"}`)
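As a rough illustration of how surface variants can collapse into one canonical entry, the sketch below groups `(surface, lemma)` pairs into records with the four fields listed above. `TermSketch` and `group_by_lemma` are hypothetical names, not JATE classes; the lemmas are supplied by hand rather than produced by spaCy, and raw frequency stands in for an algorithm score.

```python
from dataclasses import dataclass, field


@dataclass
class TermSketch:
    """Illustrative stand-in for a term record; not JATE's Term class."""
    string: str
    score: float = 0.0
    frequency: int = 0
    surface_forms: set[str] = field(default_factory=set)


def group_by_lemma(occurrences: list[tuple[str, str]]) -> dict[str, TermSketch]:
    """Merge (surface, lemma) occurrences into one record per lemma."""
    terms: dict[str, TermSketch] = {}
    for surface, lemma in occurrences:
        term = terms.setdefault(lemma, TermSketch(string=lemma))
        term.frequency += 1
        term.surface_forms.add(surface)
    for term in terms.values():
        # Placeholder score (assumption): raw frequency of the lemma
        term.score = float(term.frequency)
    return terms


occurrences = [
    ("neural networks", "neural network"),
    ("Neural Networks", "neural network"),
    ("neural network", "neural network"),
    ("term extraction", "term extraction"),
]
for term in group_by_lemma(occurrences).values():
    print(term.string, term.frequency, sorted(term.surface_forms))
```

Scoring on the lemmatised form means that plural and capitalised variants pool their counts instead of competing as separate candidates.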
Roadmap
- spaCy pipeline integration: `nlp.add_pipe("jate")`
- Interactive web demo: Streamlit UI with HuggingFace Spaces deployment
- More benchmarks: ACTER, GENIA, CoastTerm, TermEval datasets
- Neural methods: BERT-based sequence labeling, embedding-based scoring
- LLM-augmented extraction: optional LLM re-ranking and validation
- Agentic pipeline: LangGraph-powered orchestration for automatic algorithm selection
- Multilingual support: works with any spaCy language model
- Production-ready: strict typing, >90% test coverage, Docker, PyPI publishing
Get involved
JATE is in active development. We'd love your input:
- Feature requests: Open an issue
- Bug reports: Report here
- Star the repo to follow progress
Background
JATE was originally developed as part of research at the University of Sheffield, with publications in venues including ACM TKDD and the Semantic Web Journal. The library has been used in academic and industry settings for terminology extraction, ontology learning, and knowledge graph construction.
Key publications:
- Zhang, Z., Gao, J., Ciravegna, F. (2018). SemRe-Rank: Improving Automatic Term Extraction By Incorporating Semantic Relatedness With Personalised PageRank. ACM TKDD.
- Zhang, Z., Iria, J., Brewster, C., Ciravegna, F. (2008). A Comparative Evaluation of Term Recognition Algorithms. LREC.
License
Apache 2.0 — see LICENSE for details.
Built with research expertise and agentic coding tools.