Just Automatic Term Extraction — the definitive Python library for automatic term extraction

JATE — Just Automatic Term Extraction

A Python library for automatic term extraction (ATE) from text corpora. JATE provides 14 ATE algorithms (13 classical + ensemble voting), corpus-level statistics, built-in evaluation, and a CLI — all pip-installable with no external services required.

Previously known as "Java Automatic Term Extraction" (84+ stars). The original Java/Solr library is preserved on the legacy/java branch.

Installation

pip install jate

Or from source:

git clone https://github.com/ziqizhang/jate.git
cd jate
pip install .

Requires Python 3.11+ and a spaCy model:

python -m spacy download en_core_web_sm

Quick start

Single document

import jate

# Extract terms from text (default: C-Value + POS pattern extraction)
result = jate.extract("Your document text here...")

for term in result:
    print(f"{term.string:30s}  score={term.score:.4f}  surfaces={term.surface_forms}")
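The default scorer named above, C-Value, can be illustrated with a small standalone sketch. This follows the published formula from Frantzi et al. 2000 and is not JATE's own implementation: the function names `contains` and `c_value` and the toy frequencies are assumptions made for illustration. Note that the original formula uses log2 of the term length, so single-word candidates score zero; real implementations often adjust this.

```python
import math


def contains(longer: tuple[str, ...], shorter: tuple[str, ...]) -> bool:
    """True if `shorter` occurs as a contiguous word sequence inside a strictly longer term."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freq: dict[str, int]) -> dict[str, float]:
    """Score candidates with the C-Value formula (hypothetical helper, not JATE's API)."""
    tokens = {term: tuple(term.split()) for term in freq}
    scores: dict[str, float] = {}
    for term, words in tokens.items():
        # Frequencies of longer candidates in which this candidate is nested.
        nesting = [freq[other] for other, other_words in tokens.items()
                   if contains(other_words, words)]
        # Subtract the mean frequency of the nesting terms, if there are any.
        adjusted = freq[term] - (sum(nesting) / len(nesting) if nesting else 0.0)
        scores[term] = math.log2(len(words)) * adjusted
    return scores


# Toy candidate frequencies (made up for illustration).
toy_freq = {
    "neural network": 10,
    "deep neural network": 4,
    "recurrent neural network": 2,
    "network": 15,
}

scores = c_value(toy_freq)
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:28s} {score:.4f}")
```

Here "neural network" occurs 10 times but is nested in two longer candidates with a mean frequency of 3, so its adjusted frequency is 7 and its score is log2(2) * 7 = 7.0.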

Corpus-level extraction

import jate

# From a list of texts
result = jate.extract_corpus(
    ["First document...", "Second document..."],
    algorithm="tfidf",
)

# From a directory of text files
result = jate.extract_corpus("path/to/corpus/", algorithm="cvalue")

# Export results
df = result.to_dataframe()
print(result.to_csv())

Compare algorithms

import jate

results = jate.compare(
    ["Doc one...", "Doc two..."],
    algorithms=["cvalue", "tfidf", "rake", "weirdness"],
)

for algo_name, result in results.items():
    print(f"\n{algo_name}: {len(result)} terms")
    for term in list(result)[:5]:
        print(f"  {term.string:30s}  {term.score:.4f}")

For large corpora, speed up with parallel processing:

# docs is a list of document strings, as in the example above
config = jate.JATEConfig(max_workers=4)
results = jate.compare(docs, algorithms=["cvalue", "tfidf", "rake"], config=config)

Evaluation against a gold standard

import jate

# docs is a list of document strings or a corpus directory path, as in the examples above
result = jate.extract_corpus(docs, algorithm="cvalue")

evaluator = jate.Evaluator(gold_terms={"machine learning", "neural network", ...})
eval_result = evaluator.evaluate(result)
print(eval_result.summary())
# P=0.2800  R=0.0644  F1=0.1047  TP=28  FP=72  FN=407  predicted=100  gold=435

# Evaluate top-k
eval_at_50 = evaluator.evaluate_at_k(result, k=50)
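The figures in the sample summary line follow from the standard definitions of precision, recall and F1. The sketch below recomputes them from synthetic term sets with the same counts (TP=28, FP=72, FN=407). It assumes exact string matching between predicted and gold terms; the function name `precision_recall_f1` is hypothetical and this is not JATE's Evaluator.

```python
def precision_recall_f1(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Set-based precision, recall and F1 (illustrative helper, not JATE's API)."""
    tp = len(predicted & gold)   # predicted terms that are in the gold standard
    fp = len(predicted - gold)   # predicted terms that are not
    fn = len(gold - predicted)   # gold terms that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "tp": tp, "fp": fp, "fn": fn}


# Synthetic data: 100 predictions, 28 of which appear among 435 gold terms.
predicted = {f"term-{i}" for i in range(100)}
gold = {f"term-{i}" for i in range(28)} | {f"gold-only-{i}" for i in range(407)}

metrics = precision_recall_f1(predicted, gold)
print(f"P={metrics['precision']:.4f}  R={metrics['recall']:.4f}  F1={metrics['f1']:.4f}")
# P=0.2800  R=0.0644  F1=0.1047
```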

CLI

# Extract terms from text
jate extract "Your text here" --algorithm cvalue --top 20

# Extract from a corpus directory
jate corpus path/to/docs/ --algorithm tfidf --output csv

# Compare algorithms on a corpus
jate compare path/to/docs/ --algorithms cvalue tfidf rake

# Run benchmark on built-in dataset
jate benchmark --top 100

Algorithms

Algorithm   Description                                           Reference
tfidf       TF-IDF at corpus level
cvalue      Multi-word term extraction via nested term frequency  Frantzi et al. 2000
ncvalue     C-Value extended with context word information        Frantzi et al. 2000
basic       Frequency + containment scoring                       Bordea et al. 2013
combobasic  Basic with parent and child containment               Bordea et al. 2013
attf        Average total term frequency (TTF / DF)
ttf         Raw total term frequency
ridf        Residual IDF (deviation from Poisson)                 Church & Gale 1995
rake        Rapid Automatic Keyword Extraction                    Rose et al. 2010
chi_square  Chi-square test for term independence                 Matsuo & Ishizuka 2003
weirdness   Target vs reference corpus frequency ratio            Ahmad et al. 1999
termex      Domain pertinence + context + lexical cohesion        Sclano et al. 2007
glossex     Domain specificity via glossary comparison            Park et al. 2002
voting      Ensemble via reciprocal rank fusion
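The voting entry combines rankings by reciprocal rank fusion. A minimal sketch of the general technique follows: each term receives 1 / (k + rank) from every ranking that contains it, and the contributions are summed. The constant k=60 is a commonly used default, not a value confirmed for JATE, and the function name and tie-breaking rule are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first rankings into one list (illustrative, not JATE's API)."""
    fused: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1; terms missing from a ranking contribute nothing.
        for rank, term in enumerate(ranking, start=1):
            fused[term] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy rankings, as three different scorers might produce them.
rankings = [
    ["neural network", "machine learning", "gradient descent"],
    ["machine learning", "neural network", "loss function"],
    ["machine learning", "gradient descent", "neural network"],
]

fused = reciprocal_rank_fusion(rankings)
for term, score in fused:
    print(f"{term:20s} {score:.5f}")
```

Because only ranks are used, scorers with incomparable score scales (for example TF-IDF and C-Value) can be combined without normalising their raw scores.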

Candidate extractors

Extractor              Description
pos_pattern (default)  Regex over Universal POS tags (e.g. (ADJ )*(NOUN )+)
ngram                  Contiguous token n-grams (configurable min/max n)
noun_phrase            spaCy noun chunk detection
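To show how a regex over POS tags can select candidates, here is a minimal sketch that applies the example pattern (ADJ )*(NOUN )+ to a sentence that has already been tagged. The tags are hard-coded so the snippet runs without spaCy; the function name `pos_pattern_candidates` is hypothetical, and JATE's extractor may differ, for instance in whether it also emits nested sub-spans.

```python
import re

# The example pattern from the table, with a word boundary so matches start on a whole tag.
PATTERN = re.compile(r"\b(?:ADJ )*(?:NOUN )+")


def pos_pattern_candidates(tagged: list[tuple[str, str]]) -> list[str]:
    """Return maximal, non-overlapping token spans whose tag sequence matches PATTERN."""
    # Encode tags as "TAG TAG TAG " so every tag is followed by exactly one space.
    tag_string = "".join(f"{tag} " for _, tag in tagged)
    candidates = []
    for match in PATTERN.finditer(tag_string):
        # The number of spaces before an offset equals the number of complete tags before it.
        first = tag_string.count(" ", 0, match.start())
        last = tag_string.count(" ", 0, match.end())
        candidates.append(" ".join(word for word, _ in tagged[first:last]))
    return candidates


# Pre-tagged toy sentence (Universal POS tags assigned by hand).
tagged = [
    ("The", "DET"), ("deep", "ADJ"), ("neural", "ADJ"), ("network", "NOUN"),
    ("learns", "VERB"), ("word", "NOUN"), ("embeddings", "NOUN"), (".", "PUNCT"),
]

print(pos_pattern_candidates(tagged))
# ['deep neural network', 'word embeddings']
```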

How it works

  1. Candidate extraction — identifies potential terms using POS patterns, n-grams, or noun phrases
  2. Lemmatisation — normalises candidates to their lemmatised form (e.g. "neural networks" and "neural network" become one entry)
  3. Sentence context (automatic) — builds sentence co-occurrence and adjacency features for algorithms that use them (Chi-Square, NC-Value)
  4. Corpus statistics — builds frequency and co-occurrence counts (in-memory or SQLite-backed)
  5. Scoring — applies the chosen algorithm to rank candidates
  6. Output — returns TermExtractionResult with the normalised term, score, and all observed surface forms

Each Term in the result contains:

  • string — the canonical (lemmatised) form, used for scoring and evaluation
  • score — algorithm-assigned score
  • frequency — total corpus frequency
  • surface_forms — all surface variants observed (e.g. {"neural network", "neural networks", "Neural Networks"})
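The relationship between the canonical string, the frequency and the surface forms can be sketched in a few lines. This is an illustration under stated assumptions, not JATE's code: `ToyTerm` only mirrors the four fields listed above, `toy_lemmatise` is a crude stand-in for a real lemmatiser, and the score is simply the raw frequency.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ToyTerm:
    """Hypothetical record mirroring the fields described above."""
    string: str
    score: float
    frequency: int
    surface_forms: set[str] = field(default_factory=set)


def toy_lemmatise(phrase: str) -> str:
    """Crude stand-in for lemmatisation: lowercase and strip a plural 's' from the last word."""
    words = phrase.lower().split()
    if words and words[-1].endswith("s") and not words[-1].endswith("ss"):
        words[-1] = words[-1][:-1]
    return " ".join(words)


def group_candidates(mentions: list[str]) -> list[ToyTerm]:
    """Merge surface variants under one canonical form and count occurrences."""
    counts: dict[str, int] = defaultdict(int)
    surfaces: dict[str, set[str]] = defaultdict(set)
    for mention in mentions:
        key = toy_lemmatise(mention)
        counts[key] += 1
        surfaces[key].add(mention)
    terms = [ToyTerm(string=key, score=float(n), frequency=n, surface_forms=surfaces[key])
             for key, n in counts.items()]
    # Rank by score (here: raw frequency), highest first.
    return sorted(terms, key=lambda t: (-t.score, t.string))


mentions = ["neural network", "neural networks", "Neural Networks",
            "loss function", "neural network"]

terms = group_candidates(mentions)
for term in terms:
    print(term.string, term.frequency, sorted(term.surface_forms))
```

All four mentions of the neural network variants collapse into one entry with frequency 4 and three distinct surface forms.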

Roadmap

  • spaCy pipeline integration via nlp.add_pipe("jate")
  • Interactive web demo — Streamlit UI with HuggingFace Spaces deployment
  • More benchmarks — ACTER, GENIA, CoastTerm, TermEval datasets
  • Neural methods — BERT-based sequence labeling, embedding-based scoring
  • LLM-augmented extraction — optional LLM re-ranking and validation
  • Agentic pipeline — LangGraph-powered orchestration for automatic algorithm selection
  • Multilingual support — works with any spaCy language model
  • Production-ready — strict typing, >90% test coverage, Docker, PyPI publishing

Get involved

JATE is in active development, and we'd love your input.

Background

JATE was originally developed as part of research at the University of Sheffield, with publications in venues including ACM TKDD and the Semantic Web Journal. The library has been used in academic and industry settings for terminology extraction, ontology learning, and knowledge graph construction.

Key publications:

  • Zhang, Z., Gao, J., Ciravegna, F. (2018). SemRe-Rank: Improving Automatic Term Extraction By Incorporating Semantic Relatedness With Personalised PageRank. ACM TKDD.
  • Zhang, Z., Iria, J., Brewster, C., Ciravegna, F. (2008). A Comparative Evaluation of Term Recognition Algorithms. LREC.

License

Apache 2.0 — see LICENSE for details.


Built with research expertise and agentic coding tools.

Download files

Source Distribution

jate-3.0.0.tar.gz (45.9 kB)

Uploaded Source

Built Distribution

jate-3.0.0-py3-none-any.whl (66.2 kB)

Uploaded Python 3

File details

Details for the file jate-3.0.0.tar.gz.

File metadata

  • Download URL: jate-3.0.0.tar.gz
  • Upload date:
  • Size: 45.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for jate-3.0.0.tar.gz
Algorithm    Hash digest
SHA256       8df0c245205a06ae4ee306d3ea1e4144a3d2003906bb9f36f964ad039cb6012f
MD5          f949ea05a6cf0610167398ed27b6f984
BLAKE2b-256  638394e3dbda626cece9148ce79c4c005a6fd5252ca0394107037c504ca67f3b

Provenance

The following attestation bundles were made for jate-3.0.0.tar.gz:

Publisher: publish.yml on ziqizhang/jate

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file jate-3.0.0-py3-none-any.whl.

File metadata

  • Download URL: jate-3.0.0-py3-none-any.whl
  • Upload date:
  • Size: 66.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for jate-3.0.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       c2039e70a1f6d33d30039f9f42254bda912e8eb4b31ae27fe289c7314173e32a
MD5          746ffbb4793865607057a6a12a97d2ef
BLAKE2b-256  d0a2613e4dfb46935fa4c42cdfee52a2e3a8cb1dc8427f08ee26fda0adaf9f38

Provenance

The following attestation bundles were made for jate-3.0.0-py3-none-any.whl:

Publisher: publish.yml on ziqizhang/jate

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
