Comprehensive morphological analysis: derivational stemming, inflectional analysis, and cross-lingual etymology
Crosstem
A comprehensive Python package for morphological analysis combining derivational stemming, inflectional analysis, and cross-lingual etymology.
What's New in 1.0
- Rust-accelerated derivational stemming backend via PyO3
- Automatic fallback to pure-Python derivational logic when Rust extension is unavailable
- Backend parity coverage for `stem`, `get_derivations`, and `get_word_family`
- Updated benchmark harness with active-backend vs Python-fallback comparisons
- Production-stable `1.0.0` packaging metadata
Why Crosstem?
Crosstem finds true linguistic roots across part-of-speech boundaries, which is something traditional stemmers and lemmatizers cannot do.
What Makes It Different
# Traditional stemmers (Porter, Lancaster) - Rule-based, prone to errors
Porter: "organization" → "organ" # Overstemming loses meaning
# Lemmatizers (WordNet, spaCy) - Only handle inflections, not derivations
WordNet: "organization" → "organization" # Can't cross POS boundaries
WordNet: "beautiful" → "beautiful" # Stuck at adjective form
# Crosstem - Linguistically accurate, crosses POS boundaries
Crosstem: "organization" → "organize" # Noun → Verb (true root)
Crosstem: "beautiful" → "beauty" # Adjective → Noun (semantic base)
Key Advantages
- Cross-POS derivational stemming: Only library that finds roots across parts of speech
- Linguistic accuracy: Uses MorphyNet morphological data, not brittle rules
- Etymology tracing: 4.2M relationships across 2,265 languages (unique feature)
- Word families: Discover complete derivational networks (e.g., organize → 43 related words)
- Fast hybrid runtime: Rust-accelerated derivational engine with automatic pure-Python fallback
- 15 languages: Multilingual morphology support out of the box
Performance Benchmark vs Porter
We compared Crosstem against the widely-used Porter stemmer on 44 English words with 1,000 iterations each.
Speed Results
Crosstem: ~0.036s (~1,217,000 words/sec)
Porter: ~0.490s (~90,000 words/sec)
⚡ Crosstem is ~13× FASTER than Porter
Why? Crosstem resolves each word with average-case O(1) hash lookups in dictionaries loaded from JSON, while Porter applies a sequence of pattern-matching rules to every word.
Note: Results averaged over multiple runs; ±3% variance is normal due to system load.
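The words/sec figures follow from words × iterations ÷ elapsed time (44 × 1,000 words in ~0.036s is roughly 1.2 million words/sec). A minimal sketch of that measurement is below; `words_per_second` and the toy dictionary are hypothetical names used for illustration and are not the project's benchmark harness.

```python
import time


def words_per_second(stem_fn, words, iterations=1000):
    """Call stem_fn on every word `iterations` times and return throughput."""
    start = time.perf_counter()
    for _ in range(iterations):
        for word in words:
            stem_fn(word)
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (len(words) * iterations) / elapsed


# Toy stand-in for a dictionary-backed stemmer (illustrative data only).
TOY_ROOTS = {"organization": "organize", "beautiful": "beauty"}


def toy_stem(word):
    """Single hash lookup; unknown words are returned unchanged."""
    return TOY_ROOTS.get(word, word)


rate = words_per_second(toy_stem, list(TOY_ROOTS), iterations=100)
print(f"{rate:,.0f} words/sec")
```

Any callable that takes a word can be passed as `stem_fn`, so the same function can time two stemmers on an identical word list.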
Accuracy Comparison
| Word | Crosstem | Porter | Winner |
|---|---|---|---|
| organization | organize | organ | ✅ Crosstem (finds true root) |
| organizational | organize | organiz | ✅ Crosstem (multi-hop) |
| beautiful | beauty | beauti | ✅ Crosstem (crosses POS) |
| destruction | destruct | destruct | ⚖️ Tie |
| democracy | democracy | democraci | ✅ Crosstem (avoids error) |
| computerization | compute | computer | ✅ Crosstem (deeper root) |
| happiness | happy | happi | ✅ Crosstem (productivity filter avoids "hap") |
| redness | red | red | ⚖️ Tie |
Key Findings:
- Cross-POS stemming: Crosstem finds roots across parts of speech (`organization` → `organize`, a verb); Porter cannot.
- Overstemming prevention: Porter creates non-words (`beauti`, `organiz`); Crosstem returns real dictionary words.
- Data quality: Crosstem filters bad roots (`democrat`); Porter has no quality control.
- Multi-hop: Crosstem traverses multiple derivations (`organizational` → `organization` → `organize`); Porter strips suffixes by rule without following derivational links.
When to Use Each
Choose Crosstem when:
- ✅ Need linguistically accurate roots
- ✅ Working with derivational families (organize/organizer/organization)
- ✅ Building semantic search, clustering, or word embeddings
- ✅ Quality matters more than simplicity
- ✅ Multilingual support needed (15 languages)
Choose Porter when:
- ✅ Legacy system compatibility required
- ✅ Working with noisy/misspelled text (rule-based is robust)
- ✅ Only need basic suffix normalization
- ✅ Want the absolute simplest possible solution
Note: In the benchmark above, Crosstem was both faster and more accurate than Porter, which makes it a strong default for most modern NLP applications.
Features
- Derivational Stemming: Find roots across part-of-speech boundaries (organization → organize)
- Inflectional Analysis: Lemmatization and grammatical forms (running → run)
- Cross-Lingual Etymology: Trace word origins across 2,265 languages
- Word Family Analysis: Discover complete derivational networks
Installation
pip install crosstem
Crosstem now supports an accelerated Rust derivational backend (PyO3).
When a prebuilt wheel includes the extension, it is used automatically.
If the Rust extension is unavailable, Crosstem falls back to the pure-Python
derivational implementation.
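The fallback described above is commonly implemented as a guarded import. The sketch below shows the general pattern only; the module name `crosstem_rust_hypothetical` and the `load_backend` helper are assumptions made for illustration and do not reflect Crosstem's actual internals.

```python
import importlib


def load_backend(module_name="crosstem_rust_hypothetical"):
    """Try to import a compiled extension; fall back to pure Python.

    Returns a (backend_name, module_or_None) pair. The default module
    name is hypothetical and is not expected to exist.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Extension missing (e.g. no prebuilt wheel for this platform).
        return "python", None
    return "rust", module


backend_name, backend_module = load_backend()
print(f"Active backend: {backend_name}")
```

Keeping the decision in one place lets the rest of the code branch on `backend_name` rather than repeating the import check.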
Build From Source With Rust Backend
pip install maturin
maturin develop --manifest-path rust/Cargo.toml
To force the pure-Python path for comparison/debugging:
from crosstem import DerivationalStemmer
stemmer = DerivationalStemmer("eng", use_rust_backend=False)
Optional: Etymology Data
Etymology features require additional data (~1 GB) that's downloaded separately:
from crosstem import download_etymology
# One-time download (saves to package data directory)
download_etymology()
Or from command line:
python -m crosstem.download
Quick Start
Basic Morphological Analysis
from crosstem import MorphologyAnalyzer
# Works immediately - no etymology needed
analyzer = MorphologyAnalyzer('eng', load_etymology=False)
result = analyzer.analyze('organizations')
print(result['derivational_stem']) # 'organize'
print(result['inflectional_lemma']) # 'organization'
With Etymology Features
from crosstem import MorphologyAnalyzer, download_etymology
# Download etymology data first (one-time)
if not MorphologyAnalyzer.is_etymology_available():
    download_etymology()
# Now etymology features work
analyzer = MorphologyAnalyzer('eng', load_etymology=True)
result = analyzer.analyze('portmanteau')
print(result['etymology']) # Shows Middle French origin
Usage
Derivational Stemming
from crosstem import DerivationalStemmer
stemmer = DerivationalStemmer('eng')
stemmer.stem('organization') # 'organize'
stemmer.stem('beautiful') # 'beauty'
family = stemmer.get_word_family('organize')
# ['organize', 'organizer', 'organization', ...]
Inflectional Analysis
from crosstem import InflectionAnalyzer
inflector = InflectionAnalyzer('eng')
inflector.get_lemma('running') # 'run'
forms = inflector.get_inflections('run')
# [{'form': 'runs', 'pos': 'V', ...}, ...]
Etymology (Requires Download)
from crosstem import EtymologyLinker, download_etymology
# Download etymology data first (one-time, ~1 GB)
download_etymology()
linker = EtymologyLinker()
chain = linker.trace_origin_chain('portmanteau', 'English')
# [{'term': 'portmanteau', 'lang': 'English'},
# {'term': 'portemanteau', 'lang': 'Middle French', ...}]
Supported Languages
15 languages with derivational morphology:
- English (eng), Russian (rus), French (fra), German (deu), Spanish (spa)
- Portuguese (por), Italian (ita), Polish (pol), Czech (ces)
- Serbo-Croatian (hbs), Hungarian (hun), Finnish (fin)
- Swedish (swe), Mongolian (mon), Catalan (cat)
Plus 2,265 languages with etymology data.
How It Works
Theoretical Framework
Crosstem is built on three pillars of morphological linguistics:
1. Derivational Morphology (Word Formation)
Unlike inflection (which modifies words grammatically), derivation creates new words by adding affixes or converting between parts of speech:
- `organize` + `-ation` → `organization` (verb → noun)
- `beauty` + `-ful` → `beautiful` (noun → adjective)
- `organize` + `-er` → `organizer` (agent noun)
Crosstem models this as a directed graph where:
- Nodes = word forms with POS tags
- Edges = derivational relationships (affixes, conversions)
- Stemming = graph traversal to find the root (preferring verbs and shorter forms)
   ┌─────────────┐
   │  organize   │ ← ROOT (verb, shortest)
   │     (V)     │
   └──────┬──────┘
    ┌─────┴──────┬─────────────┬────────────┐
    ▼            ▼             ▼            ▼
organizer   organization   organized   reorganize
   (N)          (N)          (ADJ)         (V)
                 │
                 ▼
          organizational
               (ADJ)
This graph-based approach ensures linguistically accurate roots, avoiding the overstemming problem of rule-based stemmers.
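To make the graph model concrete, here is a minimal sketch of the diagram above as a Python dictionary. It is not Crosstem's internal representation: the `DERIVED_FROM` layout and the `path_to_root` helper are hypothetical, the affix labels `-ed`, `re-`, and `-al` are assumed for illustration, and each word is given a single parent for simplicity.

```python
# Node = (word, POS). Each entry maps a derived word to the node it was
# derived from, plus the affix on that edge (some affixes are assumed).
DERIVED_FROM = {
    ("organization", "N"): (("organize", "V"), "-ation"),
    ("organizer", "N"): (("organize", "V"), "-er"),
    ("organized", "ADJ"): (("organize", "V"), "-ed"),
    ("reorganize", "V"): (("organize", "V"), "re-"),
    ("organizational", "ADJ"): (("organization", "N"), "-al"),
}


def path_to_root(node):
    """Follow DERIVED_FROM edges upward until a node has no parent."""
    path = [node]
    seen = {node}
    while node in DERIVED_FROM:
        node, _affix = DERIVED_FROM[node]
        if node in seen:  # cycle guard
            break
        seen.add(node)
        path.append(node)
    return path


for word, pos in path_to_root(("organizational", "ADJ")):
    print(f"{word} ({pos})")
```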
2. Inflectional Morphology (Grammatical Forms)
Inflection expresses grammatical categories without changing core meaning:
- Number: `cat` → `cats`
- Tense: `run` → `ran`, `running`
- Comparison: `good` → `better`, `best`
Crosstem stores inflectional paradigms as lemma → forms mappings:
{
  "run": {
    "pos": "V",
    "forms": {
      "runs": [{"pos": "V", "features": "PRS;3;SG"}],
      "running": [{"pos": "V", "features": "V.PTCP;PRS"}],
      "ran": [{"pos": "V", "features": "PST"}]
    }
  }
}
This enables both lemmatization (running → run) and paradigm generation (run → all forms).
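One way to get both directions from a lemma → forms mapping is to build a reverse index once at load time. The sketch below uses the `run` paradigm shown above; `build_form_index`, `lemma_of`, and `forms_of` are hypothetical helper names and are not part of the Crosstem API.

```python
import json

PARADIGMS = json.loads("""
{
  "run": {
    "pos": "V",
    "forms": {
      "runs": [{"pos": "V", "features": "PRS;3;SG"}],
      "running": [{"pos": "V", "features": "V.PTCP;PRS"}],
      "ran": [{"pos": "V", "features": "PST"}]
    }
  }
}
""")


def build_form_index(paradigms):
    """Invert lemma -> forms into a form -> lemma lookup table."""
    index = {}
    for lemma, entry in paradigms.items():
        index.setdefault(lemma, lemma)  # a lemma maps to itself
        for form in entry["forms"]:
            index.setdefault(form, lemma)
    return index


FORM_TO_LEMMA = build_form_index(PARADIGMS)


def lemma_of(word):
    """Lemmatization: inflected form -> lemma (unknown words unchanged)."""
    return FORM_TO_LEMMA.get(word, word)


def forms_of(lemma):
    """Paradigm generation: lemma -> sorted list of inflected forms."""
    return sorted(PARADIGMS.get(lemma, {}).get("forms", {}))


print(lemma_of("running"))
print(forms_of("run"))
```

Using `setdefault` keeps the first lemma seen for an ambiguous form; a fuller implementation would likely store every candidate lemma instead.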
3. Cross-Lingual Etymology (Historical Linguistics)
Etymology traces how words evolve and transfer across languages:
- Borrowing: English `portmanteau` ← Middle French `portemanteau`
- Cognates: Dutch `woordenboek` ↔ German `Wörterbuch` (shared Germanic ancestor)
- Inheritance: Latin `mater` → French `mère`, Italian `madre`, Spanish `madre`
Crosstem represents this as a multilingual graph with typed edges:
English: "portmanteau" ──borrowed_from──→ Middle French: "portemanteau"
                                                  │
                                          has_root: "porter" (to carry)
                                                  │
                                          has_root: "manteau" (coat)
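A typed multilingual graph like this can be sketched as a dictionary keyed by (term, language), with each edge tagged by its relationship type. The structure and the helper names `trace_chain` and `roots_of` are illustrative assumptions, as is the language tag attached to the two root terms; Crosstem's stored format may differ.

```python
# (term, language) -> list of (edge_type, (term, language)).
# The language of the two roots is assumed for this example.
ETYMOLOGY_EDGES = {
    ("portmanteau", "English"): [
        ("borrowed_from", ("portemanteau", "Middle French")),
    ],
    ("portemanteau", "Middle French"): [
        ("has_root", ("porter", "Middle French")),
        ("has_root", ("manteau", "Middle French")),
    ],
}


def trace_chain(term, lang, follow=("borrowed_from", "inherited_from")):
    """Follow origin edges one hop at a time, stopping on a cycle."""
    current = (term, lang)
    chain = [current]
    seen = {current}
    while True:
        targets = [
            target
            for edge_type, target in ETYMOLOGY_EDGES.get(current, [])
            if edge_type in follow
        ]
        if not targets or targets[0] in seen:
            return chain
        current = targets[0]
        seen.add(current)
        chain.append(current)


def roots_of(term, lang):
    """Return the terms linked to a node by has_root edges."""
    edges = ETYMOLOGY_EDGES.get((term, lang), [])
    return [target[0] for edge_type, target in edges if edge_type == "has_root"]


print(trace_chain("portmanteau", "English"))
print(roots_of("portemanteau", "Middle French"))
```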
Implementation
All three frameworks are implemented as fast JSON lookups with graph traversal algorithms:
- Preprocessing: TSV/CSV data → optimized JSON dictionaries
- Indexing: Multi-level indices (word → derivations, lemma → inflections, term+lang → etymology)
- Traversal: BFS for word families, chain-following for etymology
- Filtering: POS preference (verbs), length minimization, cycle detection
Result: ~0.0008ms per-word stemming on benchmark runs with Rust acceleration enabled.
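As an illustration of the BFS step for word families, the sketch below treats derivational links as undirected and collects everything reachable from a starting word. The word pairs come from the `organize` example earlier in this document; `build_neighbors` and `word_family` are hypothetical helpers, not the library's `get_word_family` implementation.

```python
from collections import defaultdict, deque

# (derived word, base word) pairs from the organize example.
PAIRS = [
    ("organization", "organize"),
    ("organizer", "organize"),
    ("organized", "organize"),
    ("reorganize", "organize"),
    ("organizational", "organization"),
]


def build_neighbors(pairs):
    """Index each word to the words it is directly linked to."""
    neighbors = defaultdict(set)
    for derived, base in pairs:
        neighbors[derived].add(base)
        neighbors[base].add(derived)
    return neighbors


def word_family(word, neighbors):
    """Breadth-first search; the seen set doubles as cycle detection."""
    seen = {word}
    queue = deque([word])
    while queue:
        current = queue.popleft()
        for other in neighbors.get(current, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return sorted(seen)


NEIGHBORS = build_neighbors(PAIRS)
print(word_family("organizational", NEIGHBORS))
```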
Stemming Algorithm Details
Multi-Hop BFS Traversal
Crosstem uses breadth-first search over the derivation graph to find the best-scoring root:
# Example: organizational → organization → organize
organizational (14 chars, ADJ)
↓ (depth 1)
organization (12 chars, N, productivity=16) ← candidate
↓ (depth 2)
organize (8 chars, V, productivity=13) ← BEST (verb, shortest)
Algorithm steps:
- Start from the input word and add it to the queue
- Expand all DERIVED_FROM relationships (parents in the morphology graph)
- Score each candidate:
  - Length (shorter is better)
  - POS (verbs score -10, nouns -5)
  - Depth (penalize by +2 per hop)
- Filter by productivity threshold (language-specific):
  - English: Verbs ≥5, Others ≥9
  - French/Italian: Verbs ≥4, Others ≥5
  - German: Verbs ≥4, Others ≥3 (compound-heavy)
  - Spanish/Portuguese: Verbs ≥3, Others ≥4
  - Russian/Slavic: Verbs ≥3, Others ≥2-3 (lower productivity)
- Continue traversal through low-productivity nodes (enables multi-hop)
- Return the lowest-scoring candidate that is shorter than the input or is a verb
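The steps above can be sketched on a toy graph as follows. This is a simplified reading of the description, not the library's code: it assumes the three scoring terms are simply summed, uses the English thresholds, and takes its productivity counts from the examples in this document. `PARENTS`, `INFO`, and this local `stem` function are hypothetical.

```python
from collections import deque

# word -> parents it is DERIVED_FROM (toy data)
PARENTS = {
    "organizational": ["organization"],
    "organization": ["organize"],
    "destruction": ["destruct"],
    "democracy": ["democrat"],
}
# candidate root -> (POS, productivity), values quoted from this document
INFO = {
    "organization": ("N", 16),
    "organize": ("V", 13),
    "destruct": ("V", 6),
    "democrat": ("N", 7),
}
POS_BONUS = {"V": -10, "N": -5}
VERB_THRESHOLD, OTHER_THRESHOLD = 5, 9  # English thresholds


def stem(word):
    """Return the lowest-scoring acceptable ancestor, else the word itself."""
    best, best_score = word, None
    seen = {word}
    queue = deque([(word, 0)])
    while queue:
        current, depth = queue.popleft()
        for parent in PARENTS.get(current, []):
            if parent in seen:
                continue
            seen.add(parent)
            hop = depth + 1
            # Keep traversing even if this node is filtered out below.
            queue.append((parent, hop))
            pos, productivity = INFO.get(parent, (None, 0))
            threshold = VERB_THRESHOLD if pos == "V" else OTHER_THRESHOLD
            if productivity < threshold:
                continue
            if not (len(parent) < len(word) or pos == "V"):
                continue
            # Assumed combination: length + POS bonus + 2 per hop.
            score = len(parent) + POS_BONUS.get(pos, 0) + 2 * hop
            if best_score is None or score < best_score:
                best, best_score = parent, score
    return best


for w in ("organizational", "destruction", "democracy"):
    print(w, "->", stem(w))
```

Under these assumptions `organize` scores 8 - 10 + 4 = 2 against 12 - 5 + 2 = 9 for `organization`, so the verb wins, while `democrat` (7 derivations, noun) falls below the threshold and `democracy` is returned unchanged.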
Productivity-Based Filtering
Problem: MorphyNet contains archaic roots (e.g., hap) and data errors (e.g., democracy → democrat)
Solution: Use productivity as a quality signal. Words with many derivations are more likely to be modern, correct roots.
Examples (English thresholds: V≥5, N≥9):
| Word | Productivity | POS | Threshold | Result |
|---|---|---|---|---|
| red | 18 derivations | N | ≥9 | ✅ PASS (productive noun) |
| run | 33 derivations | V | ≥5 | ✅ PASS (very productive verb) |
| destruct | 6 derivations | V | ≥5 | ✅ PASS (verb threshold) |
| hap | 8 derivations | N | ≥9 | ❌ FILTERED (archaic) |
| democrat | 7 derivations | N | ≥9 | ❌ FILTERED (data error) |
This data-driven approach avoids hard-coded rules while maintaining quality.
Language-Specific Calibration: Thresholds are adjusted for each language based on morphological richness. Languages with lower overall productivity (Russian, Spanish) use lower thresholds to avoid over-filtering, while English uses higher thresholds due to rich derivational data.
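Per-language calibration amounts to a small lookup table. The sketch below encodes the thresholds listed earlier for the languages where a single value is given (the Russian/Slavic range is left out because it is not a single number); the `THRESHOLDS` table and the `passes_productivity` helper are hypothetical and shown only to illustrate the idea.

```python
# language code -> (verb threshold, threshold for all other POS)
THRESHOLDS = {
    "eng": (5, 9),
    "fra": (4, 5),
    "ita": (4, 5),
    "deu": (4, 3),
    "spa": (3, 4),
    "por": (3, 4),
}


def passes_productivity(lang, pos, derivation_count):
    """True if a candidate root is productive enough to be kept."""
    verb_min, other_min = THRESHOLDS[lang]
    minimum = verb_min if pos == "V" else other_min
    return derivation_count >= minimum


# The English examples from the table above.
EXAMPLES = [
    ("red", "N", 18),
    ("run", "V", 33),
    ("destruct", "V", 6),
    ("hap", "N", 8),
    ("democrat", "N", 7),
]
for word, pos, count in EXAMPLES:
    verdict = "PASS" if passes_productivity("eng", pos, count) else "FILTERED"
    print(f"{word}: {verdict}")
```

The same count can pass in one language and fail in another: a noun with 8 derivations is filtered under the English threshold (≥9) but kept under the German one (≥3).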
Why This Works
Traditional stemmers fail because they use brittle suffix rules:
# Porter stemmer: -ization → -ize, then the -ize suffix is stripped as well
"organization" → "organ" # Lost the "ize" (overstemming)
# Lancaster stemmer: even more aggressive
"organization" → "org" # Completely loses meaning
Crosstem succeeds because it uses linguistic knowledge:
# Graph data knows: organization DERIVED_FROM organize
"organization" → "organize" # Preserves semantic relationship
Data Sources
- MorphyNet v1.0: Derivational and inflectional morphology (CC BY-SA 4.0)
- Wiktionary: Cross-lingual etymology data (CC BY-SA 3.0)
Citation
If you use this library in your research, please cite:
@software{crosstem2025,
title={Crosstem: Comprehensive Morphological Analysis for Python},
author={Avinash Bhojanapalli},
year={2025},
url={https://github.com/droidmaximus/crosstem},
note={A Python package for derivational stemming, inflectional analysis, and cross-lingual etymology}
}
@inproceedings{batsuren2021morphynet,
title={MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology},
author={Batsuren, Khuyagbaatar and Bella, Gábor and Giunchiglia, Fausto},
booktitle={Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology},
pages={39--48},
year={2021}
}
@misc{wiktionary2025,
title={Wiktionary, The Free Dictionary},
author={{Wiktionary contributors}},
year={2025},
url={https://en.wiktionary.org/},
note={Etymology data extracted from Wiktionary dumps}
}
License
- Code: MIT License
- Data: CC BY-SA 4.0 (MorphyNet), CC BY-SA 3.0 (Wiktionary)