Comprehensive morphological analysis: derivational stemming, inflectional analysis, and cross-lingual etymology

Crosstem

A comprehensive Python package for morphological analysis combining derivational stemming, inflectional analysis, and cross-lingual etymology.

What's New in 1.0

Rust-accelerated derivational stemming backend via PyO3
Automatic fallback to pure-Python derivational logic when Rust extension is unavailable
Backend parity coverage for stem, get_derivations, and get_word_family
Updated benchmark harness with active-backend vs Python-fallback comparisons
Production-stable 1.0.0 packaging metadata

Why Crosstem?

Crosstem finds true linguistic roots across part-of-speech boundaries, which rule-based stemmers and inflection-only lemmatizers cannot do.

What Makes It Different

# Traditional stemmers (Porter, Lancaster) - Rule-based, prone to errors
Porter: "organization" → "organ"        # Overstemming loses meaning

# Lemmatizers (WordNet, spaCy) - Only handle inflections, not derivations  
WordNet: "organization" → "organization"  # Can't cross POS boundaries
WordNet: "beautiful" → "beautiful"        # Stuck at adjective form

# Crosstem - Linguistically accurate, crosses POS boundaries
Crosstem: "organization" → "organize"   # Noun → Verb (true root)
Crosstem: "beautiful" → "beauty"        # Adjective → Noun (semantic base)

Key Advantages

Cross-POS derivational stemming: Only library that finds roots across parts of speech
Linguistic accuracy: Uses MorphyNet morphological data, not brittle rules
Etymology tracing: 4.2M relationships across 2,265 languages (unique feature)
Word families: Discover complete derivational networks (e.g., organize → 43 related words)
Fast hybrid runtime: Rust-accelerated derivational engine with automatic pure-Python fallback
15 languages: Multilingual morphology support out of the box

Performance Benchmark vs Porter

We compared Crosstem against the widely-used Porter stemmer on 44 English words with 1,000 iterations each.

Speed Results

Crosstem:     ~0.036s (~1,217,000 words/sec)
Porter:       ~0.490s (~90,000 words/sec)

⚡ Crosstem is ~13× FASTER than Porter

Why? Crosstem resolves each word with O(1) hash lookups in dictionaries loaded from JSON, while Porter applies a sequence of pattern-matching rules to every word.

Note: Results averaged over multiple runs; ±3% variance is normal due to system load.
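The throughput figures follow from the benchmark size: 44 words × 1,000 iterations is 44,000 stem calls per run, divided by the elapsed time. Below is a minimal sketch of such a harness. It is not Crosstem's actual benchmark code; `toy_stem`, `TOY_ROOTS`, and `WORDS` are placeholders standing in for a real stemmer and word list.

```python
import time

def words_per_second(n_words, iterations, elapsed_seconds):
    """Throughput when n_words are stemmed `iterations` times in elapsed_seconds."""
    return n_words * iterations / elapsed_seconds

def time_stemmer(stem, words, iterations=1000):
    """Total wall-clock seconds to stem every word `iterations` times."""
    start = time.perf_counter()
    for _ in range(iterations):
        for word in words:
            stem(word)
    return time.perf_counter() - start

# Placeholder stemmer: a plain dict lookup standing in for a real backend.
TOY_ROOTS = {"organization": "organize", "beautiful": "beauty"}

def toy_stem(word):
    return TOY_ROOTS.get(word, word)

WORDS = ["organization", "beautiful", "redness"]
elapsed = time_stemmer(toy_stem, WORDS, iterations=100)
print(f"{words_per_second(len(WORDS), 100, elapsed):,.0f} words/sec")

# Checking the reported Porter figure: 44,000 calls in ~0.490 s
print(round(words_per_second(44, 1000, 0.490)))  # 89796, i.e. ~90,000 words/sec
```

The same arithmetic gives the speed-up: 0.490 s / 0.036 s ≈ 13.6, which the section rounds to ~13×.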

Accuracy Comparison

Word	Crosstem	Porter	Winner
organization	organize	organ	✅ Crosstem (finds true root)
organizational	organize	organiz	✅ Crosstem (multi-hop)
beautiful	beauty	beauti	✅ Crosstem (crosses POS)
destruction	destruct	destruct	⚖️ Tie
democracy	democracy	democraci	✅ Crosstem (avoids error)
computerization	compute	computer	✅ Crosstem (deeper root)
happiness	happy	happi	✅ Crosstem (productivity filter avoids "hap")
redness	red	red	⚖️ Tie

Key Findings:

Cross-POS stemming: Crosstem finds roots across parts of speech (organization → organize, a verb); Porter cannot
Overstemming prevention: Porter creates non-words (beauti, organiz); Crosstem returns real words
Data quality: Crosstem filters bad roots (democrat); Porter has no quality control
Multi-hop: Crosstem follows derivation chains (organizational → organization → organize); Porter strips suffixes by rule, with no notion of which word derives from which

When to Use Each

Choose Crosstem when:

✅ Need linguistically accurate roots
✅ Working with derivational families (organize/organizer/organization)
✅ Building semantic search, clustering, or word embeddings
✅ Quality matters more than simplicity
✅ Multilingual support needed (15 languages)

Choose Porter when:

✅ Legacy system compatibility required
✅ Working with noisy/misspelled text (rule-based is robust)
✅ Only need basic suffix normalization
✅ Want the absolute simplest possible solution

Note: In the benchmark above, Crosstem was both faster than Porter and more accurate on the tested words, making it the better choice for most modern NLP applications.

Features

Derivational Stemming: Find roots across part-of-speech boundaries (organization → organize)
Inflectional Analysis: Lemmatization and grammatical forms (running → run)
Cross-Lingual Etymology: Trace word origins across 2,265 languages
Word Family Analysis: Discover complete derivational networks

Installation

pip install crosstem

Crosstem now supports an accelerated Rust derivational backend (PyO3). When a prebuilt wheel includes the extension, it is used automatically. If the Rust extension is unavailable, Crosstem falls back to the pure-Python derivational implementation.
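The fallback behaviour described here is commonly implemented as a guarded import. The sketch below shows the general pattern only; it is not Crosstem's internal code, and the module name `hypothetical_crosstem_rust_ext` and the `pure_python_stem` function are invented for illustration.

```python
import importlib

def pure_python_stem(word):
    """Stand-in for a pure-Python implementation (placeholder logic)."""
    return word.lower()

def select_backend(extension_module, use_rust_backend=True):
    """Return (stem_function, backend_name), preferring the compiled extension."""
    if use_rust_backend:
        try:
            module = importlib.import_module(extension_module)
            return module.stem, "rust"
        except (ImportError, AttributeError):
            pass  # extension missing or incomplete: fall through to Python
    return pure_python_stem, "python"

# The module name below is hypothetical, so the import fails and we fall back.
stem, backend = select_backend("hypothetical_crosstem_rust_ext")
print(backend)  # python
```

Catching the failure at selection time keeps callers unaware of which backend they are using, which is what makes parity tests between the two paths practical.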

Build From Source With Rust Backend

pip install maturin
maturin develop --manifest-path rust/Cargo.toml

To force the pure-Python path for comparison/debugging:

from crosstem import DerivationalStemmer
stemmer = DerivationalStemmer("eng", use_rust_backend=False)

Optional: Etymology Data

Etymology features require additional data (~1 GB) that's downloaded separately:

from crosstem import download_etymology

# One-time download (saves to package data directory)
download_etymology()

Or from command line:

python -m crosstem.download

Quick Start

Basic Morphological Analysis

from crosstem import MorphologyAnalyzer

# Works immediately - no etymology needed
analyzer = MorphologyAnalyzer('eng', load_etymology=False)
result = analyzer.analyze('organizations')

print(result['derivational_stem'])    # 'organize'
print(result['inflectional_lemma'])   # 'organization'

With Etymology Features

from crosstem import MorphologyAnalyzer, download_etymology

# Download etymology data first (one-time)
if not MorphologyAnalyzer.is_etymology_available():
    download_etymology()

# Now etymology features work
analyzer = MorphologyAnalyzer('eng', load_etymology=True)
result = analyzer.analyze('portmanteau')
print(result['etymology'])  # Shows Middle French origin

Usage

Derivational Stemming

from crosstem import DerivationalStemmer

stemmer = DerivationalStemmer('eng')
stemmer.stem('organization')  # 'organize'
stemmer.stem('beautiful')     # 'beauty'

family = stemmer.get_word_family('organize')
# ['organize', 'organizer', 'organization', ...]

Inflectional Analysis

from crosstem import InflectionAnalyzer

inflector = InflectionAnalyzer('eng')
inflector.get_lemma('running')  # 'run'

forms = inflector.get_inflections('run')
# [{'form': 'runs', 'pos': 'V', ...}, ...]

Etymology (Requires Download)

from crosstem import EtymologyLinker, download_etymology

# Download etymology data first (one-time, ~1 GB)
download_etymology()

linker = EtymologyLinker()
chain = linker.trace_origin_chain('portmanteau', 'English')
# [{'term': 'portmanteau', 'lang': 'English'},
#  {'term': 'portemanteau', 'lang': 'Middle French', ...}]

Supported Languages

15 languages with derivational morphology:

English (eng), Russian (rus), French (fra), German (deu), Spanish (spa)
Portuguese (por), Italian (ita), Polish (pol), Czech (ces)
Serbo-Croatian (hbs), Hungarian (hun), Finnish (fin)
Swedish (swe), Mongolian (mon), Catalan (cat)

Plus 2,265 languages with etymology data.

How It Works

Theoretical Framework

Crosstem is built on three pillars of morphological linguistics:

1. Derivational Morphology (Word Formation)

Unlike inflection (which modifies words grammatically), derivation creates new words by adding affixes or converting between parts of speech:

organize + -ation → organization (verb → noun)
beauty + -ful → beautiful (noun → adjective)
organize + -er → organizer (agent noun)

Crosstem models this as a directed graph where:

Nodes = word forms with POS tags
Edges = derivational relationships (affixes, conversions)
Stemming = graph traversal to find the root (preferring verbs and shorter forms)

         ┌─────────────┐
         │  organize   │ ← ROOT (verb, shortest)
         │     (V)     │
         └──────┬──────┘
           ┌────┴────┬───────┬──────────┐
           ▼         ▼       ▼          ▼
      organizer  organization  organized  reorganize
         (N)        (N)         (ADJ)      (V)
                     │
                     ▼
             organizational
                  (ADJ)

This graph-based approach ensures linguistically accurate roots, avoiding the overstemming problem of rule-based stemmers.
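As a rough illustration of the graph model, the diagram above can be written as a list of (child, parent) edges and walked breadth-first to collect a word family. This is a toy sketch over the six words in the diagram, not the package's data structures; `build_children` and `word_family` are names chosen for this example.

```python
from collections import deque

# (child, parent) pairs: child DERIVED_FROM parent, taken from the diagram above.
DERIVED_FROM = [
    ("organizer", "organize"),
    ("organization", "organize"),
    ("organized", "organize"),
    ("reorganize", "organize"),
    ("organizational", "organization"),
]

def build_children(edges):
    """Index the edges as parent -> list of directly derived words."""
    children = {}
    for child, parent in edges:
        children.setdefault(parent, []).append(child)
    return children

def word_family(root, children):
    """Breadth-first walk from root; returns words in visit order."""
    order = [root]
    seen = {root}
    queue = deque([root])
    while queue:
        word = queue.popleft()
        for child in children.get(word, []):
            if child not in seen:  # guards against cycles in noisy data
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

print(word_family("organize", build_children(DERIVED_FROM)))
# ['organize', 'organizer', 'organization', 'organized', 'reorganize', 'organizational']
```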

2. Inflectional Morphology (Grammatical Forms)

Inflection expresses grammatical categories without changing core meaning:

Number: cat → cats
Tense: run → ran, running
Comparison: good → better, best

Crosstem stores inflectional paradigms as lemma → forms mappings:

{
  "run": {
    "pos": "V",
    "forms": {
      "runs": [{"pos": "V", "features": "PRS;3;SG"}],
      "running": [{"pos": "V", "features": "V.PTCP;PRS"}],
      "ran": [{"pos": "V", "features": "PST"}]
    }
  }
}

This enables both lemmatization (running → run) and paradigm generation (run → all forms).
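A small sketch of how both directions can be served from one paradigm table: paradigm generation reads the `forms` keys directly, while lemmatization uses a reverse index built once. The `PARADIGMS` data is the example above; `build_form_index`, `lemmatize`, and `paradigm` are illustrative names, not the package's API.

```python
PARADIGMS = {
    "run": {
        "pos": "V",
        "forms": {
            "runs": [{"pos": "V", "features": "PRS;3;SG"}],
            "running": [{"pos": "V", "features": "V.PTCP;PRS"}],
            "ran": [{"pos": "V", "features": "PST"}],
        },
    }
}

def build_form_index(paradigms):
    """Reverse index: inflected form -> lemma (the lemma maps to itself)."""
    index = {}
    for lemma, entry in paradigms.items():
        index.setdefault(lemma, lemma)
        for form in entry["forms"]:
            index.setdefault(form, lemma)
    return index

FORM_TO_LEMMA = build_form_index(PARADIGMS)

def lemmatize(word):
    """Return the lemma for a known form, or the word unchanged."""
    return FORM_TO_LEMMA.get(word, word)

def paradigm(lemma):
    """Return all stored forms of a lemma, sorted alphabetically."""
    return sorted(PARADIGMS.get(lemma, {}).get("forms", {}))

print(lemmatize("running"))  # run
print(paradigm("run"))       # ['ran', 'running', 'runs']
```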

3. Cross-Lingual Etymology (Historical Linguistics)

Etymology traces how words evolve and transfer across languages:

Borrowing: English portmanteau ← Middle French portemanteau
Cognates: Dutch woordenboek ↔ German Wörterbuch (shared Germanic ancestor)
Inheritance: Latin mater → French mère, Italian madre, Spanish madre

Crosstem represents this as a multilingual graph with typed edges:

English: "portmanteau" ──borrowed_from──→ Middle French: "portemanteau"
                                               │
                                        has_root: "porter" (to carry)
                                               │
                                        has_root: "manteau" (coat)
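A minimal sketch of chain-following over such a typed-edge graph, using only the portmanteau example. The language assigned to the two roots and the `inherited_from` edge type are assumptions made for this illustration, and `trace_chain` is not the package's `trace_origin_chain`.

```python
# Nodes are (term, language) pairs; edges are (relation, target_node).
ETYMOLOGY = {
    ("portmanteau", "English"): [
        ("borrowed_from", ("portemanteau", "Middle French")),
    ],
    ("portemanteau", "Middle French"): [
        # Root languages are assumed here; the source diagram does not state them.
        ("has_root", ("porter", "Middle French")),
        ("has_root", ("manteau", "Middle French")),
    ],
}

# "inherited_from" is a hypothetical edge type for inheritance links.
ANCESTOR_RELATIONS = ("borrowed_from", "inherited_from")

def trace_chain(term, lang, graph, follow=ANCESTOR_RELATIONS):
    """Follow the first ancestor edge from each node until none remain."""
    node = (term, lang)
    chain = [node]
    seen = {node}
    while True:
        targets = [t for relation, t in graph.get(node, []) if relation in follow]
        if not targets or targets[0] in seen:  # dead end, or a cycle in the data
            return chain
        node = targets[0]
        seen.add(node)
        chain.append(node)

print(trace_chain("portmanteau", "English", ETYMOLOGY))
# [('portmanteau', 'English'), ('portemanteau', 'Middle French')]
```

Keeping `has_root` out of the followed relations means compound parts are reported as components of a word rather than as its ancestors.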

Implementation

All three frameworks are implemented as dictionary lookups over preprocessed JSON data, combined with graph traversal:

Preprocessing: TSV/CSV data → optimized JSON dictionaries
Indexing: Multi-level indices (word → derivations, lemma → inflections, term+lang → etymology)
Traversal: BFS for word families, chain-following for etymology
Filtering: POS preference (verbs), length minimization, cycle detection

Result: ~0.0008 ms (about 0.8 µs) per word for stemming on benchmark runs with Rust acceleration enabled.
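To make the preprocessing and indexing steps concrete, here is a sketch that turns tab-separated derivation rows into a word → derivations index and serializes it as JSON. The six-column layout (source, target, source POS, target POS, affix, type) is an assumption about the input files, and the three sample rows are written by hand from examples in this document.

```python
import csv
import io
import json

# Assumed column order: source, target, source POS, target POS, affix, type.
SAMPLE_TSV = (
    "organize\torganization\tV\tN\tation\tsuffix\n"
    "organize\torganizer\tV\tN\ter\tsuffix\n"
    "beauty\tbeautiful\tN\tADJ\tful\tsuffix\n"
)

def tsv_to_index(tsv_text):
    """Build derived word -> list of parent records for O(1) lookup."""
    index = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for source, target, source_pos, target_pos, affix, kind in reader:
        index.setdefault(target, []).append({
            "parent": source,
            "parent_pos": source_pos,
            "pos": target_pos,
            "affix": affix,
            "type": kind,
        })
    return index

index = tsv_to_index(SAMPLE_TSV)
serialized = json.dumps(index, ensure_ascii=False)  # what would be written to disk
print(index["organization"][0]["parent"])  # organize
```

Doing this conversion once, ahead of time, is what lets the runtime path be a single dictionary lookup per hop.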

Stemming Algorithm Details

Multi-Hop BFS Traversal

Crosstem uses a scored breadth-first search to find the best root:

# Example: organizational → organization → organize
organizational  (14 chars, ADJ)
    ↓ (depth 1)
organization    (12 chars, N, productivity=16) ← candidate
    ↓ (depth 2)  
organize        (8 chars, V, productivity=13)  ← BEST (verb, shortest)

Algorithm steps:

1. Start from the input word and add it to the queue
2. Expand all DERIVED_FROM relationships (parents in the morphology graph)
3. Score each candidate (lower is better):
   - Length (shorter is better)
   - POS (verbs score -10, nouns -5)
   - Depth (penalize by +2 per hop)
4. Filter by productivity threshold (language-specific):
   - English: Verbs ≥5, Others ≥9
   - French/Italian: Verbs ≥4, Others ≥5
   - German: Verbs ≥4, Others ≥3 (compound-heavy)
   - Spanish/Portuguese: Verbs ≥3, Others ≥4
   - Russian/Slavic: Verbs ≥3, Others ≥2-3 (lower productivity)
5. Continue traversal through low-productivity nodes (enables multi-hop)
6. Return the lowest-scoring candidate that is shorter than the input or is a verb
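The steps above can be sketched as follows. This is one reading of the description under stated assumptions, not the package's implementation: the score is taken to be length + POS bonus + 2 × depth, only the English thresholds are used, and the toy graph contains just the words and productivity counts quoted in this document.

```python
from collections import deque

# Toy graph: word -> parents (DERIVED_FROM edges).
PARENTS = {
    "organizational": ["organization"],
    "organization": ["organize"],
    "democracy": ["democrat"],
}
POS = {"organizational": "ADJ", "organization": "N", "organize": "V",
       "democracy": "N", "democrat": "N"}
PRODUCTIVITY = {"organization": 16, "organize": 13, "democrat": 7}

POS_BONUS = {"V": -10, "N": -5}
DEPTH_PENALTY = 2
VERB_MIN, OTHER_MIN = 5, 9  # English thresholds

def score(word, depth):
    """Lower is better: length plus POS bonus plus depth penalty."""
    return len(word) + POS_BONUS.get(POS.get(word), 0) + DEPTH_PENALTY * depth

def is_productive(word):
    minimum = VERB_MIN if POS.get(word) == "V" else OTHER_MIN
    return PRODUCTIVITY.get(word, 0) >= minimum

def find_root(word):
    """Breadth-first search over parents; returns the input if nothing qualifies."""
    queue = deque([(word, 0)])
    seen = {word}
    candidates = []
    while queue:
        current, depth = queue.popleft()
        for parent in PARENTS.get(current, []):
            if parent in seen:
                continue
            seen.add(parent)
            queue.append((parent, depth + 1))  # keep walking even if filtered out
            acceptable = len(parent) < len(word) or POS.get(parent) == "V"
            if is_productive(parent) and acceptable:
                candidates.append((score(parent, depth + 1), parent))
    return min(candidates)[1] if candidates else word

print(find_root("organizational"))  # organize (score 2 beats organization's 9)
print(find_root("democracy"))       # democracy (democrat is filtered: 7 < 9)
```

With these weights, organization scores 12 - 5 + 2 = 9 at depth 1 and organize scores 8 - 10 + 4 = 2 at depth 2, so the verb wins despite the extra hop.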

Productivity-Based Filtering

Problem: MorphyNet contains archaic roots (e.g., hap) and data errors (e.g., democracy → democrat)

Solution: Use productivity as a quality signal. Words with many derivations are more likely to be modern, correct roots.

Examples (English thresholds: V≥5, N≥9):

Word	Productivity	POS	Threshold	Result
`red`	18 derivations	N	≥9	✅ PASS (productive noun)
`run`	33 derivations	V	≥5	✅ PASS (very productive verb)
`destruct`	6 derivations	V	≥5	✅ PASS (verb threshold)
`hap`	8 derivations	N	≥9	❌ FILTERED (archaic)
`democrat`	7 derivations	N	≥9	❌ FILTERED (data error)

This data-driven approach avoids hard-coded rules while maintaining quality.

Language-Specific Calibration: Thresholds are adjusted for each language based on morphological richness. Languages with lower overall productivity (Russian, Spanish) use lower thresholds to avoid over-filtering, while English uses higher thresholds due to rich derivational data.
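The threshold check itself is a small table lookup. The sketch below encodes the thresholds listed earlier for the languages where both values are given as single numbers (the Russian/Slavic "Others" value is a range, so it is left out) and replays the English examples from the table. The `THRESHOLDS` layout and the `passes` function are illustrative, not the package's internals.

```python
# language code -> (minimum for verbs, minimum for all other POS)
THRESHOLDS = {
    "eng": (5, 9),
    "fra": (4, 5),
    "ita": (4, 5),
    "deu": (4, 3),
    "spa": (3, 4),
    "por": (3, 4),
}

def passes(lang, pos, derivation_count):
    """True if a candidate root is productive enough for the given language."""
    verb_min, other_min = THRESHOLDS[lang]
    return derivation_count >= (verb_min if pos == "V" else other_min)

# (word, derivations, POS) rows from the English examples table.
ROWS = [("red", 18, "N"), ("run", 33, "V"), ("destruct", 6, "V"),
        ("hap", 8, "N"), ("democrat", 7, "N")]

for word, count, pos in ROWS:
    print(f"{word}: {'PASS' if passes('eng', pos, count) else 'FILTERED'}")
# red: PASS, run: PASS, destruct: PASS, hap: FILTERED, democrat: FILTERED
```

Note that a count of 8 for a noun fails the English threshold but would pass the French one (≥5), which is why the thresholds are calibrated per language rather than shared.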

Why This Works

Traditional stemmers fail because they use brittle suffix rules:

# Porter stemmer: suffix rules applied in sequence (-ization → -ize, then -ize removed)
"organization" → "organ"  # Lost the "ize" (overstemming)

# Lancaster stemmer: even more aggressive
"organization" → "org"    # Completely loses meaning

Crosstem succeeds because it uses linguistic knowledge:

# Graph data knows: organization DERIVED_FROM organize
"organization" → "organize"  # Preserves semantic relationship

Data Sources

MorphyNet v1.0: Derivational and inflectional morphology (CC BY-SA 4.0)
Wiktionary: Cross-lingual etymology data (CC BY-SA 3.0)

Citation

If you use this library in your research, please cite:

@software{crosstem2025,
  title={Crosstem: Comprehensive Morphological Analysis for Python},
  author={Avinash Bhojanapalli},
  year={2025},
  url={https://github.com/droidmaximus/crosstem},
  note={A Python package for derivational stemming, inflectional analysis, and cross-lingual etymology}
}

@inproceedings{batsuren2021morphynet,
  title={MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology},
  author={Batsuren, Khuyagbaatar and Bella, Gábor and Giunchiglia, Fausto},
  booktitle={Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology},
  pages={39--48},
  year={2021}
}

@misc{wiktionary2025,
  title={Wiktionary, The Free Dictionary},
  author={{Wiktionary contributors}},
  year={2025},
  url={https://en.wiktionary.org/},
  note={Etymology data extracted from Wiktionary dumps}
}

License

Code: MIT License
Data: CC BY-SA 4.0 (MorphyNet), CC BY-SA 3.0 (Wiktionary)

Release history

1.0.0 (this version): Mar 23, 2026
0.2.1: Nov 18, 2025
0.2.0: Nov 18, 2025

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

crosstem-1.0.0-cp312-cp312-win_amd64.whl (20.6 MB view details)

Uploaded Mar 23, 2026; CPython 3.12, Windows x86-64

File details

Details for the file crosstem-1.0.0-cp312-cp312-win_amd64.whl.

File metadata

Download URL: crosstem-1.0.0-cp312-cp312-win_amd64.whl
Upload date: Mar 23, 2026
Size: 20.6 MB
Tags: CPython 3.12, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for crosstem-1.0.0-cp312-cp312-win_amd64.whl
Algorithm	Hash digest
SHA256	`99dd7813151de17084ad7bcd50ad880042d23dc8905b52c68165c0ea77423645`
MD5	`f17f383e0b51e66d01826a917b6ac7f0`
BLAKE2b-256	`a16b4bec447174fccef6db6728b711b18c38040099a3d9f570ddc68253892f43`
