Skip to main content

Comprehensive morphological analysis: derivational stemming, inflectional analysis, and cross-lingual etymology

Project description

Crosstem

PyPI version Documentation Status Python 3.7+ License: MIT

A comprehensive Python package for morphological analysis combining derivational stemming, inflectional analysis, and cross-lingual etymology.

What's New in 1.0

  • Rust-accelerated derivational stemming backend via PyO3
  • Automatic fallback to pure-Python derivational logic when Rust extension is unavailable
  • Backend parity coverage for stem, get_derivations, and get_word_family
  • Updated benchmark harness with active-backend vs Python-fallback comparisons
  • Production-stable 1.0.0 packaging metadata

Why Crosstem?

Crosstem finds true linguistic roots across part-of-speech boundaries, which is something traditional stemmers and lemmatizers cannot do.

What Makes It Different

# Traditional stemmers (Porter, Lancaster) - Rule-based, prone to errors
Porter: "organization"  "organ"        # Overstemming loses meaning

# Lemmatizers (WordNet, spaCy) - Only handle inflections, not derivations  
WordNet: "organization"  "organization"  # Can't cross POS boundaries
WordNet: "beautiful"  "beautiful"        # Stuck at adjective form

# Crosstem - Linguistically accurate, crosses POS boundaries
Crosstem: "organization"  "organize"   # Noun → Verb (true root)
Crosstem: "beautiful"  "beauty"        # Adjective → Noun (semantic base)

Key Advantages

  • Cross-POS derivational stemming: Only library that finds roots across parts of speech
  • Linguistic accuracy: Uses MorphyNet morphological data, not brittle rules
  • Etymology tracing: 4.2M relationships across 2,265 languages (unique feature)
  • Word families: Discover complete derivational networks (e.g., organize → 43 related words)
  • Fast hybrid runtime: Rust-accelerated derivational engine with automatic pure-Python fallback
  • 15 languages: Multilingual morphology support out of the box

Performance Benchmark vs Porter

We compared Crosstem against the widely-used Porter stemmer on 44 English words with 1,000 iterations each.

Speed Results

Crosstem:     ~0.036s (~1,217,000 words/sec)
Porter:       ~0.490s (~90,000 words/sec)

⚡ Crosstem is ~13× FASTER than Porter

Why? Crosstem uses O(1) hash lookups in JSON dictionaries, while Porter applies sequential pattern-matching rules.

Note: Results averaged over multiple runs; ±3% variance is normal due to system load.

Accuracy Comparison

Word Crosstem Porter Winner
organization organize organ ✅ Crosstem (finds true root)
organizational organize organiz ✅ Crosstem (multi-hop)
beautiful beauty beauti ✅ Crosstem (crosses POS)
destruction destruct destruct ⚖️ Tie
democracy democracy democraci ✅ Crosstem (avoids error)
computerization compute computer ✅ Crosstem (deeper root)
happiness happy happi ✅ Crosstem (productivity filter avoids "hap")
redness red red ⚖️ Tie

Key Findings:

  1. Cross-POS stemming: Crosstem finds roots across parts of speech (organizationorganize, verb), Porter cannot
  2. Overstemming prevention: Porter creates non-words (beauti, organiz), Crosstem always produces real words
  3. Data quality: Crosstem filters bad roots (democrat), Porter has no quality control
  4. Multi-hop: Crosstem traverses multiple derivations (organizationalorganizationorganize), Porter only strips one suffix

When to Use Each

Choose Crosstem when:

  • ✅ Need linguistically accurate roots
  • ✅ Working with derivational families (organize/organizer/organization)
  • ✅ Building semantic search, clustering, or word embeddings
  • ✅ Quality matters more than simplicity
  • ✅ Multilingual support needed (15 languages)

Choose Porter when:

  • ✅ Legacy system compatibility required
  • ✅ Working with noisy/misspelled text (rule-based is robust)
  • ✅ Only need basic suffix normalization
  • ✅ Want the absolute simplest possible solution

Note: Crosstem is now faster than Porter while being more accurate, making it the better choice for most modern NLP applications.

Features

  • Derivational Stemming: Find roots across part-of-speech boundaries (organization → organize)
  • Inflectional Analysis: Lemmatization and grammatical forms (running → run)
  • Cross-Lingual Etymology: Trace word origins across 2,265 languages
  • Word Family Analysis: Discover complete derivational networks

Installation

pip install crosstem

crosstem now supports an accelerated Rust derivational backend (PyO3). When a prebuilt wheel includes the extension, it is used automatically. If the Rust extension is unavailable, Crosstem falls back to the pure-Python derivational implementation.

Build From Source With Rust Backend

pip install maturin
maturin develop --manifest-path rust/Cargo.toml

To force the pure-Python path for comparison/debugging:

from crosstem import DerivationalStemmer
stemmer = DerivationalStemmer("eng", use_rust_backend=False)

Optional: Etymology Data

Etymology features require additional data (~1 GB) that's downloaded separately:

from crosstem import download_etymology

# One-time download (saves to package data directory)
download_etymology()

Or from command line:

python -m crosstem.download

Quick Start

Basic Morphological Analysis

from crosstem import MorphologyAnalyzer

# Works immediately - no etymology needed
analyzer = MorphologyAnalyzer('eng', load_etymology=False)
result = analyzer.analyze('organizations')

print(result['derivational_stem'])    # 'organize'
print(result['inflectional_lemma'])   # 'organization'

With Etymology Features

from crosstem import MorphologyAnalyzer, download_etymology

# Download etymology data first (one-time)
if not MorphologyAnalyzer.is_etymology_available():
    download_etymology()

# Now etymology features work
analyzer = MorphologyAnalyzer('eng', load_etymology=True)
result = analyzer.analyze('portmanteau')
print(result['etymology'])  # Shows Middle French origin

Usage

Derivational Stemming

from crosstem import DerivationalStemmer

stemmer = DerivationalStemmer('eng')
stemmer.stem('organization')  # 'organize'
stemmer.stem('beautiful')     # 'beauty'

family = stemmer.get_word_family('organize')
# ['organize', 'organizer', 'organization', ...]

Inflectional Analysis

from crosstem import InflectionAnalyzer

inflector = InflectionAnalyzer('eng')
inflector.get_lemma('running')  # 'run'

forms = inflector.get_inflections('run')
# [{'form': 'runs', 'pos': 'V', ...}, ...]

Etymology (Requires Download)

from crosstem import EtymologyLinker, download_etymology

# Download etymology data first (one-time, ~1 GB)
download_etymology()

linker = EtymologyLinker()
chain = linker.trace_origin_chain('portmanteau', 'English')
# [{'term': 'portmanteau', 'lang': 'English'},
#  {'term': 'portemanteau', 'lang': 'Middle French', ...}]

Supported Languages

15 languages with derivational morphology:

  • English (eng), Russian (rus), French (fra), German (deu), Spanish (spa)
  • Portuguese (por), Italian (ita), Polish (pol), Czech (ces)
  • Serbo-Croatian (hbs), Hungarian (hun), Finnish (fin)
  • Swedish (swe), Mongolian (mon), Catalan (cat)

Plus 2,265 languages with etymology data.

How It Works

Theoretical Framework

Crosstem is built on three pillars of morphological linguistics:

1. Derivational Morphology (Word Formation)

Unlike inflection (which modifies words grammatically), derivation creates new words by adding affixes or converting between parts of speech:

  • organize + -ationorganization (verb → noun)
  • beauty + -fulbeautiful (noun → adjective)
  • organize + -erorganizer (agent noun)

Crosstem models this as a directed graph where:

  • Nodes = word forms with POS tags
  • Edges = derivational relationships (affixes, conversions)
  • Stemming = graph traversal to find the root (preferring verbs and shorter forms)
         ┌─────────────┐
         │  organize   │ ← ROOT (verb, shortest)
         │     (V)     │
         └──────┬──────┘
           ┌────┴────┬───────┬──────────┐
           ▼         ▼       ▼          ▼
      organizer  organization  organized  reorganize
         (N)        (N)         (ADJ)      (V)
                     │
                     ▼
             organizational
                  (ADJ)

This graph-based approach ensures linguistically accurate roots, avoiding the overstemming problem of rule-based stemmers.

2. Inflectional Morphology (Grammatical Forms)

Inflection expresses grammatical categories without changing core meaning:

  • Number: catcats
  • Tense: runran, running
  • Comparison: goodbetter, best

Crosstem stores inflectional paradigms as lemma → forms mappings:

{
  "run": {
    "pos": "V",
    "forms": {
      "runs": [{"pos": "V", "features": "PRS;3;SG"}],
      "running": [{"pos": "V", "features": "V.PTCP;PRS"}],
      "ran": [{"pos": "V", "features": "PST"}]
    }
  }
}

This enables both lemmatization (running → run) and paradigm generation (run → all forms).

3. Cross-Lingual Etymology (Historical Linguistics)

Etymology traces how words evolve and transfer across languages:

  • Borrowing: English portmanteau ← Middle French portemanteau
  • Cognates: Dutch woordenboek ↔ German Wörterbuch (shared Germanic ancestor)
  • Inheritance: Latin mater → French mère, Italian madre, Spanish madre

Crosstem represents this as a multilingual graph with typed edges:

English: "portmanteau" ──borrowed_from──→ Middle French: "portemanteau"
                                               │
                                        has_root: "porter" (to carry)
                                               │
                                        has_root: "manteau" (coat)

Implementation

All three frameworks are implemented as fast JSON lookups with graph traversal algorithms:

  1. Preprocessing: TSV/CSV data → optimized JSON dictionaries
  2. Indexing: Multi-level indices (word → derivations, lemma → inflections, term+lang → etymology)
  3. Traversal: BFS for word families, chain-following for etymology
  4. Filtering: POS preference (verbs), length minimization, cycle detection

Result: ~0.0008ms per-word stemming on benchmark runs with Rust acceleration enabled.

Stemming Algorithm Details

Multi-Hop BFS Traversal

Crosstem uses a sophisticated breadth-first search algorithm to find the optimal root:

# Example: organizational → organization → organize
organizational  (14 chars, ADJ)
     (depth 1)
organization    (12 chars, N, productivity=16)  candidate
     (depth 2)  
organize        (8 chars, V, productivity=13)   BEST (verb, shortest)

Algorithm steps:

  1. Start from input word, add to queue
  2. Expand all DERIVED_FROM relationships (parents in morphology graph)
  3. Score each candidate:
    • Length (shorter is better)
    • POS (verbs score -10, nouns -5)
    • Depth (penalize by +2 per hop)
  4. Filter by productivity threshold (language-specific):
    • English: Verbs ≥5, Others ≥9
    • French/Italian: Verbs ≥4, Others ≥5
    • German: Verbs ≥4, Others ≥3 (compound-heavy)
    • Spanish/Portuguese: Verbs ≥3, Others ≥4
    • Russian/Slavic: Verbs ≥3, Others ≥2-3 (lower productivity)
  5. Continue traversal through low-productivity nodes (enables multi-hop)
  6. Return lowest-scoring candidate that's shorter than input or a verb

Productivity-Based Filtering

Problem: MorphyNet contains archaic roots (e.g., hap) and data errors (e.g., democracydemocrat)

Solution: Use productivity as a quality signal. Words with many derivations are more likely to be modern, correct roots.

Examples (English thresholds: V≥5, N≥9):

Word Productivity POS Threshold Result
red 18 derivations N ≥9 ✅ PASS (productive noun)
run 33 derivations V ≥5 ✅ PASS (very productive verb)
destruct 6 derivations V ≥5 ✅ PASS (verb threshold)
hap 8 derivations N ≥9 ❌ FILTERED (archaic)
democrat 7 derivations N ≥9 ❌ FILTERED (data error)

This data-driven approach avoids hard-coded rules while maintaining quality.

Language-Specific Calibration: Thresholds are adjusted for each language based on morphological richness. Languages with lower overall productivity (Russian, Spanish) use lower thresholds to avoid over-filtering, while English uses higher thresholds due to rich derivational data.

Why This Works

Traditional stemmers fail because they use brittle suffix rules:

# Porter stemmer rule: -ation → (remove suffix)
"organization"  "organ"  # Lost the "ize" (overstemming)

# Lancaster stemmer: even more aggressive
"organization"  "org"    # Completely loses meaning

Crosstem succeeds because it uses linguistic knowledge:

# Graph data knows: organization DERIVED_FROM organize
"organization"  "organize"  # Preserves semantic relationship

Data Sources

  • MorphyNet v1.0: Derivational and inflectional morphology (CC BY-SA 4.0)
  • Wiktionary: Cross-lingual etymology data (CC BY-SA 3.0)

Citation

If you use this library in your research, please cite:

@software{crosstem2025,
  title={Crosstem: Comprehensive Morphological Analysis for Python},
  author={Avinash Bhojanapalli},
  year={2025},
  url={https://github.com/droidmaximus/crosstem},
  note={A Python package for derivational stemming, inflectional analysis, and cross-lingual etymology}
}

@inproceedings{batsuren2021morphynet,
  title={MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology},
  author={Batsuren, Khuyagbaatar and Bella, Gábor and Giunchiglia, Fausto},
  booktitle={Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology},
  pages={39--48},
  year={2021}
}

@misc{wiktionary2025,
  title={Wiktionary, The Free Dictionary},
  author={{Wiktionary contributors}},
  year={2025},
  url={https://en.wiktionary.org/},
  note={Etymology data extracted from Wiktionary dumps}
}

License

  • Code: MIT License
  • Data: CC BY-SA 4.0 (MorphyNet), CC BY-SA 3.0 (Wiktionary)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

crosstem-1.0.0-cp312-cp312-win_amd64.whl (20.6 MB view details)

Uploaded CPython 3.12Windows x86-64

File details

Details for the file crosstem-1.0.0-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: crosstem-1.0.0-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 20.6 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for crosstem-1.0.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 99dd7813151de17084ad7bcd50ad880042d23dc8905b52c68165c0ea77423645
MD5 f17f383e0b51e66d01826a917b6ac7f0
BLAKE2b-256 a16b4bec447174fccef6db6728b711b18c38040099a3d9f570ddc68253892f43

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page