Context-aware Norwegian morphological alternative generator
AltMorph: Context-Aware Norwegian Morphological Alternative Generator
AltMorph is a tool for expanding Norwegian text by finding morphological alternatives for each word. It combines the Ordbank API with NLP techniques to provide alternatives that fit the surrounding context.
Outputs follow the AltMetrics bracket format, listing options as [original|alt1|alt2].
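The bracket format is easy to consume from other programs. The following is a minimal sketch of a reader for it, not part of AltMorph itself: the function name `parse_alternatives` is hypothetical, and it assumes that brackets are never nested and that `|` only appears as a separator.

```python
import re
from typing import List, Union

# Matches one bracketed group such as "[matta|matten]" (assumes no nesting).
BRACKET_RE = re.compile(r"\[([^\[\]]+)\]")


def parse_alternatives(text: str) -> List[Union[str, List[str]]]:
    """Split bracketed output into plain strings and lists of alternatives."""
    parts: List[Union[str, List[str]]] = []
    pos = 0
    for match in BRACKET_RE.finditer(text):
        if match.start() > pos:
            # Plain text between two bracket groups is kept verbatim.
            parts.append(text[pos:match.start()])
        parts.append(match.group(1).split("|"))
        pos = match.end()
    if pos < len(text):
        parts.append(text[pos:])
    return parts


print(parse_alternatives("[Katta|Katten] ligger på [matta|matten]."))
# [['Katta', 'Katten'], ' ligger på ', ['matta', 'matten'], '.']
```

The first entry of each list is the original word, matching the [original|alt1|alt2] ordering described above.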
✨ Features
- 🎯 Context-sensitive filtering: Uses BERT-based acceptability scoring for ambiguous cases
- 📚 Lemma coverage: Finds morphological forms across multiple lemmas
- 🔍 Position-specific analysis: Looks at each word in its syntactic context
- ⚡ Caching: Persistent file-based caching to improve performance
- 🗣️ Multiple verbosity levels: From silent operation to detailed pipeline insights
- 🌐 Language support: Norwegian Bokmål (nob) and Nynorsk (nno)
- 🧠 POS-aware: Uses NbAiLab BERT models for part-of-speech tagging
- 🚀 Parallel processing: Runs concurrent API calls
🛠️ Installation
Prerequisites
- Python 3.8+
- Ordbank API key (free registration at Ordbank)
Install from PyPI
pip install altmorph
Install from Source (development)
pip install -e .
Optional: Sync Development Requirements
pip install -r requirements.txt
Get API Key
- Register at https://www.ordbank.no/
- Obtain your API key from your account dashboard
- Set the environment variable:
export ORDBANK_API_KEY="your_api_key_here"
Or pass it directly with the --api_key flag.
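If you wrap AltMorph in your own script, the same precedence can be reproduced in a few lines. This is a sketch under the assumption that an explicitly passed key wins over the environment variable; the helper name `resolve_api_key` is hypothetical and not part of the package.

```python
import os
from typing import Optional


def resolve_api_key(cli_value: Optional[str] = None) -> str:
    """Return the explicit key if given, otherwise fall back to ORDBANK_API_KEY."""
    # Assumption: an explicit value takes priority over the environment.
    key = cli_value or os.environ.get("ORDBANK_API_KEY")
    if not key:
        raise RuntimeError(
            "No Ordbank API key found: set ORDBANK_API_KEY or pass --api_key"
        )
    return key
```

Failing early with a clear message avoids a less obvious error later when the first API request is made.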
🚀 Quick Start
After installation you can invoke the CLI either with the altmorph command or via python -m altmorph.
Basic Usage
python -m altmorph --sentence "Katta ligger på matta." --lang nob
Output:
[Katta|Katten] ligger på [matta|matten].
With API Key
python -m altmorph \
--sentence "Katta ligger på matta." \
--lang nob \
--api_key "your_api_key_here"
📖 Usage Examples
Context-Sensitive Behaviour
The tool takes sentence context into account:
Simple example:
python -m altmorph --sentence "Katta ligger på matta." --lang nob
# Output: [Katta|Katten] ligger på [matta|matten].
# Shows different morphological forms for the same words
Complex context:
python -m altmorph --sentence "Katta ligger på matta i stua." --lang nob
# Output: [Katta|Katten] ligger på [matta|matten] i stua.
# BERT-based filtering keeps alternatives that work in the sentence
Position-Specific Analysis
python -m altmorph --sentence "Katta ligger på matta." --lang nob
# Each word occurrence is analyzed in its specific syntactic context
🎛️ Command Line Options
| Option | Default | Description |
|---|---|---|
| --sentence | required | Input sentence to process |
| --lang | nob | Language code (nob or nno) |
| --api_key | $ORDBANK_API_KEY | Ordbank API key |
| --verbosity | 0 | Verbosity level (0-3) |
| --logit-threshold | 3.0 | BERT acceptability threshold |
| --timeout | 6.0 | HTTP timeout per request |
| --max_workers | 4 | Parallel API requests |
| --no-cache | False | Disable caching |
| --delete-cache | False | Clear cache and exit |
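For readers who want to mirror these options in a wrapper script, the table can be expressed as an `argparse` parser. This is a sketch built only from the table above, not AltMorph's actual argument handling; `build_parser` and `parse_cli` are hypothetical names. Because the Testing section runs --delete-cache without a sentence, the sketch checks for --sentence after parsing instead of marking it required.

```python
import argparse
import os
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose flags and defaults follow the options table."""
    parser = argparse.ArgumentParser(prog="altmorph")
    parser.add_argument("--sentence", help="Input sentence to process")
    parser.add_argument("--lang", default="nob", choices=["nob", "nno"])
    parser.add_argument("--api_key", default=os.environ.get("ORDBANK_API_KEY"))
    parser.add_argument("--verbosity", type=int, default=0, choices=range(4))
    parser.add_argument("--logit-threshold", type=float, default=3.0)
    parser.add_argument("--timeout", type=float, default=6.0)
    parser.add_argument("--max_workers", type=int, default=4)
    parser.add_argument("--no-cache", action="store_true")
    parser.add_argument("--delete-cache", action="store_true")
    return parser


def parse_cli(argv: List[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: a sentence is only needed when we are not just clearing the cache.
    if not args.delete_cache and not args.sentence:
        parser.error("--sentence is required unless --delete-cache is given")
    return args


args = parse_cli(["--sentence", "Katta ligger på matta."])
print(args.lang, args.logit_threshold, args.max_workers)
# nob 3.0 4
```

Note that `argparse` converts hyphens in flag names to underscores, so --logit-threshold is read as `args.logit_threshold`.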
🔊 Verbosity Levels
Level 0: Quiet (Default)
python -m altmorph --sentence "Katta ligger på matta." --verbosity 0
Output: Just the final result
[Katta|Katten] ligger på [matta|matten].
Level 1: Normal
python -m altmorph --sentence "Katta ligger på matta." --verbosity 1
Output: Basic progress information
2025-XX-XX 12:00:00 INFO Loading POS tagger...
2025-XX-XX 12:00:02 INFO POS tagger loaded
[Katta|Katten] ligger på [matta|matten].
Level 2: Verbose
python -m altmorph --sentence "Katta ligger på matta." --verbosity 2
Output: Processing details (POS tags, API lookups, alternatives found)
🎯 PROCESSING: Katta ligger på matta.
📝 WORDS: ['katta', 'ligger', 'på', 'matta']
🏷️ POS TAGS:
katta: NOUN
ligger: VERB
på: ADP
matta: NOUN
📡 API LOOKUP: katta (POS: NOUN)
✅ katta: 2 alternatives: ['katta', 'katten']
...
✨ RESULT: [Katta|Katten] ligger på [matta|matten].
Level 3: Very Verbose
python -m altmorph --sentence "Katta ligger på matta." --verbosity 3
Output: Everything including cache operations, lemma analysis, BERT filtering
🎯 PROCESSING: Katta ligger på matta.
📝 FOUND 2 LEMMAS for katta
💾 CACHE HIT: lemmas for 'katta' (POS: NOUN)
🧠 ACCEPTABILITY FILTERING (threshold: 3.00)
🔍 ANALYZING: katta (position 0)
Context: [Katta] ligger på matta.
Alternatives: ['katta', 'katten']
📊 CACHE STATS: 8 hits, 0 misses (100.0% hit rate)
...
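The four levels behave like a simple threshold: each message has a minimum level and is shown only when the chosen verbosity reaches it. A minimal sketch of that idea follows; the `Reporter` class is hypothetical, and the level assigned to each message is an assumption based on the sample output above.

```python
from typing import List


class Reporter:
    """Print a message only if the configured verbosity is high enough."""

    def __init__(self, verbosity: int = 0) -> None:
        self.verbosity = verbosity
        self.lines: List[str] = []  # kept so callers can inspect what was shown

    def say(self, min_level: int, message: str) -> None:
        if self.verbosity >= min_level:
            self.lines.append(message)
            print(message)


reporter = Reporter(verbosity=2)
reporter.say(1, "Loading POS tagger...")      # shown at levels 1-3
reporter.say(2, "API LOOKUP: katta")          # shown at levels 2-3
reporter.say(3, "CACHE HIT: lemmas for 'katta'")  # shown at level 3 only
```

With verbosity 2, the first two messages are printed and the cache message is suppressed.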
🗂️ Caching System
AltMorph includes caching to improve performance:
- Cache location: ~/.ordbank_cache/
- Cache types: Lemma searches and inflection data
- Performance: ~95%+ hit rate for repeated usage
- Management:
  - --no-cache: Disable caching
  - --delete-cache: Clear all cache files
Performance impact:
- First run: ~3-4 seconds (API calls)
- Cached runs: ~0.5 seconds
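To illustrate what a persistent file-based cache with hit and miss counters can look like, here is a small sketch. It is not AltMorph's cache implementation: the `FileCache` class, the JSON file layout, and the hashed file names are all assumptions made for the example, and it writes to a temporary directory rather than ~/.ordbank_cache/.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


class FileCache:
    """Store JSON-serializable values as one file per key."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)
        self.hits = 0
        self.misses = 0

    def _path(self, key: str) -> Path:
        # Hash the key so any string is safe to use as a file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get(self, key: str) -> Optional[Any]:
        path = self._path(key)
        if path.exists():
            self.hits += 1
            return json.loads(path.read_text(encoding="utf-8"))
        self.misses += 1
        return None

    def set(self, key: str, value: Any) -> None:
        text = json.dumps(value, ensure_ascii=False)
        self._path(key).write_text(text, encoding="utf-8")

    def clear(self) -> None:
        for path in self.directory.glob("*.json"):
            path.unlink()


cache = FileCache(Path(tempfile.mkdtemp()))
if cache.get("lemmas:katta:NOUN") is None:        # first lookup is a miss
    cache.set("lemmas:katta:NOUN", ["katt", "katte"])
print(cache.get("lemmas:katta:NOUN"), cache.hits, cache.misses)
# ['katt', 'katte'] 1 1
```

The cached lemma list in the example is placeholder data, not real Ordbank output.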
🧠 Technical Details
Code Architecture Deep-Dive
📖 Complete Code Walkthrough - Detailed technical explanation of how AltMorph works for developers who need implementation details.
Architecture
- Input Processing: Tokenization preserving whitespace and punctuation
- POS Tagging: NbAiLab/nb-bert-base-pos for accurate grammatical analysis
- Lemma Discovery: Comprehensive search across all relevant Ordbank lemmas
- Inflection Analysis: Full morphological paradigm extraction
- Acceptability Scoring: NbAiLab/nb-bert-base for context-sensitive filtering
- Output Generation: Case-preserving alternative presentation
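The first and last steps of this pipeline can be sketched without any model or API access. The example below shows one way to tokenize while preserving whitespace and punctuation, and to carry the original capitalization over to an alternative. The regular expression and the function names are assumptions for illustration, not AltMorph's own code.

```python
import re
from typing import List

# Words, runs of whitespace, and single punctuation marks each become a token.
TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text so that joining the tokens reproduces the input exactly."""
    return TOKEN_RE.findall(text)


def match_case(original: str, alternative: str) -> str:
    """Give the alternative the same capitalization pattern as the original."""
    if len(original) > 1 and original.isupper():
        return alternative.upper()
    if original[:1].isupper():
        return alternative[:1].upper() + alternative[1:]
    return alternative


tokens = tokenize("Katta ligger på matta.")
print(tokens)
# ['Katta', ' ', 'ligger', ' ', 'på', ' ', 'matta', '.']
print(match_case("Katta", "katten"))
# Katten
```

Because every character falls into exactly one of the three token classes, `"".join(tokenize(text)) == text` holds for any input, which makes it safe to substitute words and reassemble the sentence.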
Models Used
- POS Tagging: NbAiLab/nb-bert-base-pos
- Acceptability: NbAiLab/nb-bert-base
- API: Ordbank - Norwegian morphological database
Key Algorithms
- Comprehensive lemma matching: Finds all lemmas containing the target word
- Position-specific analysis: Each word occurrence analyzed in context
- Logit-based filtering: Acceptability thresholding (default: 3.0)
- Prioritization: Balances morphological coverage with contextual fit
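The README does not spell out how the threshold is applied, so the following is only one plausible reading: keep the original word, plus every alternative whose score is within the threshold of the best-scoring candidate. The function name and the scores are invented for illustration; real scores would come from the BERT model.

```python
from typing import Dict, List


def filter_by_logit(
    scores: Dict[str, float], original: str, threshold: float = 3.0
) -> List[str]:
    """Keep the original and any candidate within `threshold` of the best score."""
    best = max(scores.values())
    # Assumption: the original word is always kept, whatever its score.
    return [
        word
        for word, score in scores.items()
        if word == original or best - score <= threshold
    ]


# Hypothetical logits for candidates in the first position of the sentence.
example_scores = {"katta": 9.1, "katten": 8.4, "katter": 2.0}
print(filter_by_logit(example_scores, original="katta"))
# ['katta', 'katten']
```

Under this reading, a lower threshold is stricter and a higher threshold lets more alternatives through.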
📊 Performance
Typical Performance
- Single sentence: 0.5-4 seconds (depending on cache state)
- Cache hit rate: Typically 95%+ for repeated usage
- API efficiency: Parallel requests with batching
- Memory usage: ~500MB (loaded BERT models)
Scaling Considerations
- Concurrent requests: Configurable via --max_workers
- Timeout handling: Robust error recovery with retries
- Rate limiting: Respectful API usage patterns
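A bounded thread pool with per-word retries is a common way to combine these points. The sketch below uses a stand-in `fetch` callable instead of the real Ordbank client, so no network access is needed. The helper names, the retry count, and the backoff scheme are assumptions, not AltMorph's actual behaviour.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fetch_with_retries(
    fetch: Callable[[str], List[str]],
    word: str,
    retries: int = 2,
    backoff: float = 0.0,
) -> List[str]:
    """Call fetch(word), retrying on any exception up to `retries` extra times."""
    for attempt in range(retries + 1):
        try:
            return fetch(word)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: let the caller see the error
            time.sleep(backoff * (attempt + 1))  # wait a little longer each time
    return []  # only reached if retries is negative


def lookup_all(
    words: List[str], fetch: Callable[[str], List[str]], max_workers: int = 4
) -> Dict[str, List[str]]:
    """Look up all words concurrently with at most `max_workers` threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda w: fetch_with_retries(fetch, w), words)
        return dict(zip(words, results))


def fake_fetch(word: str) -> List[str]:
    # Stand-in for an HTTP request; a real client would pass its timeout here.
    return [word.upper()]


print(lookup_all(["katta", "matta"], fake_fetch, max_workers=2))
# {'katta': ['KATTA'], 'matta': ['MATTA']}
```

Keeping `max_workers` small limits the number of simultaneous requests, which is one simple way to stay polite towards the API.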
🛠️ Tools
AltMorph includes additional helpers for batch processing and debugging:
- corpus_tools/process_jsonl.py: Batch-process JSONL files by adding morphological alternatives to text fields (resume-aware, batched).
- corpus_tools/create_training_examples.py: Sample one variant per alternative block to generate unnorm training strings.
- corpus_tools/stream_ncc_text.py: Stream Stortinget speeches from the NCC dataset on Hugging Face.
- scripts/pos_tester.py: Compare POS tagging across Norwegian NLP models.
- scripts/hf_probe_fields.py: Inspect Hugging Face dataset metadata and stream example rows.
Browse corpus_tools/README.md and scripts/README.md for more details.
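The idea of sampling one variant per alternative block can be shown in a few lines. This sketch is independent of the actual create_training_examples.py script: the function name is hypothetical, and it assumes uniform random choice among the alternatives and no nested brackets.

```python
import random
import re

# Matches one bracketed group such as "[matta|matten]" (assumes no nesting).
BRACKET_RE = re.compile(r"\[([^\[\]]+)\]")


def sample_variant(text: str, rng: random.Random) -> str:
    """Replace every [a|b|c] block with one randomly chosen alternative."""
    return BRACKET_RE.sub(lambda m: rng.choice(m.group(1).split("|")), text)


rng = random.Random(0)  # a seeded generator makes the sampling reproducible
print(sample_variant("[Katta|Katten] ligger på [matta|matten].", rng))
```

Each call yields one of the four possible sentences, for example "Katten ligger på matta."; which one you get depends on the seed.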
🔧 Development
Project Structure
altmorph/
├── __init__.py # Main application / CLI
├── data/ # Packaged lemma resources
├── corpus_tools/ # Corpus cleaning scripts and sample data
│ ├── process_jsonl.py # JSONL batch processor
│ ├── create_training_examples.py
│ ├── stream_ncc_text.py
│ └── data/ # Sample + placeholder corpora
├── docs/ # Developer documentation
│ └── code_explanation.md
├── legacy/ # Archived scripts kept for reference
├── scripts/ # Standalone utilities (POS tester, HF helper)
├── README.md # Main documentation
├── pyproject.toml # Packaging metadata
├── requirements.txt # Dependencies
├── setup.py # Legacy packaging shim
└── ~/.ordbank_cache/ # Cache directory (auto-created)
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure code follows existing style
- Submit a pull request
Testing
# Run the automated test suite
pytest
# Test basic functionality
python -m altmorph --sentence "Katta ligger på matta." --lang nob
# Test cache functionality
python -m altmorph --delete-cache
python -m altmorph --sentence "Katta ligger på matta." --lang nob --verbosity 3
# Test without cache
python -m altmorph --sentence "Katta ligger på matta." --lang nob --no-cache
# Test POS comparison tool
python scripts/pos_tester.py --text "Katta ligger på matta."
# Test batch processing with sample data
python corpus_tools/process_jsonl.py \
--input_file corpus_tools/data/samples/sample_input.jsonl \
--output_file tmp/test_output.jsonl \
--verbosity 2
🚢 Release Guide
Ready to publish? Follow the step-by-step instructions in docs/RELEASING.md to build,
test, and upload the package (v0.1.0) to PyPI.
🤝 Related Projects
- altmetrics: Depends on AltMorph's output format for Norwegian text evaluation. Lets you calculate WER, CER, BLEU, and chrF based on valid morphological alternatives.
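To give a sense of why the bracket format is useful for evaluation, here is a sketch of a word error rate that accepts any listed alternative as correct. It is not altmetrics' code and makes simplifying assumptions: tokens are split on whitespace, comparison is case-insensitive, trailing punctuation is ignored, and each bracket block stands for exactly one word.

```python
from typing import List, Set

PUNCTUATION = ".,;:!?"


def reference_slots(bracketed: str) -> List[Set[str]]:
    """Turn bracketed text into one set of accepted words per position."""
    slots = []
    for token in bracketed.split():
        core = token.strip(PUNCTUATION)
        if core.startswith("[") and core.endswith("]"):
            slots.append({alt.lower() for alt in core[1:-1].split("|")})
        else:
            slots.append({core.lower()})
    return slots


def wer_with_alternatives(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length."""
    ref = reference_slots(reference)
    hyp = [t.strip(PUNCTUATION).lower() for t in hypothesis.split()]
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, slot in enumerate(ref, start=1):
        cur = [i]
        for j, word in enumerate(hyp, start=1):
            cost = 0 if word in slot else 1  # any listed alternative is a match
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1] / max(len(ref), 1)


reference = "[Katta|Katten] ligger på [matta|matten]."
print(wer_with_alternatives(reference, "Katten ligger på matta."))
# 0.0
print(wer_with_alternatives(reference, "Katter ligger på matta."))
# 0.25
```

A plain WER against "Katta ligger på matta." would penalize "Katten", even though both forms are valid Bokmål; scoring against the alternatives avoids that.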
📄 License
🙏 Acknowledgments
- Ordbank Team: For providing the comprehensive Norwegian morphological API
- Clarino/UiB: For hosting the API infrastructure
- NbAiLab: For the Norwegian BERT models
- AltMorph: Idea and coding by Magnus Breder Birkenes and Per Egil Kummervold