mawo-core
Unified API for Russian NLP - combines razdel, pymorphy3, slovnet, natasha into a single, spaCy-like interface.
Features
- Unified API - Single entry point for all MAWO libraries
- Rich Objects - Document/Token/Span with lazy evaluation
- Custom Vocabulary - Runtime word additions without DAWG rebuilding
- Modular Pipeline - Compose only the components you need
- spaCy-compatible - Familiar API for spaCy users
Installation
# Core (tokenization + morphology)
pip install mawo-core
# Full (with NER and syntax)
pip install "mawo-core[all]"
Quick Start
from mawo import Russian
# Create analyzer
nlp = Russian()
# Analyze text
doc = nlp("Александр Пушкин родился в Москве")
# Access tokens
for token in doc.tokens:
    print(token.text, token.lemma, token.pos, token.tag)
# Access entities (requires mawo-slovnet)
for ent in doc.entities:
    print(ent.text, ent.label)
# Access sentences
for sent in doc.sentences:
    print(sent.text)
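Because entities need the optional mawo-slovnet package, it can help to check for it before touching doc.entities. The sketch below uses only the standard library; the import name "mawo_slovnet" is an assumption derived from the package name and is not confirmed by this README.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return find_spec(module_name) is not None


# "mawo_slovnet" is an assumed import name for the mawo-slovnet package.
HAS_NER = is_installed("mawo_slovnet")
```

A caller could then wrap the entity loop in `if HAS_NER:` and skip it otherwise.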
Advanced Usage
Rich Token Objects
doc = nlp("Я читал интересную книгу")
for token in doc.tokens:
    # The values in the comments below are for the token "читал"
    # Morphology (from pymorphy3)
    print(token.lemma)      # "читать"
    print(token.pos)        # "VERB"
    print(token.aspect)     # "imperfective"
    print(token.tense)      # "past"
    print(token.gender)     # "masc"
    # Syntax (from slovnet)
    print(token.dep)        # "ROOT"
    print(token.head)       # None
    # Context
    print(token.children)   # [книгу]
    print(token.ancestors)  # []
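The feature list mentions lazy evaluation for Document/Token/Span objects. How mawo-core implements this is not described here, so the following is only a minimal sketch of the general idea: the analysis runs on first attribute access and is cached afterwards. `LazyToken` and `toy_analyzer` are hypothetical names, and the lookup table stands in for a real morphological analyzer.

```python
from functools import cached_property


class LazyToken:
    """Hypothetical token that computes its analysis only on first access."""

    def __init__(self, text, analyze):
        self.text = text
        self._analyze = analyze  # callable: str -> dict of attributes
        self.calls = 0           # counts how often the analyzer actually ran

    @cached_property
    def analysis(self):
        # Runs once; cached_property stores the result on the instance.
        self.calls += 1
        return self._analyze(self.text)

    @property
    def lemma(self):
        return self.analysis["lemma"]

    @property
    def pos(self):
        return self.analysis["pos"]


def toy_analyzer(word):
    # Stand-in lookup table, not a real morphological analyzer.
    table = {"читал": {"lemma": "читать", "pos": "VERB"}}
    return table.get(word, {"lemma": word, "pos": "X"})


token = LazyToken("читал", toy_analyzer)
calls_before = token.calls           # nothing has been analyzed yet
lemma, pos = token.lemma, token.pos  # first access triggers one analysis
calls_after = token.calls            # still one call after two attribute reads
```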
Adjective-Noun Pairs
# Intentionally ungrammatical input: feminine adjective with a masculine noun
doc = nlp("красивая дом")
for pair in doc.adjective_noun_pairs:
    print(pair.adjective)     # Token("красивая")
    print(pair.noun)          # Token("дом")
    print(pair.agreement)     # "incorrect"
    print(pair.gender_match)  # False
    print(pair.suggestion)    # "красивый дом"
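To make the gender check concrete, here is a small standalone sketch of the comparison behind a flag like gender_match. It is not mawo-core's implementation: `Word` and `gender_agrees` are hypothetical, and the gender tags ("masc", "femn") are assumed to follow pymorphy-style naming.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Word:
    text: str
    pos: str
    gender: Optional[str]  # assumed tags: "masc", "femn", "neut"; None if absent


def gender_agrees(adjective: Word, noun: Word) -> bool:
    """Return True when both words carry the same known gender tag."""
    if adjective.gender is None or noun.gender is None:
        # Nothing to compare (for example plural forms), so report no mismatch.
        return True
    return adjective.gender == noun.gender


wrong = gender_agrees(Word("красивая", "ADJF", "femn"), Word("дом", "NOUN", "masc"))
right = gender_agrees(Word("красивый", "ADJF", "masc"), Word("дом", "NOUN", "masc"))
```

Full agreement in Russian also involves number and case; the sketch covers gender only.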
Verb Aspects
doc = nlp("Я прочитал книгу")
for verb in doc.verbs:
    print(verb.word)           # "прочитал"
    print(verb.aspect)         # "perfective"
    print(verb.is_perfective)  # True
    print(verb.aspect_pair)    # "читать"
Custom Vocabulary
from mawo import Russian
nlp = Russian()
# Add single word
nlp.vocab.add(
    "блокчейн",
    pos="NOUN",
    gender="masc",
    animacy="inan",
    tags={"domain": "IT"},
)
# Load domain dictionary
nlp.vocab.load_domain("IT") # блокчейн, API, фреймворк...
# Load from file
nlp.vocab.load("tech_terms.txt")
# Now custom words work
doc = nlp("Блокчейн это технология")
print(doc.tokens[0].pos) # "NOUN" (from custom vocab)
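One way to support runtime additions without rebuilding a compiled dictionary is to keep custom words in an ordinary in-memory mapping that is consulted before the main dictionary. Whether mawo-core works this way is an assumption; `OverlayVocab` and the fallback function below are hypothetical and only illustrate the lookup order.

```python
class OverlayVocab:
    """Hypothetical vocabulary: custom entries shadow a fallback analyzer."""

    def __init__(self, fallback):
        self._custom = {}          # runtime additions, keyed by lowercased word
        self._fallback = fallback  # callable used for every other word

    def add(self, word, **attrs):
        self._custom[word.lower()] = dict(attrs)

    def lookup(self, word):
        key = word.lower()
        if key in self._custom:
            return self._custom[key]  # custom entry wins
        return self._fallback(key)    # otherwise defer to the main dictionary


def unknown_word(word):
    # Stand-in for a real dictionary lookup; it knows nothing.
    return {"pos": "UNKN"}


vocab = OverlayVocab(unknown_word)
vocab.add("блокчейн", pos="NOUN", gender="masc", animacy="inan")
custom_pos = vocab.lookup("Блокчейн")["pos"]      # found in the overlay
fallback_pos = vocab.lookup("технология")["pos"]  # falls through to the fallback
```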
Custom Pipeline
from mawo import Pipeline
# Minimal pipeline (fast)
nlp = Pipeline([
    "tokenizer",      # razdel
    "morphologizer",  # pymorphy3
])
# Full pipeline
nlp = Pipeline([
    "tokenizer",
    "morphologizer",
    "ner",     # slovnet
    "parser",  # slovnet syntax
])
# Custom pipeline (MyCustomComponent is a user-defined component, not shown here)
nlp = Pipeline([
    "tokenizer",
    ("custom", MyCustomComponent()),
    "morphologizer",
])
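The component interface expected by Pipeline is not documented in this README. As an illustration of the composition pattern only, the sketch below treats every component as a callable that takes and returns a document dict, resolves string names through a registry, and accepts (name, component) tuples for custom steps. `MiniPipeline`, `REGISTRY`, and all components here are hypothetical.

```python
from typing import Any, Callable, Dict

Doc = Dict[str, Any]
Component = Callable[[Doc], Doc]

REGISTRY: Dict[str, Component] = {}


def register(name):
    """Decorator that makes a component available under a string name."""
    def decorator(fn):
        REGISTRY[name] = fn
        return fn
    return decorator


@register("tokenizer")
def tokenizer(doc):
    doc["tokens"] = doc["text"].split()  # whitespace split, for illustration
    return doc


@register("lowercaser")
def lowercaser(doc):
    doc["tokens"] = [t.lower() for t in doc["tokens"]]
    return doc


class MiniPipeline:
    def __init__(self, steps):
        self.components = []
        for step in steps:
            if isinstance(step, str):
                self.components.append((step, REGISTRY[step]))  # built-in by name
            else:
                name, component = step                          # custom tuple
                self.components.append((name, component))

    def __call__(self, text):
        doc = {"text": text}
        for _name, component in self.components:
            doc = component(doc)  # components run in the order given
        return doc


def drop_short(doc):
    # Custom step: remove one-character tokens.
    doc["tokens"] = [t for t in doc["tokens"] if len(t) > 1]
    return doc


nlp = MiniPipeline(["tokenizer", ("custom", drop_short), "lowercaser"])
result = nlp("Я читал Интересную книгу")
```

Placing the custom step between two named steps mirrors the ordering in the example above: each component sees the output of the one before it.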
Entity Preservation
from mawo import Russian
nlp = Russian()
# Check entity preservation in translation
source = nlp("Alexander Pushkin was born in Moscow")
target = nlp("Александр Пушкин родился в Москве")
matches = nlp.match_entities(source, target)
for match in matches:
    print(match.source)      # Entity("Alexander Pushkin", "PER")
    print(match.target)      # Entity("Александр Пушкин", "PER")
    print(match.status)      # "matched"
    print(match.confidence)  # 0.95
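How match_entities pairs entities and computes confidence is not specified here. A deliberately naive sketch of the idea is to pair each source entity with the first unused target entity that has the same label, and to report the rest as missing. `Entity`, `Match`, and `match_by_label` are hypothetical, and the sketch does no transliteration or confidence scoring.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    text: str
    label: str


@dataclass(frozen=True)
class Match:
    source: Entity
    target: Optional[Entity]
    status: str  # "matched" or "missing"


def match_by_label(source: List[Entity], target: List[Entity]) -> List[Match]:
    remaining = list(target)  # target entities not yet paired
    matches = []
    for ent in source:
        found = next((t for t in remaining if t.label == ent.label), None)
        if found is not None:
            remaining.remove(found)  # each target entity is used at most once
            matches.append(Match(ent, found, "matched"))
        else:
            matches.append(Match(ent, None, "missing"))
    return matches


src = [Entity("Alexander Pushkin", "PER"), Entity("Moscow", "LOC")]
tgt = [Entity("Александр Пушкин", "PER")]  # the location was lost in translation
report = match_by_label(src, tgt)
```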
Performance
- Tokenization: ~5000 tokens/sec
- Morphology: ~5000 words/sec
- NER: ~1000 tokens/sec
- Memory: ~60 MB (with slovnet)
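The figures above do not state the hardware or text they were measured on, so throughput on another machine may differ. A small timing helper like the following sketch can reproduce a tokens/sec number for any callable that returns a sequence of tokens. `tokens_per_second` is a hypothetical helper, and `str.split` stands in for a real tokenizer.

```python
import time


def tokens_per_second(process, texts, repeats=3):
    """Best-of-N throughput for a callable that returns a sequence of tokens."""
    best = float("inf")
    total_tokens = 0
    for _ in range(repeats):
        start = time.perf_counter()
        total_tokens = sum(len(process(text)) for text in texts)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)  # keep the fastest run to reduce noise
    return total_tokens / best if best > 0 else float("inf")


sample = ["Александр Пушкин родился в Москве"] * 1000
rate = tokens_per_second(str.split, sample)
```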
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Code quality
black .
ruff check .
mypy mawo
License
MIT License - see LICENSE for details.
Part of MAWO Ecosystem
- mawo-pymorphy3 - Morphological analysis
- mawo-razdel - Tokenization
- mawo-slovnet - NER and syntax
- mawo-natasha - Embeddings
- mawo-grammar - Grammar checking