Unified API for Russian NLP - combines razdel, pymorphy3, slovnet, natasha

mawo-core

Unified API for Russian NLP: combines razdel, pymorphy3, slovnet, and natasha into a single spaCy-like interface.

Features

  • Unified API - Single entry point for all MAWO libraries
  • Rich Objects - Document/Token/Span with lazy evaluation
  • Custom Vocabulary - Runtime word additions without DAWG rebuilding
  • Modular Pipeline - Compose only the components you need
  • spaCy-compatible - Familiar API for spaCy users
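
To make the "lazy evaluation" feature concrete, here is a minimal sketch of the general idea. It is not taken from mawo-core's source: `LazyToken`, `_analyze`, and the lowercase "lemma" are hypothetical stand-ins, used only to show an expensive analysis being deferred until an attribute is first read.

```python
from functools import cached_property


class LazyToken:
    """Hypothetical token whose analysis runs only on first access."""

    analysis_calls = 0  # counts how often the expensive step actually runs

    def __init__(self, text: str) -> None:
        self.text = text

    def _analyze(self) -> dict:
        # Stand-in for a real morphological analysis (assumption: lowercase only).
        LazyToken.analysis_calls += 1
        return {"lemma": self.text.lower()}

    @cached_property
    def analysis(self) -> dict:
        # Computed on first access, then cached on the instance.
        return self._analyze()

    @property
    def lemma(self) -> str:
        return self.analysis["lemma"]


token = LazyToken("Москве")
calls_before = LazyToken.analysis_calls  # nothing has been analyzed yet
first = token.lemma                      # triggers the analysis
second = token.lemma                     # served from the cache
```

Creating the token costs nothing beyond storing the text; the analysis runs once, on the first attribute read, and later reads reuse the cached result.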

Installation

# Core (tokenization + morphology)
pip install mawo-core

# Full (with NER and syntax)
pip install "mawo-core[all]"

Quick Start

from mawo import Russian

# Create analyzer
nlp = Russian()

# Analyze text
doc = nlp("Александр Пушкин родился в Москве")

# Access tokens
for token in doc.tokens:
    print(token.text, token.lemma, token.pos, token.tag)

# Access entities (requires mawo-slovnet)
for ent in doc.entities:
    print(ent.text, ent.label)

# Access sentences
for sent in doc.sentences:
    print(sent.text)

Advanced Usage

Rich Token Objects

doc = nlp("Я читал интересную книгу")

for token in doc.tokens:
    # The values in the comments below are for the verb "читал"
    # Morphology (from pymorphy3)
    print(token.lemma)          # "читать"
    print(token.pos)            # "VERB"
    print(token.aspect)         # "imperfective"
    print(token.tense)          # "past"
    print(token.gender)         # "masc"

    # Syntax (from slovnet)
    print(token.dep)            # "ROOT"
    print(token.head)           # None

    # Context
    print(token.children)       # [книгу]
    print(token.ancestors)      # []

Adjective-Noun Pairs

doc = nlp("красивая дом")  # Deliberately ungrammatical: feminine adjective + masculine noun

for pair in doc.adjective_noun_pairs:
    print(pair.adjective)       # Token("красивая")
    print(pair.noun)            # Token("дом")
    print(pair.agreement)       # "incorrect"
    print(pair.gender_match)    # False
    print(pair.suggestion)      # "красивый дом"
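
The gender check behind `pair.gender_match` can be pictured with a small standalone sketch. This is an illustration under stated assumptions, not mawo-core's implementation: `Word` and `gender_match` are hypothetical names, and the grammatical tags are written by hand instead of coming from a morphological analyzer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    """Hypothetical container for a word and its hand-written tags."""
    text: str
    pos: str
    gender: Optional[str]  # "masc", "femn", "neut", or None if unknown


def gender_match(adjective: Word, noun: Word) -> bool:
    """Return True only when both words carry the same known gender."""
    if adjective.gender is None or noun.gender is None:
        return False
    return adjective.gender == noun.gender


# Tags are supplied manually here; a real pipeline would take them
# from the morphologizer.
adj = Word("красивая", "ADJ", "femn")
noun = Word("дом", "NOUN", "masc")

agreement = "correct" if gender_match(adj, noun) else "incorrect"
fixed_ok = gender_match(Word("красивый", "ADJ", "masc"), noun)
```

A complete agreement check for Russian would also compare number and case, and plural forms carry no gender, so this sketch covers only the singular gender comparison.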

Verb Aspects

doc = nlp("Я прочитал книгу")

for verb in doc.verbs:
    print(verb.word)            # "прочитал"
    print(verb.aspect)          # "perfective"
    print(verb.is_perfective)   # True
    print(verb.aspect_pair)     # "читать"

Custom Vocabulary

from mawo import Russian

nlp = Russian()

# Add single word
nlp.vocab.add("блокчейн",
    pos="NOUN",
    gender="masc",
    animacy="inan",
    tags={"domain": "IT"}
)

# Load domain dictionary
nlp.vocab.load_domain("IT")  # блокчейн, API, фреймворк...

# Load from file
nlp.vocab.load("tech_terms.txt")

# Now custom words work
doc = nlp("Блокчейн это технология")
print(doc.tokens[0].pos)  # "NOUN" (from custom vocab)
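
One way to picture "runtime word additions without DAWG rebuilding" is an in-memory overlay that is consulted before the compiled dictionary. The sketch below is an assumption about the general technique, not a description of how `nlp.vocab` works internally; `OverlayVocab` and its methods are hypothetical, and a plain dict stands in for the read-only base dictionary.

```python
from typing import Dict, Optional


class OverlayVocab:
    """Hypothetical vocabulary: custom entries shadow a read-only base."""

    def __init__(self, base: Dict[str, dict]) -> None:
        self._base = base                     # stands in for the compiled dictionary
        self._custom: Dict[str, dict] = {}    # runtime additions live here

    def add(self, word: str, **features: str) -> None:
        # The base dictionary is never modified or rebuilt.
        self._custom[word.lower()] = dict(features)

    def lookup(self, word: str) -> Optional[dict]:
        key = word.lower()
        if key in self._custom:               # custom entries take priority
            return self._custom[key]
        return self._base.get(key)


base = {"технология": {"pos": "NOUN", "gender": "femn"}}
vocab = OverlayVocab(base)
vocab.add("блокчейн", pos="NOUN", gender="masc", animacy="inan")

custom_hit = vocab.lookup("Блокчейн")    # found in the overlay
base_hit = vocab.lookup("технология")    # falls through to the base
miss = vocab.lookup("фреймворк")         # in neither dictionary
```

Because additions go into a separate mapping, they take effect immediately and the base data stays untouched.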

Custom Pipeline

from mawo import Pipeline

# Minimal pipeline (fast)
nlp = Pipeline([
    "tokenizer",      # razdel
    "morphologizer",  # pymorphy3
])

# Full pipeline
nlp = Pipeline([
    "tokenizer",
    "morphologizer",
    "ner",           # slovnet
    "parser",        # slovnet syntax
])

# Custom pipeline (MyCustomComponent stands for a component class you define yourself)
nlp = Pipeline([
    "tokenizer",
    ("custom", MyCustomComponent()),
    "morphologizer",
])
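
The README does not show what a custom component looks like, so the following is only a sketch of the general pattern: a pipeline that accepts either registered names or `(name, component)` tuples and calls each component in order. `MiniPipeline`, `REGISTRY`, `CountTokens`, and the dict-based document are all hypothetical and do not reflect mawo-core's actual component interface.

```python
from typing import Any, Callable, Dict, List, Tuple, Union

Doc = Dict[str, Any]
Component = Callable[[Doc], Doc]


def tokenizer(doc: Doc) -> Doc:
    # Naive whitespace split, standing in for a real tokenizer.
    doc["tokens"] = doc["text"].split()
    return doc


def lowercaser(doc: Doc) -> Doc:
    doc["tokens"] = [t.lower() for t in doc["tokens"]]
    return doc


# Built-in components are looked up by name (hypothetical registry).
REGISTRY: Dict[str, Component] = {"tokenizer": tokenizer, "lowercaser": lowercaser}


class MiniPipeline:
    def __init__(self, steps: List[Union[str, Tuple[str, Component]]]) -> None:
        self.steps: List[Tuple[str, Component]] = []
        for step in steps:
            if isinstance(step, str):
                self.steps.append((step, REGISTRY[step]))
            else:
                self.steps.append(step)  # already a (name, component) pair

    def __call__(self, text: str) -> Doc:
        doc: Doc = {"text": text}
        for _name, component in self.steps:
            doc = component(doc)
        return doc


class CountTokens:
    """Custom component: any callable that takes and returns the document."""

    def __call__(self, doc: Doc) -> Doc:
        doc["n_tokens"] = len(doc["tokens"])
        return doc


nlp = MiniPipeline(["tokenizer", ("custom", CountTokens()), "lowercaser"])
result = nlp("Александр Пушкин родился в Москве")
```

Order matters in this design: `CountTokens` must come after the tokenizer, because it reads the `tokens` key that the tokenizer writes.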

Entity Preservation

from mawo import Russian

nlp = Russian()

# Check entity preservation in translation
source = nlp("Alexander Pushkin was born in Moscow")
target = nlp("Александр Пушкин родился в Москве")

matches = nlp.match_entities(source, target)

for match in matches:
    print(match.source)         # Entity("Alexander Pushkin", "PER")
    print(match.target)         # Entity("Александр Пушкин", "PER")
    print(match.status)         # "matched"
    print(match.confidence)     # 0.95
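
How `match_entities` aligns entities is not documented here. As a rough illustration of the idea, the sketch below pairs source and target entities greedily by label. It is a simplified stand-in, not the library's algorithm: `Entity`, `Match`, `match_by_label`, and the `"missing"` status are hypothetical, and no confidence score is computed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    text: str
    label: str


@dataclass
class Match:
    source: Entity
    target: Optional[Entity]
    status: str  # "matched" or "missing" (the latter is an assumed value)


def match_by_label(source: List[Entity], target: List[Entity]) -> List[Match]:
    """Pair each source entity with the first unused target of the same label."""
    remaining = list(target)
    matches: List[Match] = []
    for ent in source:
        partner = next((t for t in remaining if t.label == ent.label), None)
        if partner is not None:
            remaining.remove(partner)  # each target entity is used at most once
            matches.append(Match(ent, partner, "matched"))
        else:
            matches.append(Match(ent, None, "missing"))
    return matches


source = [Entity("Alexander Pushkin", "PER"), Entity("Moscow", "LOC")]
target = [Entity("Александр Пушкин", "PER")]  # the location was lost in translation

matches = match_by_label(source, target)
statuses = [m.status for m in matches]
```

A label-only comparison cannot tell two people apart, so a production matcher would also need transliteration or a name dictionary to compare the entity texts themselves.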

Performance

  • Tokenization: ~5000 tokens/sec
  • Morphology: ~5000 words/sec
  • NER: ~1000 tokens/sec
  • Memory: ~60MB (with slovnet)
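
The figures above come without hardware or corpus details, so it is worth measuring throughput on your own data. Below is a minimal timing sketch; `tokens_per_second` and `whitespace_token_count` are hypothetical helpers, and a whitespace splitter stands in for the analyzer so the snippet runs without mawo-core installed.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(
    count_tokens: Callable[[str], int],
    texts: Sequence[str],
    repeats: int = 3,
) -> float:
    """Return tokens processed per second, using the fastest of several runs."""
    best = float("inf")
    total_tokens = 0
    for _ in range(repeats):
        start = time.perf_counter()
        total_tokens = sum(count_tokens(text) for text in texts)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading on very coarse timers.
    return total_tokens / best if best > 0 else float("inf")


def whitespace_token_count(text: str) -> int:
    # Stand-in workload; replace with a call into the real analyzer.
    return len(text.split())


texts = ["Александр Пушкин родился в Москве"] * 1000
rate = tokens_per_second(whitespace_token_count, texts)
print(f"{rate:,.0f} tokens/sec")
```

To time mawo-core itself, pass something like `lambda text: len(nlp(text).tokens)` as `count_tokens`, which relies only on the `doc.tokens` attribute shown in Quick Start.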

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Code quality
black .
ruff check .
mypy mawo

License

MIT License - see LICENSE for details.

Part of MAWO Ecosystem

Source Distribution

mawo_core-0.1.1.tar.gz (13.2 kB)

Uploaded Source

Built Distribution

mawo_core-0.1.1-py3-none-any.whl (11.5 kB)

Uploaded Python 3
