
Yaduha

A type-safe, AI-powered framework for structured language translation

(README documentation produced by Claude)

Yaduha is a Python framework for building translation systems that combine the power of Large Language Models (LLMs) with formal linguistic structures. It provides tools for creating grammatically-constrained translations with full type safety and verification.

Features

  • 🔧 Type-Safe Tool Framework: Build LLM-callable tools with strict parameter validation
  • 🤖 Agent Abstraction: Unified interface for AI agents (OpenAI, with extensibility for others)
  • 📝 Structured Sentences: Define language grammars as Pydantic models for guaranteed correctness
  • 🔄 Multiple Translation Strategies: Choose between pipeline-based or free-form agentic translation
  • ✅ Back-Translation Verification: Automatically verify translation quality
  • 📊 Token & Performance Tracking: Built-in monitoring for costs and latency
  • 🎯 Few-Shot Learning: Automatic example generation for better LLM performance

Quick Start

Installation

pip install yaduha

Or, for a development install from a source checkout:

pip install -e .

Set up your OpenAI API key

export OPENAI_API_KEY="your-api-key-here"

Basic Usage

from yaduha.translator.pipeline import PipelineTranslator
from yaduha.agent.openai import OpenAIAgent
from yaduha.language.ovp import SubjectVerbSentence, SubjectVerbObjectSentence

# Create a translator
translator = PipelineTranslator(
    agent=OpenAIAgent(
        model="gpt-4o-mini",
        api_key="your-api-key"
    ),
    SentenceType=(SubjectVerbObjectSentence, SubjectVerbSentence)
)

# Translate English to Owens Valley Paiute
result = translator("The dog is sleeping.")
print(f"Translation: {result.target}")
print(f"Back-translation: {result.back_translation.source}")
print(f"Tokens used: {result.prompt_tokens + result.completion_tokens}")

Architecture Overview

┌─────────────────┐
│  User Input     │
│  (English)      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Translator    │ ◄─── Pipeline or Agentic
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Tools       │ ◄─── EnglishToSentences, SentenceToEnglish
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Agent       │ ◄─── OpenAI (gpt-4o, gpt-4o-mini)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Structured    │
│   Sentences     │ ◄─── SubjectVerbSentence, SubjectVerbObjectSentence
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Target Lang    │
│  (e.g., OVP)    │
└─────────────────┘

Core Concepts

1. Agents

Agents are AI models that can generate text and call tools. Yaduha provides an OpenAI implementation with full type safety:

from yaduha.agent.openai import OpenAIAgent

agent = OpenAIAgent(
    model="gpt-4o-mini",
    api_key="...",
    temperature=0.0  # For deterministic outputs
)

2. Tools

Tools are callable functions that LLMs can use. They have strict type validation and automatic schema generation:

from yaduha.tool import Tool
from typing import ClassVar, List, Dict

class SearchTool(Tool):
    name: ClassVar[str] = "search"
    description: ClassVar[str] = "Search for information"

    def _run(self, query: str, limit: int = 5) -> List[Dict]:
        # Tool implementation
        return [{"result": "..."}]
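To make the "automatic schema generation" idea concrete, here is a standard-library-only sketch of how a framework can derive a JSON-Schema-style parameter spec from a method signature. The `tool_schema` helper and `PY_TO_JSON` mapping are hypothetical, not Yaduha's actual internals:

```python
import inspect
from typing import get_type_hints

# Hypothetical helper (not Yaduha's actual code): map Python annotations to
# JSON Schema type names.
PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}

def tool_schema(fn):
    """Derive a JSON-Schema-style parameter spec from fn's signature."""
    sig = inspect.signature(fn)
    hints = get_type_hints(fn)
    properties, required = {}, []
    for name, param in sig.parameters.items():
        if name == "self":
            continue
        properties[name] = {"type": PY_TO_JSON.get(hints.get(name), "string")}
        # Parameters without defaults are required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {"type": "object", "properties": properties, "required": required}

def search(query: str, limit: int = 5):
    """A stand-in tool implementation."""
    return []

schema = tool_schema(search)
print(schema["required"])             # ['query']
print(schema["properties"]["limit"])  # {'type': 'integer'}
```

This is the spec shape the OpenAI tool-calling API expects, which is why strict typing on `_run` pays off.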

3. Structured Sentences

Define language grammars as Pydantic models:

from yaduha.language import Sentence

# Noun and Verb are placeholders for your language's word models
class MyLanguageSentence(Sentence):
    subject: Noun
    verb: Verb

    def __str__(self) -> str:
        # Convert to target language representation
        return f"{self.subject.target}-{self.verb.target}"

    @classmethod
    def get_examples(cls):
        return [
            ("I sleep", MyLanguageSentence(subject=..., verb=...)),
            # More examples...
        ]

4. Translators

Two translation strategies are provided:

Pipeline Translator (Structured)

Guarantees grammatical correctness by constraining output to defined sentence structures:

translator = PipelineTranslator(
    agent=agent,
    SentenceType=(SentenceType1, SentenceType2)  # Can mix multiple types
)

Agentic Translator (Free-form)

Uses LLM reasoning with optional tool assistance for flexible translation:

translator = AgenticTranslator(
    agent=agent,
    system_prompt="You are a translation expert...",
    tools=[SearchTool(), DictionaryTool(), PipelineTranslator(...)]
)

Correctness-First Translation with Structured Outputs

Yaduha implements LLM-Assisted Rule-Based Machine Translation (LLM-RBMT), a novel paradigm designed specifically for no-resource and extremely low-resource languages where traditional neural MT approaches fail due to lack of parallel corpora. Rather than relying on unconstrained text generation, Yaduha leverages Pydantic models as linguistic constraints to guarantee grammatical correctness while harnessing the semantic understanding of large language models.

The Core Innovation: Pydantic Models as Linguistic Grammars

In Yaduha, every grammatical structure (such as SubjectVerbSentence or SubjectVerbObjectSentence) is defined as a Pydantic model that explicitly encodes the syntactic and morphological rules of the target language. These models act as type-safe grammars where each field corresponds to a validated linguistic feature:

  • Part-of-speech categories (Subject, Verb, Object)
  • Morphological features (Person, Plurality, Proximity, Inclusivity)
  • Tense-aspect systems (TenseAspect: past/present/future, simple/continuous/perfect)
  • Language-specific constraints (e.g., fortis/lenis consonant mutation in OVP)

This structured representation enables what we call rule-based sentence synthesis: the LLM never needs to "know" the target language directly. Instead, it acts as a syntactic and structural intermediary, decomposing natural English into structured forms that our grammatical rules can then synthesize into the target language.
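The synthesis step can be sketched using the attested forms shown later in this README ("I sleep" → "nüü üwi-dü"). Everything here, including the tiny lexicon, the `synthesize_sv` helper, and treating "-dü" as a fixed present-tense suffix, is a simplification for illustration rather than Yaduha's implementation:

```python
# Simplified for illustration; not Yaduha's implementation. Lexicon entries
# come from the examples in this README; "-dü" is treated here as a fixed
# present-tense suffix.
SUBJECT_LEXICON = {"I": "nüü", "you": "üü"}
VERB_LEXICON = {"sleep": "üwi"}

def synthesize_sv(subject: str, verb: str) -> str:
    """Render structured subject-verb fields into an OVP-style string."""
    return f"{SUBJECT_LEXICON[subject]} {VERB_LEXICON[verb]}-dü"

# The LLM's only job is to produce {"subject": "I", "verb": "sleep"};
# the deterministic rule does the rest.
print(synthesize_sv("I", "sleep"))  # nüü üwi-dü
```

The point is the division of labor: semantic decomposition is probabilistic, but the final rendering is rule-governed and therefore always well-formed.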

How It Works: Structured Outputs via Constrained Decoding

Yaduha leverages OpenAI's Structured Outputs feature (also called constrained decoding) to force the LLM to output responses conforming exactly to our Pydantic schemas. Here's the process:

  1. Schema Generation: Pydantic models are automatically converted to JSON Schema definitions
  2. Constrained Generation: The LLM generates outputs that are guaranteed to conform to the schema
  3. Automatic Validation: Responses are validated at runtime, ensuring grammatical correctness
  4. Sentence Synthesis: Valid structured data is rendered into the target language using linguistic rules
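Steps 1 and 3 can be sketched without any LLM at all. The hand-rolled schema and `validate` function below are illustrative stand-ins for what Pydantic's `model_json_schema()` and `model_validate()` provide automatically:

```python
# Hand-rolled stand-ins for what Pydantic's model_json_schema() /
# model_validate() provide automatically in Yaduha.
SV_SCHEMA = {
    "required": {"subject", "verb", "tense"},
    "enums": {"tense": {"past", "present", "future"}},
}

def validate(data: dict, schema: dict) -> bool:
    """Accept only responses with all required fields and legal enum values."""
    if not schema["required"] <= data.keys():
        return False
    return all(data[field] in allowed for field, allowed in schema["enums"].items())

assert validate({"subject": "dog", "verb": "sleep", "tense": "present"}, SV_SCHEMA)
assert not validate({"subject": "dog", "verb": "sleep", "tense": "yesterday"}, SV_SCHEMA)
assert not validate({"subject": "dog"}, SV_SCHEMA)
```

With constrained decoding, the model is prevented from producing the invalid cases in the first place; runtime validation is the safety net.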

For example:

from yaduha.translator.pipeline import PipelineTranslator
from yaduha.agent.openai import OpenAIAgent
from yaduha.language.ovp import SubjectVerbSentence, SubjectVerbObjectSentence

translator = PipelineTranslator(
    agent=OpenAIAgent(model="gpt-4o-mini"),
    SentenceType=(SubjectVerbObjectSentence, SubjectVerbSentence)
)

# Input: Complex English sentence
result = translator("The dog is sitting at the lakeside, drinking some water.")

# Output: Grammatically valid OVP sentence(s)
print(f"OVP: {result.target}")
print(f"Back-translation: {result.back_translation.source}")

Behind the scenes, the LLM performs sentence segmentation: breaking down the input into simple SV/SVO structures that match our defined sentence types. Each segment is then validated against the Pydantic schema, ensuring every generated sentence is well-formed according to OVP's grammatical rules.

Why This Matters for Endangered Languages

This correctness-first approach is particularly crucial for endangered and no-resource languages because:

  • No parallel data required: The system works with only a lexicon and grammatical rules; no bilingual corpus is needed
  • Guaranteed grammatical validity: Every output is structurally correct by construction
  • Suitable for language learning: Learners can trust the grammatical correctness of generated sentences
  • Extensible: Adding new vocabulary or grammatical patterns is straightforward
  • Transparent: The structured intermediate representation is human-readable and debuggable

Learn More

For more information, including the evaluation methodology and empirical results, please read our paper:

📄 LLM-Assisted Rule Based Machine Translation for Low/No-Resource Languages

Currently Supported Languages

Owens Valley Paiute (OVP)

Yaduha includes a complete implementation for Owens Valley Paiute, a Uto-Aztecan language:

  • 37 nouns (coyote, dog, water, mountain, etc.)
  • 35 verbs (14 transitive, 21 intransitive)
  • Full pronoun system (person, number, proximity, inclusivity)
  • Tense/aspect system (6 tenses: past simple/continuous, present simple/continuous/perfect, future)
  • Complex morphology (fortis/lenis consonant mutation, proximity-based suffixes)

Sentence structures:

  • Subject-Verb: "I sleep" → "nüü üwi-dü"
  • Subject-Verb-Object: "You read the mountains" → "üü toyabi-noka ui-nia-dü"

Examples

Example 1: Basic Translation

from yaduha.translator.pipeline import PipelineTranslator
from yaduha.agent.openai import OpenAIAgent
from yaduha.language.ovp import SubjectVerbObjectSentence
import os

translator = PipelineTranslator(
    agent=OpenAIAgent(
        model="gpt-4o-mini",
        api_key=os.environ["OPENAI_API_KEY"]
    ),
    SentenceType=SubjectVerbObjectSentence
)

result = translator("The cat drinks water")
print(f"English: {result.source}")
print(f"OVP: {result.target}")
print(f"Verification: {result.back_translation.source}")

Example 2: Custom Tools

from typing import ClassVar, Dict, List

import requests

from yaduha.agent.openai import OpenAIAgent
from yaduha.tool import Tool
from yaduha.translator.agentic import AgenticTranslator

class DictionaryTool(Tool):
    name: ClassVar[str] = "dictionary_lookup"
    description: ClassVar[str] = "Look up word translations"

    def _run(self, word: str) -> List[Dict]:
        response = requests.get("https://api.example.com/lookup", params={"word": word})
        return response.json()

translator = AgenticTranslator(
    agent=OpenAIAgent(model="gpt-4o-mini"),
    tools=[DictionaryTool()]
)

result = translator("How do you say 'hello' in Paiute?")
print(result.target)
print(f"Confidence: {result.metadata['confidence_level']}")

Example 3: Token Tracking

result = translator("A complex sentence to translate")

print(f"Translation time: {result.translation_time:.2f}s")
print(f"Forward tokens: {result.prompt_tokens + result.completion_tokens}")
print(f"Back-translation tokens: {result.back_translation.prompt_tokens + result.back_translation.completion_tokens}")
print(f"Total cost (approx): ${(result.prompt_tokens * 0.15 + result.completion_tokens * 0.60) / 1_000_000:.4f}")
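The cost line above hard-codes gpt-4o-mini's prices of $0.15 per million input tokens and $0.60 per million output tokens. A small helper makes that assumption explicit; the figures mirror the snippet above, not a live price list, and will drift as pricing changes:

```python
# Illustrative cost helper. Prices are USD per million tokens and WILL drift;
# the gpt-4o-mini figures match the snippet above, not a live price list.
PRICES_PER_MTOK = {"gpt-4o-mini": (0.15, 0.60)}  # (input, output)

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Approximate USD cost of one model call."""
    inp, out = PRICES_PER_MTOK[model]
    return (prompt_tokens * inp + completion_tokens * out) / 1_000_000

print(f"${estimate_cost('gpt-4o-mini', 1200, 300):.4f}")  # $0.0004
```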

Documentation

Project Structure

yaduha-2/
├── yaduha/                    # Main package
│   ├── agent/                 # AI agent abstraction
│   │   ├── __init__.py
│   │   └── openai.py
│   ├── tool/                  # Tool framework
│   │   ├── __init__.py
│   │   ├── english_to_sentences.py
│   │   └── sentence_to_english.py
│   ├── translator/            # Translation strategies
│   │   ├── __init__.py
│   │   ├── pipeline.py
│   │   └── agentic.py
│   └── language/              # Language implementations
│       ├── __init__.py
│       └── ovp/               # Owens Valley Paiute
│           ├── __init__.py
│           ├── vocab.py
│           └── prompts.py
├── scripts/                   # Example scripts
├── docs/                      # Documentation
└── setup.py

Development

Please find the Development Documentation at docs/index.md.

Running Tests

# Test pipeline translator
python scripts/test_pipeline_translator.py

# Test agentic translator
python scripts/test_agentic_translator.py

# Test agent functionality
python scripts/test_agent.py

Citation

If you use Yaduha in your research, please cite:

@software{yaduha2024,
  title={Yaduha: A Type-Safe Framework for Structured Language Translation},
  author={[Your Name]},
  year={2024},
  url={https://github.com/[your-username]/yaduha}
}
