Yaduha
A type-safe, AI-powered framework for structured language translation
(readme documentation produced by Claude)
Yaduha is a Python framework for building translation systems that combine the power of Large Language Models (LLMs) with formal linguistic structures. It provides tools for creating grammatically-constrained translations with full type safety and verification.
Features
- **Type-Safe Tool Framework**: Build LLM-callable tools with strict parameter validation
- **Agent Abstraction**: Unified interface for AI agents (OpenAI, with extensibility for others)
- **Structured Sentences**: Define language grammars as Pydantic models for guaranteed correctness
- **Multiple Translation Strategies**: Choose between pipeline-based or free-form agentic translation
- **Back-Translation Verification**: Automatically verify translation quality
- **Token & Performance Tracking**: Built-in monitoring for costs and latency
- **Few-Shot Learning**: Automatic example generation for better LLM performance
Quick Start
Installation

```shell
pip install -e .
```

Set up your OpenAI API key:

```shell
export OPENAI_API_KEY="your-api-key-here"
```
Basic Usage
```python
from yaduha.translator.pipeline import PipelineTranslator
from yaduha.agent.openai import OpenAIAgent
from yaduha.language.ovp import SubjectVerbSentence, SubjectVerbObjectSentence

# Create a translator
translator = PipelineTranslator(
    agent=OpenAIAgent(
        model="gpt-4o-mini",
        api_key="your-api-key"
    ),
    SentenceType=(SubjectVerbObjectSentence, SubjectVerbSentence)
)

# Translate English to Owens Valley Paiute
result = translator("The dog is sleeping.")
print(f"Translation: {result.target}")
print(f"Back-translation: {result.back_translation.source}")
print(f"Tokens used: {result.prompt_tokens + result.completion_tokens}")
```
Architecture Overview
```
┌─────────────────┐
│   User Input    │
│   (English)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Translator    │ ◄── Pipeline or Agentic
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Tools       │ ◄── EnglishToSentences, SentenceToEnglish
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Agent       │ ◄── OpenAI (gpt-4o, gpt-4o-mini)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Structured    │
│   Sentences     │ ◄── SubjectVerbSentence, SubjectVerbObjectSentence
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Target Lang   │
│   (e.g., OVP)   │
└─────────────────┘
```
Core Concepts
1. Agents
Agents are AI models that can generate text and call tools. Yaduha provides an OpenAI implementation with full type safety:
```python
from yaduha.agent.openai import OpenAIAgent

agent = OpenAIAgent(
    model="gpt-4o-mini",
    api_key="...",
    temperature=0.0  # For deterministic outputs
)
```
2. Tools
Tools are callable functions that LLMs can use. They have strict type validation and automatic schema generation:
```python
from typing import ClassVar, Dict, List

from yaduha.tool import Tool

class SearchTool(Tool):
    name: ClassVar[str] = "search"
    description: ClassVar[str] = "Search for information"

    def _run(self, query: str, limit: int = 5) -> List[Dict]:
        # Tool implementation
        return [{"result": "..."}]
```
3. Structured Sentences
Define language grammars as Pydantic models:
```python
from yaduha.language import Sentence

class MyLanguageSentence(Sentence):
    subject: Noun
    verb: Verb

    def __str__(self) -> str:
        # Convert to target language representation
        return f"{self.subject.target}-{self.verb.target}"

    @classmethod
    def get_examples(cls):
        return [
            ("I sleep", MyLanguageSentence(subject=..., verb=...)),
            # More examples...
        ]
```
4. Translators
Two translation strategies are provided:
Pipeline Translator (Structured)
Guarantees grammatical correctness by constraining output to defined sentence structures:
```python
translator = PipelineTranslator(
    agent=agent,
    SentenceType=(SentenceType1, SentenceType2)  # Can mix multiple types
)
```
Agentic Translator (Free-form)
Uses LLM reasoning with optional tool assistance for flexible translation:
```python
translator = AgenticTranslator(
    agent=agent,
    system_prompt="You are a translation expert...",
    tools=[SearchTool(), DictionaryTool(), PipelineTranslator(...)]
)
```
Correctness-First Translation with Structured Outputs
Yaduha implements LLM-Assisted Rule-Based Machine Translation (LLM-RBMT), a novel paradigm designed specifically for no-resource and extremely low-resource languages where traditional neural MT approaches fail due to lack of parallel corpora. Rather than relying on unconstrained text generation, Yaduha leverages Pydantic models as linguistic constraints to guarantee grammatical correctness while harnessing the semantic understanding of large language models.
The Core Innovation: Pydantic Models as Linguistic Grammars
In Yaduha, every grammatical structure (such as SubjectVerbSentence or SubjectVerbObjectSentence) is defined as a Pydantic model that explicitly encodes the syntactic and morphological rules of the target language. These models act as type-safe grammars where each field corresponds to a validated linguistic feature:
- Part-of-speech categories (`Subject`, `Verb`, `Object`)
- Morphological features (`Person`, `Plurality`, `Proximity`, `Inclusivity`)
- Tense-aspect systems (`TenseAspect`: past/present/future, simple/continuous/perfect)
- Language-specific constraints (e.g., fortis/lenis consonant mutation in OVP)
This structured representation enables what we call rule-based sentence synthesis: the LLM never needs to "know" the target language directly. Instead, it acts as a syntactic and structural intermediary, decomposing natural English into structured forms that our grammatical rules can then synthesize into the target language.
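The decomposition-then-synthesis idea can be sketched with plain dataclasses. Everything below is illustrative, not the actual Yaduha API: the stems are made-up placeholders (not real OVP vocabulary), and `synthesize` stands in for the real rule engine.

```python
from dataclasses import dataclass

# Toy lexicon mapping English lemmas to made-up target-language stems.
LEXICON = {"dog": "toza", "sleep": "wuki"}

@dataclass
class SVStructure:
    """Structured intermediate form the LLM fills in: it names lexicon
    entries but never writes target-language text directly."""
    subject: str
    verb: str

def synthesize(s: SVStructure) -> str:
    # Deterministic rules render the structure into the target language;
    # "-ti" is a stand-in for a real tense marker.
    return f"{LEXICON[s.subject]} {LEXICON[s.verb]}-ti"

print(synthesize(SVStructure(subject="dog", verb="sleep")))  # toza wuki-ti
```

The key property this illustrates: the LLM's only job is filling in `SVStructure`; the target-language surface form comes entirely from the deterministic rules.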
How It Works: Structured Outputs via Constrained Decoding
Yaduha leverages OpenAI's Structured Outputs feature (also called constrained decoding) to force the LLM to output responses conforming exactly to our Pydantic schemas. Here's the process:
- Schema Generation: Pydantic models are automatically converted to JSON Schema definitions
- Constrained Generation: The LLM generates outputs that are guaranteed to conform to the schema
- Automatic Validation: Responses are validated at runtime, ensuring grammatical correctness
- Sentence Synthesis: Valid structured data is rendered into the target language using linguistic rules
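The runtime-validation step can be sketched with a stdlib-only checker. The schema dict below is a hand-written stand-in for what Pydantic would emit as JSON Schema, and `conforms` is a hypothetical helper, not part of Yaduha:

```python
# Hand-written JSON Schema in the shape Pydantic would generate for an
# SV sentence with a closed lexicon (values here are illustrative).
SV_SCHEMA = {
    "type": "object",
    "properties": {
        "subject": {"type": "string", "enum": ["dog", "cat"]},
        "verb": {"type": "string", "enum": ["sleep", "run"]},
    },
    "required": ["subject", "verb"],
}

def conforms(data: dict, schema: dict) -> bool:
    """Minimal structural check: required keys present, values in their enums."""
    for key in schema["required"]:
        if key not in data:
            return False
    for key, spec in schema["properties"].items():
        if key in data and data[key] not in spec["enum"]:
            return False
    return True

print(conforms({"subject": "dog", "verb": "sleep"}, SV_SCHEMA))  # True
print(conforms({"subject": "dog", "verb": "fly"}, SV_SCHEMA))    # False
```

With OpenAI's Structured Outputs, conformance is enforced during decoding, so the validation step is a safety net rather than a filter.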
For example:
```python
from yaduha.translator.pipeline import PipelineTranslator
from yaduha.agent.openai import OpenAIAgent
from yaduha.language.ovp import SubjectVerbSentence, SubjectVerbObjectSentence

translator = PipelineTranslator(
    agent=OpenAIAgent(model="gpt-4o-mini"),
    SentenceType=(SubjectVerbObjectSentence, SubjectVerbSentence)
)

# Input: Complex English sentence
result = translator("The dog is sitting at the lakeside, drinking some water.")

# Output: Grammatically valid OVP sentence(s)
print(f"OVP: {result.target}")
print(f"Back-translation: {result.back_translation.source}")
```
Behind the scenes, the LLM performs sentence segmentation: breaking down the input into simple SV/SVO structures that match our defined sentence types. Each segment is then validated against the Pydantic schema, ensuring every generated sentence is well-formed according to OVP's grammatical rules.
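A naive, regex-based stand-in can make the segmentation idea concrete. In Yaduha the LLM performs this step with semantic understanding; the `segment` helper below is purely illustrative:

```python
import re

def segment(sentence: str) -> list:
    """Split a complex English sentence into candidate simple clauses
    at commas and 'and', before mapping each onto an SV/SVO schema."""
    parts = re.split(r",\s*|\band\b", sentence.rstrip("."))
    return [p.strip() for p in parts if p.strip()]

print(segment("The dog is sitting at the lakeside, drinking some water."))
# ['The dog is sitting at the lakeside', 'drinking some water']
```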
Why This Matters for Endangered Languages
This correctness-first approach is particularly crucial for endangered and no-resource languages because:
- No parallel data required: The system works with only a lexicon and grammatical rules; no bilingual corpus needed
- Guaranteed grammatical validity: Every output is structurally correct by construction
- Suitable for language learning: Learners can trust the grammatical correctness of generated sentences
- Extensible: Adding new vocabulary or grammatical patterns is straightforward
- Transparent: The structured intermediate representation is human-readable and debuggable
Learn More
For more information, including the evaluation methodology and empirical results for this approach, please read our paper:

LLM-Assisted Rule Based Machine Translation for Low/No-Resource Languages
Currently Supported Languages
Owens Valley Paiute (OVP)
Yaduha includes a complete implementation for Owens Valley Paiute, a Uto-Aztecan language:
- 37 nouns (coyote, dog, water, mountain, etc.)
- 35 verbs (14 transitive, 21 intransitive)
- Full pronoun system (person, number, proximity, inclusivity)
- Tense/aspect system (6 tenses: past simple/continuous, present simple/continuous/perfect, future)
- Complex morphology (fortis/lenis consonant mutation, proximity-based suffixes)
Sentence structures:
- Subject-Verb: "I sleep" → "nüü üwi-dü"
- Subject-Verb-Object: "You read the mountains" → "üü toyabi-noka ui-nia-dü"
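The suffixing pattern in these examples can be sketched as a toy inflection rule. The "-dü" marker appears in the examples above; the "-ku" past marker is a made-up placeholder, and the real OVP morphology (fortis/lenis mutation, proximity suffixes) in `yaduha.language.ovp` is considerably more involved:

```python
# Toy tense-suffix table: "present" matches the "-dü" marker shown above;
# "past" -> "-ku" is invented for illustration only.
TENSE_SUFFIX = {"present": "dü", "past": "ku"}

def inflect(stem: str, tense: str) -> str:
    """Attach a tense suffix to a verb stem with a hyphen."""
    return f"{stem}-{TENSE_SUFFIX[tense]}"

print(inflect("üwi", "present"))  # üwi-dü, as in "nüü üwi-dü" above
```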
Examples
Example 1: Basic Translation
```python
import os

from yaduha.translator.pipeline import PipelineTranslator
from yaduha.agent.openai import OpenAIAgent
from yaduha.language.ovp import SubjectVerbObjectSentence

translator = PipelineTranslator(
    agent=OpenAIAgent(
        model="gpt-4o-mini",
        api_key=os.environ["OPENAI_API_KEY"]
    ),
    SentenceType=SubjectVerbObjectSentence
)

result = translator("The cat drinks water")
print(f"English: {result.source}")
print(f"OVP: {result.target}")
print(f"Verification: {result.back_translation.source}")
```
Example 2: Custom Tools
```python
from typing import ClassVar, Dict, List

import requests

from yaduha.translator.agentic import AgenticTranslator
from yaduha.tool import Tool

class DictionaryTool(Tool):
    name: ClassVar[str] = "dictionary_lookup"
    description: ClassVar[str] = "Look up word translations"

    def _run(self, word: str) -> List[Dict]:
        response = requests.get(f"https://api.example.com/lookup?word={word}")
        return response.json()

translator = AgenticTranslator(
    agent=agent,
    tools=[DictionaryTool()]
)

result = translator("How do you say 'hello' in Paiute?")
print(result.target)
print(f"Confidence: {result.metadata['confidence_level']}")
```
Example 3: Token Tracking
```python
result = translator("A complex sentence to translate")

print(f"Translation time: {result.translation_time:.2f}s")
print(f"Forward tokens: {result.prompt_tokens + result.completion_tokens}")
print(f"Back-translation tokens: {result.back_translation.prompt_tokens + result.back_translation.completion_tokens}")

# Rates below assume gpt-4o-mini pricing: $0.15 / $0.60 per 1M prompt/completion tokens
print(f"Total cost (approx): ${(result.prompt_tokens * 0.15 + result.completion_tokens * 0.60) / 1_000_000:.4f}")
```
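The cost arithmetic above can be factored into a small helper. `estimate_cost` is a hypothetical convenience function (not part of Yaduha), with defaults matching the gpt-4o-mini rates assumed above:

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  prompt_rate: float = 0.15, completion_rate: float = 0.60) -> float:
    """Approximate USD cost for one call; rates are $ per 1M tokens."""
    return (prompt_tokens * prompt_rate
            + completion_tokens * completion_rate) / 1_000_000

print(f"${estimate_cost(1200, 300):.4f}")  # $0.0004
```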
Documentation
- Getting Started Guide
- Architecture Overview
- API Reference
- Creating Custom Languages
- Building Tools
- Examples & Tutorials
Project Structure
```
yaduha-2/
├── yaduha/                  # Main package
│   ├── agent/               # AI agent abstraction
│   │   ├── __init__.py
│   │   └── openai.py
│   ├── tool/                # Tool framework
│   │   ├── __init__.py
│   │   ├── english_to_sentences.py
│   │   └── sentence_to_english.py
│   ├── translator/          # Translation strategies
│   │   ├── __init__.py
│   │   ├── pipeline.py
│   │   └── agentic.py
│   └── language/            # Language implementations
│       ├── __init__.py
│       └── ovp/             # Owens Valley Paiute
│           ├── __init__.py
│           ├── vocab.py
│           └── prompts.py
├── scripts/                 # Example scripts
├── docs/                    # Documentation
└── setup.py
```
Development
Please find the Development Documentation at docs/index.md.
Running Tests
```shell
# Test pipeline translator
python scripts/test_pipeline_translator.py

# Test agentic translator
python scripts/test_agentic_translator.py

# Test agent functionality
python scripts/test_agent.py
```
Citation
If you use Yaduha in your research, please cite:
```bibtex
@software{yaduha2024,
  title={Yaduha: A Type-Safe Framework for Structured Language Translation},
  author={[Your Name]},
  year={2024},
  url={https://github.com/[your-username]/yaduha}
}
```