A dual Python/TypeScript library for Japanese text parsing and encoding using kotogram format
Kotogram
What is this?
Ever wondered if a Japanese sentence sounds too formal, or whether that sentence-ending particle makes it sound masculine? Kotogram is a lightweight NLP library that analyzes the style of Japanese text: its formality, gender markers, register, and grammaticality.
While excellent tools like MeCab and Sudachi focus on morphological analysis (breaking text into tokens and identifying parts of speech), Kotogram takes things a step further by analyzing the social and stylistic dimensions of Japanese text:
- Formality: Is this casual banter or keigo? (And is it mixing them inappropriately?)
- Gender: Does this use masculine (俺だぜ), feminine (〜わ), or neutral speech patterns?
- Register: Kansai-ben? Internet slang? Honorific language? Military commands?
- Grammaticality: Is this sentence well-formed, or a common learner mistake?
The whole thing runs on a compact 7MB neural model and works in both Python (for the ML inference) and TypeScript (for working with the kotogram format).
Quick Examples
Let's see it in action! The bin/kotogram grammar command analyzes any Japanese text:
Detecting Formality
$ bin/kotogram grammar "お疲れ様でございます"
{
"kotogram": "⌈ˢお疲れ様ᵖnoun:common-noun:adjectival-noun-possibleʳオツカレサマ⌉⌈ˢでᵖaux-verb:aux-da:continuativeᵇだᵈだʳデ⌉⌈ˢございᵖverb:bound:godan-ra:continuative-i-euphonicᵇござるᵈござるʳゴザイ⌉⌈ˢますᵖaux-verb:aux-masu:terminalʳマス⌉",
"formality": "formal",
"formality_score": 0.5010958909988403,
"formality_is_pragmatic": true,
"gender": "neutral",
"gender_score": 0.0007681779679842293,
"gender_is_pragmatic": true,
"registers": [
"neutral"
],
"register_scores": {
"neutral": 0.9213598966598511
},
"is_grammatic": true,
"grammaticality_score": 0.9999127388000488
}
The kotogram field in the output shows how the sentence gets internally represented. Here's what one token looks like when you break it down:
⌈ˢございᵖverb:bound:godan-ra:continuative-i-euphonicᵇござるᵈござるʳゴザイ⌉
Reading the token from left to right:
- ⌈ ⌉: token boundaries
- ˢ: surface form (ござい)
- ᵖ: part-of-speech + conjugation (verb:bound:godan-ra:continuative-i-euphonic)
- ᵇ: base form (ござる)
- ᵈ: lemma (ござる)
- ʳ: pronunciation (ゴザイ)
Pretty neat how much linguistic information we can pack into a compact format, right?
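To make the token layout concrete, here is a minimal sketch that pulls the marked fields out of a kotogram string with regular expressions. It is an illustration of the format as shown above, not the library's own parser; the function name `parse_tokens` and the field labels are assumptions made for this example.

```python
import re

# Marker characters copied from the token breakdown above. The dictionary
# values are labels chosen for this sketch, not library identifiers.
MARKERS = {"ˢ": "surface", "ᵖ": "pos", "ᵇ": "base", "ᵈ": "lemma", "ʳ": "pronunciation"}

TOKEN_RE = re.compile(r"⌈([^⌉]*)⌉")               # text between token boundaries
FIELD_RE = re.compile(r"([ˢᵖᵇᵈʳ])([^ˢᵖᵇᵈʳ]*)")    # a marker followed by its value


def parse_tokens(kotogram: str) -> list[dict[str, str]]:
    """Split a kotogram string into one dict of fields per token."""
    tokens = []
    for body in TOKEN_RE.findall(kotogram):
        tokens.append({MARKERS[m]: value for m, value in FIELD_RE.findall(body)})
    return tokens


sample = (
    "⌈ˢございᵖverb:bound:godan-ra:continuative-i-euphonicᵇござるᵈござるʳゴザイ⌉"
    "⌈ˢますᵖaux-verb:aux-masu:terminalʳマス⌉"
)
for token in parse_tokens(sample):
    # Not every token carries a base form, so fall back to "-" when it is absent.
    print(token["surface"], token["pos"].split(":")[0], token.get("base", "-"))
```

Fields that a token omits (for example ます has no ᵇ or ᵈ marker here) are simply missing from its dict, so callers should use `.get()` for optional fields.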
Gender Detection
$ bin/kotogram grammar "あら、素敵ですわ"
{
"kotogram": "⌈ˢあらᵖinterj:generalʳアラ⌉⌈ˢ、ᵖaux-symbol:comma⌉⌈ˢ素敵ᵖadjectival-noun:generalʳステキ⌉⌈ˢですᵖaux-verb:aux-desu:terminalʳデス⌉⌈ˢわᵖparticle:sentence-final-particleʳワ⌉",
"formality": "formal",
"formality_score": 0.5490256547927856,
"formality_is_pragmatic": true,
"gender": "feminine",
"gender_score": 0.9999998211860657,
"gender_is_pragmatic": true,
"registers": [
"ojousama"
],
"register_scores": {
"ojousama": 0.9900707602500916
},
"is_grammatic": true,
"grammaticality_score": 0.999970555305481
}
The model picks up on that sentence-final わ (wa) and correctly identifies this as ojousama-style speech (refined, upper-class feminine Japanese). The gender score of 0.9999998 means the model is extremely confident about the feminine markers.
Catching Subtle Awkwardness
Here's a more subtle issue — a sentence that's technically parseable but semantically awkward:
$ bin/kotogram grammar "大きくない小さい"
{
"kotogram": "⌈ˢ大きくᵖadj:general:i-adjective:continuativeᵇ大きいᵈ大きいʳオオキク⌉⌈ˢないᵖadj:bound:i-adjective:terminalʳナイ⌉⌈ˢ小さいᵖadj:general:i-adjective:terminalʳチイサイ⌉",
"formality": "neutral",
"formality_score": -0.00582164479419589,
"formality_is_pragmatic": true,
"gender": "neutral",
"gender_score": -0.0024029570631682873,
"gender_is_pragmatic": true,
"registers": [
"neutral"
],
"register_scores": {
"neutral": 0.9790019989013672
},
"is_grammatic": false,
"grammaticality_score": 0.1085873544216156
}
Why this is awkward: This literally means "not-big small" — grammatically parseable, but semantically redundant. While you can stack adjectives in Japanese, saying "not big small" is unnatural because 小さい (chiisai, small) already implies "not big."
Japanese highly values concision (簡潔さ). The natural way to express this would be simply:
- Concise: 小さい (chiisai) — "small"
- Or with emphasis: 大きくない (ookikunai) — "not big"
This kind of redundant negation occasionally appears in learner speech when they're trying to be emphatic but end up being unnecessarily verbose. The model's grammaticality score of 0.108 (pretty low, but not zero) reflects that while the syntax parses, the semantic redundancy makes it sound distinctly non-native.
Detecting Unpragmatic Mixing
Here's an interesting one — a sentence that's grammatically parseable but stylistically bizarre:
$ bin/kotogram grammar "食べたんだぜです"
{
"kotogram": "⌈ˢ食べᵖverb:general:lower-ichidan-ba:continuativeᵇ食べるᵈ食べるʳタベ⌉⌈ˢたᵖaux-verb:aux-ta:attributiveʳタ⌉⌈ˢんᵖparticle:nominal-particleʳン⌉⌈ˢだᵖaux-verb:aux-da:terminalʳダ⌉⌈ˢぜᵖparticle:sentence-final-particleʳゼ⌉⌈ˢですᵖaux-verb:aux-desu:terminalʳデス⌉",
"formality": "unpragmatic_formality",
"formality_score": 0.3184594213962555,
"formality_is_pragmatic": false,
"gender": "masculine",
"gender_score": -0.9999995827674866,
"gender_is_pragmatic": true,
"registers": [
"danseigo"
],
"register_scores": {
"danseigo": 0.9998853206634521
},
"is_grammatic": false,
"grammaticality_score": 2.01202964879299e-12
}
Why is this unpragmatic? It mixes ぜ (ze, a rough masculine sentence-ender) with です (desu, formal copula). In Japanese, you need to pick a formality register and stick with it throughout the sentence. This would sound as jarring to a native speaker as mixing "ain't" with "indeed" in English.
Correct versions:
- Casual masculine: 食べたんだぜ (tabetan da ze) — "I ate, y'know!" (rough)
- Formal neutral: 食べたんです (tabetan desu) — "I ate." (polite)
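Because the command prints JSON, its output is easy to post-process. The sketch below, which assumes only the field names visible in the examples above, turns one report into a list of human-readable warnings. The helper name `review_flags` and the wording of the messages are made up for this example.

```python
import json


def review_flags(report: dict) -> list[str]:
    """Collect warnings from one report, using only fields shown in the examples."""
    flags = []
    if not report["is_grammatic"]:
        flags.append(f"ungrammatical (score {report['grammaticality_score']:.3g})")
    if not report["formality_is_pragmatic"]:
        flags.append("inconsistent formality")
    if not report["gender_is_pragmatic"]:
        flags.append("inconsistent gender markers")
    return flags


# Values copied from the 食べたんだぜです example (kotogram field omitted for brevity).
raw = """
{
  "formality": "unpragmatic_formality",
  "formality_score": 0.3184594213962555,
  "formality_is_pragmatic": false,
  "gender": "masculine",
  "gender_score": -0.9999995827674866,
  "gender_is_pragmatic": true,
  "registers": ["danseigo"],
  "register_scores": {"danseigo": 0.9998853206634521},
  "is_grammatic": false,
  "grammaticality_score": 2.01202964879299e-12
}
"""
print(review_flags(json.loads(raw)))
```

For this report the list contains two entries, one for grammaticality and one for formality, while the gender check passes.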
Installation & Usage
Python
pip install kotogram
from kotogram import SudachiJapaneseParser, grammar
# Parse Japanese to kotogram format
parser = SudachiJapaneseParser()
text = "お疲れ様でございます"
kotogram_str = parser.japanese_to_kotogram(text)
# Analyze the grammar
analysis = grammar(kotogram_str)
print(f"Formality: {analysis.formality}")
print(f"Gender: {analysis.gender}")
print(f"Registers: {analysis.registers}")
print(f"Grammatic? {analysis.is_grammatic}")
print(f"Grammaticality confidence: {analysis.grammaticality_score:.4f}")
You can also work with kotograms directly:
from kotogram import kotogram_to_japanese, split_kotogram
# Convert back to readable Japanese
japanese = kotogram_to_japanese(kotogram_str)
# Add furigana readings (great for learners!)
with_furigana = kotogram_to_japanese(kotogram_str, furigana=True)
# Output: "お疲れ様[おつかれさま]でございます"
# Split into tokens for detailed analysis
tokens = split_kotogram(kotogram_str)
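For intuition about what converting back to Japanese involves, here is a self-contained sketch that joins the surface forms and, optionally, appends a hiragana reading in brackets to any token containing kanji. It is not the implementation of `kotogram_to_japanese`; the function names `render` and `to_hiragana`, and the rule "add a reading only when the surface contains kanji", are assumptions for this example.

```python
import re

TOKEN_RE = re.compile(r"⌈([^⌉]*)⌉")          # text between token boundaries
SURFACE_RE = re.compile(r"ˢ([^ᵖᵇᵈʳ]*)")      # surface form after the ˢ marker
READING_RE = re.compile(r"ʳ([^ˢᵖᵇᵈ]*)")      # katakana reading after the ʳ marker
KANJI_RE = re.compile(r"[\u4e00-\u9fff]")    # CJK Unified Ideographs block


def to_hiragana(katakana: str) -> str:
    """Shift katakana code points down by 0x60; leave other characters alone."""
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in katakana)


def render(kotogram: str, furigana: bool = False) -> str:
    """Join surface forms; optionally annotate kanji-bearing tokens with readings."""
    out = []
    for body in TOKEN_RE.findall(kotogram):
        surface_match = SURFACE_RE.search(body)
        surface = surface_match.group(1) if surface_match else ""
        reading = READING_RE.search(body)
        if furigana and reading and KANJI_RE.search(surface):
            out.append(f"{surface}[{to_hiragana(reading.group(1))}]")
        else:
            out.append(surface)
    return "".join(out)


sample = (
    "⌈ˢあらᵖinterj:generalʳアラ⌉⌈ˢ、ᵖaux-symbol:comma⌉"
    "⌈ˢ素敵ᵖadjectival-noun:generalʳステキ⌉"
    "⌈ˢですᵖaux-verb:aux-desu:terminalʳデス⌉"
    "⌈ˢわᵖparticle:sentence-final-particleʳワ⌉"
)
print(render(sample))                 # あら、素敵ですわ
print(render(sample, furigana=True))  # あら、素敵[すてき]ですわ
```

The round trip works because every token keeps its original surface form, so no information about the input text is lost in the encoding.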
TypeScript
npm install kotogram
import { kotogramToJapanese, splitKotogram } from 'kotogram';
// Work with pre-computed kotograms (Python handles the parsing)
const kotogram = "⌈ˢ猫ᵖnoun:common-nounʳネコ⌉⌈ˢをᵖparticle:case-particleʳヲ⌉...";
// Convert to Japanese
const japanese = kotogramToJapanese(kotogram);
console.log(japanese); // "猫を食べる"
// Add furigana
const withFurigana = kotogramToJapanese(kotogram, { furigana: true });
console.log(withFurigana); // "猫[ねこ]を食べる[たべる]"
// Split into tokens
const tokens = splitKotogram(kotogram);
How It Works
The core of Kotogram is a compact transformer-based neural model (only 7MB!) trained on a carefully curated dataset. Rather than feeding it raw text, we use the kotogram representation — a structured format that explicitly encodes morphological features like POS tags, conjugation forms, and lemmas.
Why this approach?
By working with structured linguistic features instead of raw characters, the model can learn meaningful patterns from relatively small amounts of data. Think of it like the difference between learning grammar rules versus memorizing every possible sentence.
Training data:
- ~265K grammatical sentences with formality/gender labels (applied via heuristics)
- 1,115 hand-curated register examples across 13 categories (sonkeigo, kenjogo, dialects, internet slang, etc.)
- ~593K ungrammatical examples for error detection
- ~270K unpragmatic examples showing inappropriate formality/gender mixing
What the model learns:
- Formality as a continuous scale (-1.0 = very casual → +1.0 = very formal)
- Gender as a continuous scale (-1.0 = masculine → +1.0 = feminine)
- Register detection as a multi-label problem (sentences can have multiple registers!)
- Grammaticality as binary classification
- Pragmatic consistency — does this sentence maintain appropriate formality/gender?
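The multi-label point is worth a small illustration. A common way to let a sentence carry several registers at once is to score each label independently with a sigmoid and keep every label above a threshold, rather than picking a single winner with softmax. The sketch below shows that general technique; the logits, the 0.5 threshold, and the function name `pick_registers` are all assumptions, since this README does not describe how Kotogram decides which registers to report.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def pick_registers(logits: dict[str, float], threshold: float = 0.5) -> dict[str, float]:
    """Keep every register whose independent score reaches the threshold."""
    scores = {name: sigmoid(value) for name, value in logits.items()}
    return {name: score for name, score in scores.items() if score >= threshold}


# Hypothetical logits for a sentence that uses both respectful and humble forms.
example_logits = {"sonkeigo": 2.2, "kenjogo": 0.4, "danseigo": -4.0}
print(pick_registers(example_logits))
```

Because the scores are independent, they do not need to sum to 1, which is what allows two registers to be reported together.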
The architecture uses multi-head attention over linguistic feature embeddings, trained with AdamW and cosine annealing — pretty standard modern NLP techniques, but applied to a focused domain-specific problem.
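For readers unfamiliar with cosine annealing, the schedule lowers the learning rate from a maximum to a minimum along half a cosine wave. The sketch below implements the standard formula in plain Python. The default values are placeholders, not Kotogram's actual training hyperparameters, which this README does not list.

```python
import math


def cosine_annealed_lr(step: int, total_steps: int,
                       lr_max: float = 1e-3, lr_min: float = 0.0) -> float:
    """Learning rate at `step`, decaying from lr_max to lr_min over total_steps."""
    progress = min(step, total_steps) / total_steps   # clamp to [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Print the schedule at the start, quarter, half, three-quarter, and end points.
for step in (0, 25, 50, 75, 100):
    print(step, f"{cosine_annealed_lr(step, 100):.6f}")
```

The rate changes slowly near the start and end of training and fastest in the middle, which is the usual motivation for choosing this shape over a linear decay.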
Design Philosophy
I built Kotogram around the idea that domain knowledge + efficient models > massive pre-training. Instead of throwing a huge transformer at raw text, we leverage what we know about Japanese linguistics to create structured representations that make the learning problem tractable.
Benefits:
- Fast: < 10ms inference on CPU for typical sentences
- Lightweight: 7MB model fits easily in web apps, mobile apps, serverless functions
- Interpretable: Feature-based representations make it easier to debug and understand predictions
Citation
If you use Kotogram in your research or project, feel free to cite:
@software{kotogram2024,
author = {Fisher, Jomo},
title = {Kotogram: A Lightweight Japanese NLP Library for Grammar Analysis},
year = {2024},
publisher = {GitHub},
url = {https://github.com/jomof/kotogram}
}
Contributing
This started as a weekend project to explore Japanese linguistics and small-scale NLP. If you're interested in Japanese grammar, machine learning, or both — I'd love to hear from you! Feel free to open issues, submit PRs, or just say hi.
License
MIT — use it for whatever you like!