A dual Python/TypeScript library for Japanese text parsing and encoding using the kotogram format

Maintainers

jomof

Project description

Kotogram

What is this?

Ever wondered if a Japanese sentence sounds too formal, or whether that sentence-ending particle makes it sound masculine? Kotogram is a lightweight NLP library that analyzes the grammatical style of Japanese text: formality, gender markers, register, and grammaticality.

While excellent tools like MeCab and Sudachi focus on morphological analysis (breaking text into tokens and identifying parts of speech), Kotogram takes things a step further by analyzing the social and stylistic dimensions of Japanese text:

Formality: Is this casual banter or keigo? (And is it mixing them inappropriately?)
Gender: Does this use masculine (俺だぜ), feminine (〜わ), or neutral speech patterns?
Register: Kansai-ben? Internet slang? Honorific language? Military commands?
Grammaticality: Is this sentence well-formed, or a common learner mistake?

The whole thing runs on a compact 7MB neural model and works in both Python (for the ML inference) and TypeScript (for working with the kotogram format).

Quick Examples

Let's see it in action! The bin/kotogram grammar command analyzes any Japanese text:

Detecting Formality

$ bin/kotogram grammar "お疲れ様でございます"
{
  "kotogram": "⌈ˢお疲れ様ᵖnoun:common-noun:adjectival-noun-possibleʳオツカレサマ⌉⌈ˢでᵖaux-verb:aux-da:continuativeᵇだᵈだʳデ⌉⌈ˢございᵖverb:bound:godan-ra:continuative-i-euphonicᵇござるᵈござるʳゴザイ⌉⌈ˢますᵖaux-verb:aux-masu:terminalʳマス⌉",
  "formality": "formal",
  "formality_score": 0.5010958909988403,
  "formality_is_pragmatic": true,
  "gender": "neutral",
  "gender_score": 0.0007681779679842293,
  "gender_is_pragmatic": true,
  "registers": [
    "neutral"
  ],
  "register_scores": {
    "neutral": 0.9213598966598511
  },
  "is_grammatic": true,
  "grammaticality_score": 0.9999127388000488
}

The kotogram field in the output shows how the sentence gets internally represented. Here's what one token looks like when you break it down:

⌈ˢございᵖverb:bound:godan-ra:continuative-i-euphonicᵇござるᵈござるʳゴザイ⌉
  │  │     │                                        │      │      │
  │  │     │                                        │      │      └─ pronunciation (ʳ)
  │  │     │                                        │      └─ lemma (ᵈ)
  │  │     │                                        └─ base form (ᵇ)
  │  │     └─ part-of-speech + conjugation (ᵖ)
  │  └─ surface form (ˢ)
  └─ token boundaries (⌈⌉)

Pretty neat how much linguistic information we can pack into a compact format, right?
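
To make the layout above concrete, here is a minimal sketch that splits a kotogram string into named fields using only the marker characters visible in the example output. The helper names `parse_token` and `parse_kotogram` are hypothetical and are not part of the library's API; for real work, prefer the library's own `split_kotogram`.

```python
import re

# Marker characters copied from the example output above.
FIELD_NAMES = {
    "ˢ": "surface",
    "ᵖ": "pos",
    "ᵇ": "base",
    "ᵈ": "lemma",
    "ʳ": "pronunciation",
}
MARKERS = "".join(FIELD_NAMES)


def parse_token(token: str) -> dict:
    """Split one ⌈...⌉ token into named fields (hypothetical helper)."""
    body = token.strip("⌈⌉")
    # Each field is one marker character followed by text up to the next marker.
    pairs = re.findall(f"([{MARKERS}])([^{MARKERS}]*)", body)
    return {FIELD_NAMES[marker]: value for marker, value in pairs}


def parse_kotogram(kotogram: str) -> list:
    """Parse every ⌈...⌉ token found in a kotogram string."""
    return [parse_token(t) for t in re.findall(r"⌈[^⌉]*⌉", kotogram)]


token = "⌈ˢございᵖverb:bound:godan-ra:continuative-i-euphonicᵇござるᵈござるʳゴザイ⌉"
fields = parse_token(token)
print(fields["surface"], fields["base"], fields["pronunciation"])
```

Fields that a token omits (for example, the comma token in the next example has no ʳ field) are simply absent from the returned dictionary.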

Gender Detection

$ bin/kotogram grammar "あら、素敵ですわ"
{
  "kotogram": "⌈ˢあらᵖinterj:generalʳアラ⌉⌈ˢ、ᵖaux-symbol:comma⌉⌈ˢ素敵ᵖadjectival-noun:generalʳステキ⌉⌈ˢですᵖaux-verb:aux-desu:terminalʳデス⌉⌈ˢわᵖparticle:sentence-final-particleʳワ⌉",
  "formality": "formal",
  "formality_score": 0.5490256547927856,
  "formality_is_pragmatic": true,
  "gender": "feminine",
  "gender_score": 0.9999998211860657,
  "gender_is_pragmatic": true,
  "registers": [
    "ojousama"
  ],
  "register_scores": {
    "ojousama": 0.9900707602500916
  },
  "is_grammatic": true,
  "grammaticality_score": 0.999970555305481
}

The model picks up on that sentence-final わ (wa) and correctly identifies this as ojousama-style speech (refined, upper-class feminine Japanese). On the model's gender scale, which runs from -1.0 (masculine) to +1.0 (feminine), a score of 0.9999998 sits almost exactly at the feminine end.

Catching Subtle Awkwardness

Here's a more subtle issue — a sentence that's technically parseable but semantically awkward:

$ bin/kotogram grammar "大きくない小さい"
{
  "kotogram": "⌈ˢ大きくᵖadj:general:i-adjective:continuativeᵇ大きいᵈ大きいʳオオキク⌉⌈ˢないᵖadj:bound:i-adjective:terminalʳナイ⌉⌈ˢ小さいᵖadj:general:i-adjective:terminalʳチイサイ⌉",
  "formality": "neutral",
  "formality_score": -0.00582164479419589,
  "formality_is_pragmatic": true,
  "gender": "neutral",
  "gender_score": -0.0024029570631682873,
  "gender_is_pragmatic": true,
  "registers": [
    "neutral"
  ],
  "register_scores": {
    "neutral": 0.9790019989013672
  },
  "is_grammatic": false,
  "grammaticality_score": 0.1085873544216156
}

Why this is awkward: This literally means "not-big small" — grammatically parseable, but semantically redundant. While you can stack adjectives in Japanese, saying "not big small" is unnatural because 小さい (chiisai, small) already implies "not big."

Japanese highly values concision (簡潔さ). The natural way to express this would be simply:

Concise: 小さい (chiisai) — "small"
Or with emphasis: 大きくない (ookikunai) — "not big"

This kind of redundant negation occasionally appears in learner speech when the speaker is trying to be emphatic but ends up unnecessarily verbose. The model's grammaticality score of about 0.109 (low, but not zero) reflects that while the syntax parses, the semantic redundancy makes the sentence sound distinctly non-native.

Detecting Unpragmatic Mixing

Here's an interesting one — a sentence that's grammatically parseable but stylistically bizarre:

$ bin/kotogram grammar "食べたんだぜです"
{
  "kotogram": "⌈ˢ食べᵖverb:general:lower-ichidan-ba:continuativeᵇ食べるᵈ食べるʳタベ⌉⌈ˢたᵖaux-verb:aux-ta:attributiveʳタ⌉⌈ˢんᵖparticle:nominal-particleʳン⌉⌈ˢだᵖaux-verb:aux-da:terminalʳダ⌉⌈ˢぜᵖparticle:sentence-final-particleʳゼ⌉⌈ˢですᵖaux-verb:aux-desu:terminalʳデス⌉",
  "formality": "unpragmatic_formality",
  "formality_score": 0.3184594213962555,
  "formality_is_pragmatic": false,
  "gender": "masculine",
  "gender_score": -0.9999995827674866,
  "gender_is_pragmatic": true,
  "registers": [
    "danseigo"
  ],
  "register_scores": {
    "danseigo": 0.9998853206634521
  },
  "is_grammatic": false,
  "grammaticality_score": 2.01202964879299e-12
}

Why is this unpragmatic? It mixes ぜ (ze, a rough masculine sentence-ender) with です (desu, formal copula). In Japanese, you need to pick a formality register and stick with it throughout the sentence. This would sound as jarring to a native speaker as mixing "ain't" with "indeed" in English.

Correct versions:

Casual masculine: 食べたんだぜ (tabetan da ze) — "I ate, y'know!" (rough)
Formal neutral: 食べたんです (tabetan desu) — "I ate." (polite)
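
If you capture the command's JSON output, the boolean fields make it easy to triage sentences in bulk. The sketch below relies only on field names that appear in the outputs above; `review_notes` is a hypothetical helper, and the wording of its notes is an illustration rather than anything the tool prints.

```python
import json


def review_notes(result: dict) -> list:
    """Collect human-readable notes from one analysis result (hypothetical helper)."""
    notes = []
    if not result["is_grammatic"]:
        score = result["grammaticality_score"]
        notes.append(f"possibly ungrammatical (score {score:.3g})")
    if not result["formality_is_pragmatic"]:
        notes.append("formality markers are mixed inconsistently")
    if not result["gender_is_pragmatic"]:
        notes.append("gender markers are mixed inconsistently")
    return notes


# Abbreviated copy of the 食べたんだぜです output shown above.
raw = """
{
  "formality": "unpragmatic_formality",
  "formality_is_pragmatic": false,
  "gender": "masculine",
  "gender_is_pragmatic": true,
  "is_grammatic": false,
  "grammaticality_score": 2.01202964879299e-12
}
"""
for note in review_notes(json.loads(raw)):
    print(note)
```

An empty list means none of the three flags fired, which is what the first two examples in this section would produce.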

Installation & Usage

Python

pip install kotogram

from kotogram import SudachiJapaneseParser, grammar

# Parse Japanese to kotogram format
parser = SudachiJapaneseParser()
text = "お疲れ様でございます"
kotogram_str = parser.japanese_to_kotogram(text)

# Analyze the grammar
analysis = grammar(kotogram_str)

print(f"Formality: {analysis.formality}")
print(f"Gender: {analysis.gender}")
print(f"Registers: {analysis.registers}")
print(f"Grammatic? {analysis.is_grammatic}")
print(f"Grammaticality confidence: {analysis.grammaticality_score:.4f}")

You can also work with kotograms directly:

from kotogram import kotogram_to_japanese, split_kotogram

# Convert back to readable Japanese
japanese = kotogram_to_japanese(kotogram_str)

# Add furigana readings (great for learners!)
with_furigana = kotogram_to_japanese(kotogram_str, furigana=True)
# Output: "お疲れ様[おつかれさま]で御座います[ございます]"

# Split into tokens for detailed analysis
tokens = split_kotogram(kotogram_str)
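
For intuition about where furigana can come from, here is a rough sketch that derives a bracketed reading from a token's surface and its katakana ʳ field. It is not the library's implementation: the helper names are hypothetical, it assumes the ʳ field is plain katakana, and it annotates the whole surface rather than aligning readings to individual kanji.

```python
def to_hiragana(katakana: str) -> str:
    """Convert katakana to hiragana by shifting code points down by 0x60."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in katakana
    )


def has_kanji(text: str) -> bool:
    """True if any character falls in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def with_reading(surface: str, pronunciation: str) -> str:
    """Append a bracketed hiragana reading to surfaces that contain kanji."""
    if not has_kanji(surface):
        return surface
    return f"{surface}[{to_hiragana(pronunciation)}]"


print(with_reading("お疲れ様", "オツカレサマ"))
print(with_reading("で", "デ"))
```

Kana-only tokens pass through unchanged, so only tokens containing kanji pick up a reading.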

TypeScript

npm install kotogram

import { kotogramToJapanese, splitKotogram } from 'kotogram';

// Work with pre-computed kotograms (Python handles the parsing)
const kotogram = "⌈ˢ猫ᵖnoun:common-nounʳネコ⌉⌈ˢをᵖparticle:case-particleʳヲ⌉...";

// Convert to Japanese
const japanese = kotogramToJapanese(kotogram);
console.log(japanese);  // "猫を食べる"

// Add furigana
const withFurigana = kotogramToJapanese(kotogram, { furigana: true });
console.log(withFurigana);  // "猫[ねこ]を食べる[たべる]"

// Split into tokens
const tokens = splitKotogram(kotogram);

How It Works

The core of Kotogram is a compact transformer-based neural model (only 7MB!) trained on a carefully curated dataset. Rather than feeding it raw text, we use the kotogram representation — a structured format that explicitly encodes morphological features like POS tags, conjugation forms, and lemmas.

Why this approach?

By working with structured linguistic features instead of raw characters, the model can learn meaningful patterns from relatively small amounts of data. Think of it like the difference between learning grammar rules versus memorizing every possible sentence.
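
As a rough illustration of what "structured features" can mean in practice, the sketch below pulls the colon-separated ᵖ tags out of a kotogram and maps each distinct feature to a small integer id. How the actual model builds its embeddings is not described here, so treat `pos_features` and `build_vocab` as hypothetical helpers rather than the real pipeline.

```python
import re


def pos_features(kotogram: str) -> list:
    """Return the colon-separated ᵖ tag of each token as a list of features."""
    features = []
    for token in re.findall(r"⌈[^⌉]*⌉", kotogram):
        match = re.search(r"ᵖ([^ˢᵖᵇᵈʳ⌉]*)", token)
        features.append(match.group(1).split(":") if match else [])
    return features


def build_vocab(feature_lists: list) -> dict:
    """Assign an integer id to every distinct feature, in first-seen order."""
    vocab = {}
    for features in feature_lists:
        for feature in features:
            vocab.setdefault(feature, len(vocab))
    return vocab


kotogram = "⌈ˢですᵖaux-verb:aux-desu:terminalʳデス⌉⌈ˢわᵖparticle:sentence-final-particleʳワ⌉"
features = pos_features(kotogram)
vocab = build_vocab(features)
ids = [[vocab[f] for f in token] for token in features]
print(features)
print(ids)
```

The feature vocabulary stays small because tags such as `terminal` or `particle` recur across many different surface forms, which is the intuition behind learning from less data.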

Training data:

~265K grammatic sentences with formality/gender labels (applied via heuristics)
1,115 hand-curated register examples across 13 categories (sonkeigo, kenjogo, dialects, internet slang, etc.)
~593K agrammatic examples for error detection
~270K unpragmatic examples showing inappropriate formality/gender mixing

What the model learns:

Formality as a continuous scale (-1.0 = very casual → +1.0 = very formal)
Gender as a continuous scale (-1.0 = masculine → +1.0 = feminine)
Register detection as a multi-label problem (sentences can have multiple registers!)
Grammaticality as binary classification
Pragmatic consistency — does this sentence maintain appropriate formality/gender?
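
To show how continuous scores and multi-label outputs might be turned into labels, here is a small sketch. The cutoff values and the label name "casual" are assumptions chosen for illustration; the library's actual thresholds are not documented here, and the separate pragmatic flags can override a plain label (as in the "unpragmatic_formality" example above).

```python
def scale_label(score: float, negative: str, positive: str, cutoff: float = 0.25) -> str:
    """Map a score in [-1, 1] to one of three labels using a symmetric cutoff."""
    if score >= cutoff:
        return positive
    if score <= -cutoff:
        return negative
    return "neutral"


def active_registers(register_scores: dict, cutoff: float = 0.5) -> list:
    """Multi-label selection: keep every register whose score clears the cutoff."""
    return sorted(name for name, score in register_scores.items() if score >= cutoff)


# Scores taken from the example outputs earlier in this document.
print(scale_label(0.5010958909988403, "casual", "formal"))
print(scale_label(-0.9999995827674866, "masculine", "feminine"))
print(scale_label(-0.00582164479419589, "casual", "formal"))
print(active_registers({"ojousama": 0.9900707602500916}))
```

Because each register is scored independently, `active_registers` can return zero, one, or several names for the same sentence.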

The architecture uses multi-head attention over linguistic feature embeddings, trained with AdamW and cosine annealing — pretty standard modern NLP techniques, but applied to a focused domain-specific problem.
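
For readers unfamiliar with cosine annealing, the schedule lowers the learning rate along half a cosine wave: lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T)). The sketch below evaluates that formula in plain Python; the step count and learning rates are hypothetical, since the actual training hyperparameters are not stated here.

```python
import math


def cosine_annealed_lr(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Learning rate at `step` on a single cosine decay from lr_max to lr_min."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values: 1000 steps, peak learning rate 1e-3.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_annealed_lr(step, total_steps=1000, lr_max=1e-3))
```

The rate starts at `lr_max`, passes the midpoint halfway through training, and reaches `lr_min` at the final step.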

Design Philosophy

I built Kotogram around the idea that domain knowledge + efficient models > massive pre-training. Instead of throwing a huge transformer at raw text, Kotogram uses what is already known about Japanese linguistics to create structured representations that make the learning problem tractable.

Benefits:

Fast: < 10ms inference on CPU for typical sentences
Lightweight: 7MB model fits easily in web apps, mobile apps, serverless functions
Interpretable: Feature-based representations make it easier to debug and understand predictions
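
Latency depends on hardware, so it is worth measuring on your own machine. Below is a minimal timing harness; `stand_in_analyzer` is a placeholder workload so the snippet runs without the model installed, and you would swap in your own parse-and-analyze call to get meaningful numbers.

```python
import statistics
import time


def median_latency_ms(fn, arg, runs: int = 50) -> float:
    """Median wall-clock time of fn(arg) in milliseconds over several runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(arg)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def stand_in_analyzer(text: str) -> int:
    """Placeholder workload; replace with a real analysis call."""
    return len(text.encode("utf-8"))


latency = median_latency_ms(stand_in_analyzer, "お疲れ様でございます")
print(f"{latency:.4f} ms")
```

Using the median rather than the mean keeps one slow first call (model loading, cache warm-up) from skewing the result.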

Citation

If you use Kotogram in your research or project, feel free to cite:

@software{kotogram2024,
  author = {Fisher, Jomo},
  title = {Kotogram: A Lightweight Japanese NLP Library for Grammar Analysis},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/jomof/kotogram}
}

Contributing

This started as a weekend project to explore Japanese linguistics and small-scale NLP. If you're interested in Japanese grammar, machine learning, or both — I'd love to hear from you! Feel free to open issues, submit PRs, or just say hi.

License

MIT — use it for whatever you like!

Project details

Release history

This version

0.0.22

Dec 23, 2025

0.0.21

Dec 23, 2025

0.0.20

Dec 22, 2025

0.0.18

Dec 21, 2025

0.0.17

Dec 21, 2025

0.0.16

Dec 18, 2025

0.0.15

Dec 18, 2025

0.0.14

Dec 17, 2025

0.0.13

Dec 17, 2025

0.0.12

Dec 13, 2025

0.0.11

Dec 13, 2025

0.0.10

Dec 10, 2025

0.0.9

Dec 10, 2025

0.0.8

Dec 10, 2025

0.0.7

Dec 10, 2025

0.0.6

Dec 10, 2025

0.0.5

Dec 10, 2025

0.0.4

Dec 10, 2025

0.0.3

Dec 10, 2025

Download files

Source Distribution

kotogram-0.0.22.tar.gz (6.6 MB)

Uploaded Dec 23, 2025 Source

Built Distribution

kotogram-0.0.22-py3-none-any.whl (6.6 MB)

Uploaded Dec 23, 2025 Python 3

File details

Details for the file kotogram-0.0.22.tar.gz.

File metadata

Download URL: kotogram-0.0.22.tar.gz
Upload date: Dec 23, 2025
Size: 6.6 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kotogram-0.0.22.tar.gz
Algorithm	Hash digest
SHA256	`592301649cad5a9497d541303e20afcf82c437bd6d6ed0868418f997bc6717f0`
MD5	`b03b620d53e196c980638c165f1c552a`
BLAKE2b-256	`81c9973e4b756dc9bdfa72b1c226e800147bde9585ca6c9e7b98b1294392eff7`
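
A downloaded archive can be checked against the SHA256 digest in the table above with Python's standard `hashlib`. This is a minimal sketch: the helper names are hypothetical, and the local file path in the commented-out call is an assumption about where you saved the download.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


# Digest copied from the table above.
EXPECTED = "592301649cad5a9497d541303e20afcf82c437bd6d6ed0868418f997bc6717f0"
# Example call, assuming the archive sits in the current directory:
# matches_published_digest("kotogram-0.0.22.tar.gz", EXPECTED)
```

A `False` result means the file differs from what was published and should not be installed.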

Provenance

The following attestation bundles were made for kotogram-0.0.22.tar.gz:

Publisher: python_publish.yml on jomof/kotogram

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kotogram-0.0.22.tar.gz
- Subject digest: 592301649cad5a9497d541303e20afcf82c437bd6d6ed0868418f997bc6717f0
- Sigstore transparency entry: 776172631
- Sigstore integration time: Dec 23, 2025
Source repository:
- Permalink: jomof/kotogram@ff42469392588016102164910635184080c5af1c
- Branch / Tag: refs/tags/v0.0.22
- Owner: https://github.com/jomof
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python_publish.yml@ff42469392588016102164910635184080c5af1c
- Trigger Event: push

File details

Details for the file kotogram-0.0.22-py3-none-any.whl.

File metadata

Download URL: kotogram-0.0.22-py3-none-any.whl
Upload date: Dec 23, 2025
Size: 6.6 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kotogram-0.0.22-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ac1ee49f291185a3693a34a04fb51092eb7f8739599b2276be9656d9db7c83f1`
MD5	`f720ecd691cb495090e23ea2656629bc`
BLAKE2b-256	`e56450ca96e7cf86ae22f74de38abb9fb4539367faba783103b9137b6ccac901`

Provenance

The following attestation bundles were made for kotogram-0.0.22-py3-none-any.whl:

Publisher: python_publish.yml on jomof/kotogram

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kotogram-0.0.22-py3-none-any.whl
- Subject digest: ac1ee49f291185a3693a34a04fb51092eb7f8739599b2276be9656d9db7c83f1
- Sigstore transparency entry: 776172637
- Sigstore integration time: Dec 23, 2025
Source repository:
- Permalink: jomof/kotogram@ff42469392588016102164910635184080c5af1c
- Branch / Tag: refs/tags/v0.0.22
- Owner: https://github.com/jomof
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python_publish.yml@ff42469392588016102164910635184080c5af1c
- Trigger Event: push

kotogram 0.0.22
