Kotogram

A dual Python/TypeScript library for Japanese text parsing and encoding using the kotogram compact format.

Overview

Kotogram provides tools for parsing Japanese text into a compact, linguistically rich format that encodes part-of-speech, conjugation, and pronunciation information. The library features:

  • Abstract parser interface (JapaneseParser) for multiple backend implementations
  • MeCab implementation (MecabJapaneseParser) using the UniDic dictionary
  • Kotogram format - compact representation preserving linguistic features
  • Bidirectional conversion between Japanese text and kotogram format
  • Dual-language support - Python and TypeScript implementations (TypeScript coming soon)
  • Production-quality CI/CD with comprehensive testing and publishing workflows

Project Structure

kotogram/
├── kotogram/                    # Python package
│   ├── __init__.py             # Package exports and version
│   ├── japanese_parser.py      # Abstract JapaneseParser interface
│   └── mecab_japanese_parser.py # MeCab implementation
├── src/                         # TypeScript source
│   ├── kotogram.ts             # Kotogram conversion functions
│   └── index.ts                # Package exports
├── tests-py/                    # Python tests
│   └── test_japanese_parser.py # Japanese parser tests
├── tests-ts/                    # TypeScript tests
│   └── kotogram.test.ts
├── .github/workflows/           # CI/CD workflows
│   ├── python_canary.yml       # Python build & test
│   ├── typescript_canary.yml   # TypeScript build & test
│   ├── python_publish.yml      # Publish to PyPI
│   └── typescript_publish.yml  # Publish to npm
├── version.txt                  # Single source of truth for version
├── publish.sh                  # Version bump and publish script
├── pyproject.toml              # Python package configuration
├── package.json                # TypeScript package configuration
└── tsconfig.json               # TypeScript compiler configuration

Quick Start

Japanese Text Parsing

Parse Japanese text into kotogram format with full linguistic information:

Python:

from kotogram import MecabJapaneseParser, kotogram_to_japanese

# Initialize parser (requires MeCab and unidic)
parser = MecabJapaneseParser()

# Convert Japanese to kotogram
japanese = "猫を食べる"
kotogram = parser.japanese_to_kotogram(japanese)
# Result: ⌈ˢ猫ᵖn:common_noun⌉⌈ˢをᵖprt:case_particle⌉⌈ˢ食べるᵖv:general:e-ichidan-ba:terminal⌉

# Convert back to Japanese
reconstructed = kotogram_to_japanese(kotogram)
# Result: "猫を食べる"

# With spaces between tokens
spaced = kotogram_to_japanese(kotogram, spaces=True)
# Result: "猫 を 食べる"

# With furigana (IME-style readings in brackets)
with_furigana = kotogram_to_japanese(kotogram, furigana=True)
# Result: "猫[ねこ]を食べる[たべる]"

# Combine options
spaced_furigana = kotogram_to_japanese(kotogram, spaces=True, furigana=True)
# Result: "猫[ねこ] を 食べる[たべる]"

TypeScript:

import { kotogramToJapanese, splitKotogram } from 'kotogram';

// A kotogram string produced by the Python parser (converting Japanese to kotogram requires the Python parser)
const kotogram = "⌈ˢ猫ᵖn:common_noun⌉⌈ˢをᵖprt:case_particle⌉⌈ˢ食べるᵖv:general:e-ichidan-ba:terminal⌉";

// Convert back to Japanese
const reconstructed = kotogramToJapanese(kotogram);
// Result: "猫を食べる"

// With spaces between tokens
const spaced = kotogramToJapanese(kotogram, { spaces: true });
// Result: "猫 を 食べる"

// With furigana (IME-style readings in brackets)
const withFurigana = kotogramToJapanese(kotogram, { furigana: true });
// Result: "猫[ねこ]を食べる[たべる]"

// Split into tokens
const tokens = splitKotogram(kotogram);
// Result: ["⌈ˢ猫ᵖn:common_noun⌉", "⌈ˢをᵖprt:case_particle⌉", "⌈ˢ食べるᵖv:general:e-ichidan-ba:terminal⌉"]

Kotogram Format

The kotogram format encodes rich linguistic information in a compact representation:

⌈ˢ食べるᵖv:general:e-ichidan-ba:terminalᵇ食べるᵈ食べるʳタベル⌉
  │  │    │ │       │            │         │      │      │
  │  │    │ │       │            │         │      │      └─ pronunciation (ʳ)
  │  │    │ │       │            │         │      └─ lemma (ᵈ)
  │  │    │ │       │            │         └─ base form (ᵇ)
  │  │    │ │       │            └─ conjugation form
  │  │    │ │       └─ conjugation type
  │  │    │ └─ POS detail
  │  │    └─ part-of-speech (ᵖ)
  │  └─ surface form (ˢ)
  └─ token boundary markers (⌈⌉)
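
To make the field layout concrete, here is a minimal sketch that splits one token into its marked fields. It relies only on the marker characters shown in the diagram; the helper name decode_token and the field labels are hypothetical and are not part of the kotogram package.

```python
import re

# Marker characters copied from the diagram above. The field names are labels
# chosen for this sketch, not identifiers from the kotogram package.
FIELD_MARKERS = {
    "ˢ": "surface",
    "ᵖ": "pos",
    "ᵇ": "base",
    "ᵈ": "lemma",
    "ʳ": "pronunciation",
}


def decode_token(token: str) -> dict:
    """Split one ⌈...⌉ token into its marked fields (hypothetical helper)."""
    if not (token.startswith("⌈") and token.endswith("⌉")):
        raise ValueError(f"not a kotogram token: {token!r}")
    body = token[1:-1]
    markers = "".join(FIELD_MARKERS)
    # Each field starts at a marker and runs until the next marker or the end.
    pattern = f"([{markers}])([^{markers}]*)"
    return {FIELD_MARKERS[m]: value for m, value in re.findall(pattern, body)}


fields = decode_token("⌈ˢ食べるᵖv:general:e-ichidan-ba:terminalᵇ食べるᵈ食べるʳタベル⌉")
# The colon-separated POS field can be split further if needed.
pos_parts = fields["pos"].split(":")
```

Tokens that omit some fields, such as the shorter ones in the Quick Start, simply yield a smaller dictionary.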

Development

Python Development

# Install in development mode
pip install -e .

# Run tests
python -m pytest tests-py/

# Run type checking
mypy kotogram/

# Build package
python -m build

TypeScript Development

# Install dependencies
npm install

# Build
npm run build

# Run tests
npm test

# Type check
npx tsc --noEmit

Testing

Python Tests

Tests are located in tests-py/ and use the unittest framework. They are also compatible with pytest.

Run tests:

python -m unittest discover -s tests-py -p 'test_*.py' -v
# or
python -m pytest tests-py/ -v

TypeScript Tests

Tests are located in tests-ts/ and use the Node.js built-in test runner.

Run tests:

npm test

GitHub Workflows

Canary Builds

These workflows run on every push, pull request, and daily at 2 AM UTC:

  • .github/workflows/python_canary.yml

    • Testing: Runs on Python 3.8, 3.9, 3.10, 3.11, 3.12 with unittest and pytest
    • Code Coverage: Tracks test coverage and uploads to Codecov
    • Code Quality:
      • Black for code formatting
      • isort for import sorting
      • flake8 for linting (complexity limit: 10)
      • pylint for advanced code quality (minimum score: 8.0)
      • mypy for strict type checking
    • Security:
      • bandit for security vulnerability scanning
      • safety for dependency vulnerability checks
    • Best Practices:
      • Checks for print() statements (should use logging)
      • Detects TODO/FIXME comments
      • Validates README.md and LICENSE files exist
    • Package Validation:
      • Ensures no TypeScript/JavaScript files leak into Python package
      • Verifies package contents and structure
  • .github/workflows/typescript_canary.yml

    • Testing: Runs on Node.js 18, 20, 22
    • Type Checking: Strict TypeScript type checking with --noEmit
    • Code Quality:
      • ESLint for linting (if configured)
      • Prettier for code formatting (if configured)
      • Circular dependency detection with madge
    • Performance:
      • Bundle size analysis (warns if >100KB)
    • Security:
      • npm audit for dependency vulnerabilities
    • Best Practices:
      • Checks for console.log() statements
      • Detects TODO/FIXME comments
      • Warns about any types (encourages type safety)
      • Validates package.json metadata (description, keywords, repository, license)
      • Validates README.md and LICENSE files exist
    • Package Validation:
      • Ensures no Python files leak into TypeScript package
      • Verifies dist/ directory contents

Publishing Workflows

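These workflows are triggered when a version tag (e.g., v0.0.1) is pushed:

  • .github/workflows/python_publish.yml: publishes to PyPI
  • .github/workflows/typescript_publish.yml: publishes to npm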

Version Management

Single Source of Truth

The file version.txt contains the current version number (e.g., 0.0.1). This version must be kept in sync across every file in the project that records a version. The publish workflows automatically verify this consistency before publishing.
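
A consistency check of this kind could look roughly like the sketch below. It is not taken from the actual workflows: it assumes the version is also recorded in pyproject.toml and package.json (the two package configuration files in the project structure), and versions_in_sync is a hypothetical helper name.

```python
import json
import re


def versions_in_sync(version_txt: str, pyproject_toml: str, package_json: str) -> bool:
    """Return True when all three file contents declare the same version.

    Hypothetical helper; the arguments are file contents, not paths.
    """
    expected = version_txt.strip()
    # Assumption: pyproject.toml holds a plain `version = "x.y.z"` line.
    match = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_toml, re.MULTILINE)
    py_version = match.group(1) if match else None
    js_version = json.loads(package_json).get("version")
    return expected == py_version == js_version


in_sync = versions_in_sync(
    "0.0.1\n",
    '[project]\nname = "kotogram"\nversion = "0.0.1"\n',
    '{"name": "kotogram", "version": "0.0.1"}',
)
```

Taking file contents instead of paths keeps the check easy to test without touching the file system.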

Publishing a New Version

Use the publish.sh script to bump the version and trigger publication:

# Bump patch version (0.0.1 -> 0.0.2)
./publish.sh patch

# Bump minor version (0.0.1 -> 0.1.0)
./publish.sh minor

# Bump major version (0.0.1 -> 1.0.0)
./publish.sh major

The script will:

  1. Increment the version number
  2. Update all version files
  3. Commit the changes
  4. Create a git tag (e.g., v0.0.2)
  5. Push the commit and tag to GitHub

This triggers both python_publish.yml and typescript_publish.yml workflows.
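
The version increment in step 1 follows the usual major/minor/patch scheme shown in the examples above. A minimal Python sketch of that arithmetic, assuming plain three-part numeric versions (bump_version is a hypothetical name, and publish.sh itself may be implemented differently):

```python
def bump_version(version: str, part: str) -> str:
    """Return the next version after bumping the given part (hypothetical helper)."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # A major bump resets minor and patch.
        return f"{major + 1}.0.0"
    if part == "minor":
        # A minor bump resets patch only.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


next_patch = bump_version("0.0.1", "patch")
next_minor = bump_version("0.0.1", "minor")
next_major = bump_version("0.0.1", "major")
```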

Badges

The README includes status badges for build status, package versions, and license:

[![Python Canary](https://github.com/jomof/kotogram/actions/workflows/python_canary.yml/badge.svg?branch=main)](https://github.com/jomof/kotogram/actions/workflows/python_canary.yml)
[![TypeScript Canary](https://github.com/jomof/kotogram/actions/workflows/typescript_canary.yml/badge.svg?branch=main)](https://github.com/jomof/kotogram/actions/workflows/typescript_canary.yml)
[![PyPI Version](https://img.shields.io/pypi/v/kotogram.svg)](https://pypi.org/project/kotogram/)
[![npm Version](https://img.shields.io/npm/v/kotogram.svg)](https://www.npmjs.com/package/kotogram)
[![Python Support](https://img.shields.io/pypi/pyversions/kotogram.svg)](https://pypi.org/project/kotogram/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

Note: Update the username in badge URLs if you fork this to your own repository.

Configuration Requirements

PyPI Publishing

To publish to PyPI, configure trusted publishing:

  1. Go to PyPI → Your Account → Publishing
  2. Add a new publisher with:
    • Repository: jomof/kotogram
    • Workflow: python_publish.yml
    • Environment: pypi

npm Publishing

To publish to npm, you need an npm access token:

  1. Create an automation token on npmjs.com
  2. Add it as a GitHub secret named NPM_TOKEN
  3. Configure the npm environment in your repository settings

API Reference

JapaneseParser (Abstract Base Class)

Abstract interface for Japanese text parsing implementations.

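from abc import ABC, abstractmethod

# Shape of the interface; in your own code, import it with
# `from kotogram import JapaneseParser` instead of redefining it.
class JapaneseParser(ABC):
    @abstractmethod
    def japanese_to_kotogram(self, text: str) -> str:
        """Convert Japanese text to kotogram compact representation."""
        pass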
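
To illustrate how another backend might plug into this interface, here is a toy sketch. It defines a local stand-in for the abstract class so that it runs without the package installed. WhitespaceParser is hypothetical, and emitting tokens that carry only a surface form is an assumption made for brevity; a real backend would also emit part-of-speech and reading fields.

```python
from abc import ABC, abstractmethod


class JapaneseParser(ABC):
    """Local stand-in mirroring the interface shown above."""

    @abstractmethod
    def japanese_to_kotogram(self, text: str) -> str:
        """Convert Japanese text to kotogram compact representation."""


class WhitespaceParser(JapaneseParser):
    """Hypothetical toy backend: one token per whitespace-separated chunk."""

    def japanese_to_kotogram(self, text: str) -> str:
        # Only the surface marker (ˢ) is emitted in this sketch.
        return "".join(f"⌈ˢ{chunk}⌉" for chunk in text.split())


toy_kotogram = WhitespaceParser().japanese_to_kotogram("猫 を 食べる")
```

Because japanese_to_kotogram is abstract, the base class cannot be instantiated directly; only subclasses that implement the method can.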

MecabJapaneseParser

MeCab-based implementation using the UniDic dictionary.

from kotogram import MecabJapaneseParser

# Initialize with default settings
parser = MecabJapaneseParser()

# Or provide your own MeCab tagger instance
import MeCab
tagger = MeCab.Tagger('-d /path/to/unidic')
parser = MecabJapaneseParser(mecab_tagger=tagger)

# Enable validation mode for debugging unmapped features
parser_strict = MecabJapaneseParser(validate=True)
# This raises a descriptive KeyError if any MeCab features
# are missing from the mapping dictionaries

# Parse Japanese text
kotogram = parser.japanese_to_kotogram("今日は良い天気です")

Parameters:

  • mecab_tagger (optional): Pre-configured MeCab tagger instance
  • validate (default: False): When True, raises descriptive KeyError exceptions when encountering unmapped linguistic features. The error message includes:
    • The name of the mapping dictionary (e.g., POS_MAP, CONJUGATED_TYPE_MAP)
    • The unmapped key value
    • The raw MeCab token line for context

Validation Mode Example:

# With validate=True, unmapped features raise detailed errors
parser = MecabJapaneseParser(validate=True)
try:
    kotogram = parser.japanese_to_kotogram("未知の単語")
except KeyError as e:
    # Error message: "Missing mapping in POS_MAP: key='未知品詞' not found.
    #                 Raw MeCab token: 未知の単語\t未知品詞,..."
    print(f"Unmapped feature detected: {e}")
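
The lookup behind validation mode can be pictured with the sketch below. It is not the library's code: lookup_feature and SAMPLE_POS_MAP are hypothetical, the two map entries are illustrative guesses, and returning the raw key when validate is False is an assumption about the non-strict behavior. Only the error message wording follows the example above.

```python
# Illustrative two-entry subset; the real POS_MAP contents are not shown in this README.
SAMPLE_POS_MAP = {"名詞": "n", "動詞": "v"}


def lookup_feature(map_name, mapping, key, raw_token, validate=False):
    """Map a MeCab feature value to its kotogram code (hypothetical helper)."""
    if key in mapping:
        return mapping[key]
    if validate:
        # The message names the dictionary, the unmapped key, and the raw token line.
        raise KeyError(
            f"Missing mapping in {map_name}: key={key!r} not found. "
            f"Raw MeCab token: {raw_token}"
        )
    # Assumption: without validation, fall back to the unmapped key itself.
    return key


noun_code = lookup_feature("POS_MAP", SAMPLE_POS_MAP, "名詞", "猫\t名詞,...")
```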

Helper Functions

from kotogram import kotogram_to_japanese, split_kotogram

# A kotogram string, e.g. as returned by parser.japanese_to_kotogram()
kotogram_str = "⌈ˢ猫ᵖn:common_noun⌉⌈ˢをᵖprt:case_particle⌉"

# Convert kotogram back to Japanese
japanese = kotogram_to_japanese(kotogram_str)
japanese_with_spaces = kotogram_to_japanese(kotogram_str, spaces=True)

# Split kotogram into individual tokens
tokens = split_kotogram(kotogram_str)
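
For readers curious about what these helpers do conceptually, here is a standalone sketch based only on the token markers documented in the Kotogram Format section. The names split_tokens and surfaces_to_text are hypothetical, and the library's own implementation may differ (for example, in how it handles furigana).

```python
import re

# A token is everything between the ⌈ and ⌉ boundary markers.
TOKEN_PATTERN = re.compile(r"⌈[^⌉]*⌉")
# The surface form follows ˢ and ends at the next field marker or the token end.
SURFACE_PATTERN = re.compile(r"ˢ([^ᵖᵇᵈʳ⌉]*)")


def split_tokens(kotogram: str) -> list:
    """Return the individual ⌈...⌉ tokens in order."""
    return TOKEN_PATTERN.findall(kotogram)


def surfaces_to_text(kotogram: str, spaces: bool = False) -> str:
    """Join the surface forms of all tokens, optionally separated by spaces."""
    surfaces = []
    for token in split_tokens(kotogram):
        match = SURFACE_PATTERN.search(token)
        surfaces.append(match.group(1) if match else "")
    return (" " if spaces else "").join(surfaces)


sample = "⌈ˢ猫ᵖn:common_noun⌉⌈ˢをᵖprt:case_particle⌉⌈ˢ食べるᵖv:general:e-ichidan-ba:terminal⌉"
plain_text = surfaces_to_text(sample)
spaced_text = surfaces_to_text(sample, spaces=True)
```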

Mapping Constants

Global mapping constants are available in the kotogram.japanese_parser module:

from kotogram.japanese_parser import (
    POS_MAP,              # Part-of-speech mappings
    POS1_MAP,             # POS detail level 1
    POS2_MAP,             # POS detail level 2
    CONJUGATED_TYPE_MAP,  # Conjugation type mappings
    CONJUGATED_FORM_MAP,  # Conjugation form mappings
    POS_TO_CHARS,         # POS to character mappings
    CHAR_TO_POS,          # Character to POS mappings
)

License

MIT

Contributing

This is a template project. Feel free to fork and adapt it for your own dual-language libraries!

