Kotogram

A dual Python/TypeScript library for Japanese text parsing and encoding using the kotogram compact format.

Overview

Kotogram provides tools for parsing Japanese text into a compact, linguistically rich format that encodes part-of-speech, conjugation, and pronunciation information. The library features:

  • Abstract parser interface (JapaneseParser) for backend implementations
  • Sudachi implementation (SudachiJapaneseParser) using SudachiPy with the full dictionary
  • Kotogram format - compact representation preserving linguistic features
  • Bidirectional conversion between Japanese text and kotogram format
  • Dual-language support - Python and TypeScript implementations (TypeScript coming soon)
  • Production-quality CI/CD with comprehensive testing and publishing workflows

Project Structure

kotogram/
├── kotogram/                    # Python package
│   ├── __init__.py             # Package exports and version
│   ├── japanese_parser.py      # Abstract JapaneseParser interface
│   └── sudachi_japanese_parser.py # Sudachi implementation
├── src/                         # TypeScript source
│   ├── kotogram.ts             # Kotogram conversion functions
│   └── index.ts                # Package exports
├── tests-py/                    # Python tests
│   └── test_japanese_parser.py # Japanese parser tests
├── tests-ts/                    # TypeScript tests
│   └── kotogram.test.ts
├── .github/workflows/           # CI/CD workflows
│   ├── python_canary.yml       # Python build & test
│   ├── typescript_canary.yml   # TypeScript build & test
│   ├── python_publish.yml      # Publish to PyPI
│   └── typescript_publish.yml  # Publish to npm
├── version.txt                  # Single source of truth for version
├── publish.sh                  # Version bump and publish script
├── pyproject.toml              # Python package configuration
├── package.json                # TypeScript package configuration
└── tsconfig.json               # TypeScript compiler configuration

Quick Start

Japanese Text Parsing

Parse Japanese text into kotogram format with full linguistic information:

Python:

from kotogram import SudachiJapaneseParser, kotogram_to_japanese

# Initialize parser (requires sudachipy and sudachidict_full)
parser = SudachiJapaneseParser(dict_type='full')

# Convert Japanese to kotogram
japanese = "猫を食べる"
kotogram = parser.japanese_to_kotogram(japanese)
# Result: ⌈ˢ猫ᵖn:common_noun⌉⌈ˢをᵖprt:case_particle⌉⌈ˢ食べるᵖv:general:e-ichidan-ba:terminal⌉

# Convert back to Japanese
reconstructed = kotogram_to_japanese(kotogram)
# Result: "猫を食べる"

# With spaces between tokens
spaced = kotogram_to_japanese(kotogram, spaces=True)
# Result: "猫 を 食べる"

# With furigana (IME-style readings in brackets)
with_furigana = kotogram_to_japanese(kotogram, furigana=True)
# Result: "猫[ねこ]を食べる[たべる]"

# Combine options
spaced_furigana = kotogram_to_japanese(kotogram, spaces=True, furigana=True)
# Result: "猫[ねこ] を 食べる[たべる]"
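
The furigana output above shows hiragana readings, while the pronunciation field in the kotogram format is written in katakana (e.g. タベル). The following is a minimal sketch of one way such a reading could be rendered, not the library's implementation. The helper names katakana_to_hiragana and with_reading are hypothetical, and the rule "omit the brackets when the surface already equals its reading" is an assumption inferred from the example output.

```python
def katakana_to_hiragana(text: str) -> str:
    """Shift katakana letters (U+30A1..U+30F6) down by 0x60 to the matching hiragana."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def with_reading(surface: str, katakana_reading: str) -> str:
    """Append [reading] unless the surface is already kana equal to the reading (assumed rule)."""
    reading = katakana_to_hiragana(katakana_reading)
    if katakana_to_hiragana(surface) == reading:
        return surface
    return f"{surface}[{reading}]"


# Hypothetical (surface, pronunciation) pairs for the three tokens of the example
pairs = [("猫", "ネコ"), ("を", "ヲ"), ("食べる", "タベル")]
print("".join(with_reading(s, r) for s, r in pairs))
# 猫[ねこ]を食べる[たべる]
```

The code-point shift works because the katakana and hiragana blocks are laid out in parallel in Unicode; characters outside that range, such as the long-vowel mark ー, pass through unchanged.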

TypeScript:

import { kotogramToJapanese, splitKotogram } from 'kotogram';

// A kotogram string (producing one from Japanese requires the Python parser)
const kotogram = "⌈ˢ猫ᵖn:common_noun⌉⌈ˢをᵖprt:case_particle⌉⌈ˢ食べるᵖv:general:e-ichidan-ba:terminal⌉";

// Convert back to Japanese
const reconstructed = kotogramToJapanese(kotogram);
// Result: "猫を食べる"

// With spaces between tokens
const spaced = kotogramToJapanese(kotogram, { spaces: true });
// Result: "猫 を 食べる"

// With furigana (IME-style readings in brackets)
const withFurigana = kotogramToJapanese(kotogram, { furigana: true });
// Result: "猫[ねこ]を食べる[たべる]"

// Split into tokens
const tokens = splitKotogram(kotogram);
// Result: ["⌈ˢ猫ᵖn:common_noun⌉", "⌈ˢをᵖprt:case_particle⌉", "⌈ˢ食べるᵖv:general:e-ichidan-ba:terminal⌉"]
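
To make the splitting and reconstruction steps concrete, here is a rough Python sketch of how a kotogram string could be cut into tokens and turned back into surface text. It is not the library's code: the function names split_tokens_sketch and surfaces_sketch are hypothetical, and the sketch assumes that ⌈ and ⌉ never occur inside a token and that the surface field ends at the next marker character.

```python
import re

# A token is everything between one ⌈ and the next ⌉ (assumed never to nest).
_TOKEN_RE = re.compile(r"⌈[^⌉]*⌉")
# The surface form follows ˢ and runs until the next field marker or token end.
_SURFACE_RE = re.compile(r"ˢ([^ˢᵖᵇᵈʳ⌉]*)")


def split_tokens_sketch(kotogram: str) -> list:
    """Return the individual ⌈...⌉ tokens in order."""
    return _TOKEN_RE.findall(kotogram)


def surfaces_sketch(kotogram: str, spaces: bool = False) -> str:
    """Join the surface forms of all tokens, optionally separated by spaces."""
    parts = []
    for token in split_tokens_sketch(kotogram):
        match = _SURFACE_RE.search(token)
        parts.append(match.group(1) if match else "")
    return (" " if spaces else "").join(parts)


sample = "⌈ˢ猫ᵖn:common_noun⌉⌈ˢをᵖprt:case_particle⌉⌈ˢ食べるᵖv:general:e-ichidan-ba:terminal⌉"
print(split_tokens_sketch(sample))
print(surfaces_sketch(sample))
print(surfaces_sketch(sample, spaces=True))
```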

Kotogram Format

The kotogram format encodes rich linguistic information in a compact representation:

⌈ˢ食べるᵖv:general:e-ichidan-ba:terminalᵇ食べるᵈ食べるʳタベル⌉
  │  │    │ │       │            │         │      │      │
  │  │    │ │       │            │         │      │      └─ pronunciation (ʳ)
  │  │    │ │       │            │         │      └─ lemma (ᵈ)
  │  │    │ │       │            │         └─ base form (ᵇ)
  │  │    │ │       │            └─ conjugation form
  │  │    │ │       └─ conjugation type
  │  │    │ └─ POS detail
  │  │    └─ part-of-speech (ᵖ)
  │  └─ surface form (ˢ)
  └─ token boundary markers (⌈⌉)
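
The diagram can be read as a small grammar: a token is wrapped in ⌈⌉, and each field starts with a one-character marker. The sketch below decodes a single token into a dictionary under that reading. It is an illustration only; parse_token_sketch and the field names are hypothetical, the marker set is limited to the five shown in the diagram, and it assumes field values never contain a marker character.

```python
import re

# Marker characters as shown in the diagram; other markers may exist in the real format.
FIELD_NAMES = {
    "ˢ": "surface",
    "ᵖ": "pos",
    "ᵇ": "base",
    "ᵈ": "lemma",
    "ʳ": "pronunciation",
}
_FIELD_RE = re.compile(r"([ˢᵖᵇᵈʳ])([^ˢᵖᵇᵈʳ]*)")


def parse_token_sketch(token: str) -> dict:
    """Decode one ⌈...⌉ token into a dict; the POS field is split on ':'."""
    if not (token.startswith("⌈") and token.endswith("⌉")):
        raise ValueError(f"not a kotogram token: {token!r}")
    fields = {FIELD_NAMES[marker]: value for marker, value in _FIELD_RE.findall(token[1:-1])}
    if "pos" in fields:
        # POS, POS detail, conjugation type, and conjugation form are colon-separated.
        fields["pos"] = fields["pos"].split(":")
    return fields


token = "⌈ˢ食べるᵖv:general:e-ichidan-ba:terminalᵇ食べるᵈ食べるʳタベル⌉"
print(parse_token_sketch(token))
```

Fields that are absent from a token (for example, the shorter tokens in the Quick Start) are simply missing from the returned dictionary.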

Development

Python Development

# Install in development mode
pip install -e .

# Run tests
python -m pytest tests-py/

# Run type checking
mypy kotogram/

# Build package
python -m build

TypeScript Development

# Install dependencies
npm install

# Build
npm run build

# Run tests
npm test

# Type check
npx tsc --noEmit

Testing

Python Tests

Tests are located in tests-py/ and use the unittest framework. They are also compatible with pytest.

Run tests:

python -m unittest discover -s tests-py -p 'test_*.py' -v
# or
python -m pytest tests-py/ -v

TypeScript Tests

Tests are located in tests-ts/ and use the Node.js built-in test runner.

Run tests:

npm test

GitHub Workflows

Canary Builds

These workflows run on every push and pull request, and daily at 2 AM UTC:

  • .github/workflows/python_canary.yml

    • Testing: Runs on Python 3.8, 3.9, 3.10, 3.11, 3.12 with unittest and pytest
    • Code Coverage: Tracks test coverage and uploads to Codecov
    • Code Quality:
      • Black for code formatting
      • isort for import sorting
      • flake8 for linting (complexity limit: 10)
      • pylint for advanced code quality (minimum score: 8.0)
      • mypy for strict type checking
    • Security:
      • bandit for security vulnerability scanning
      • safety for dependency vulnerability checks
    • Best Practices:
      • Checks for print() statements (should use logging)
      • Detects TODO/FIXME comments
      • Validates README.md and LICENSE files exist
    • Package Validation:
      • Ensures no TypeScript/JavaScript files leak into Python package
      • Verifies package contents and structure
  • .github/workflows/typescript_canary.yml

    • Testing: Runs on Node.js 18, 20, 22
    • Type Checking: Strict TypeScript type checking with --noEmit
    • Code Quality:
      • ESLint for linting (if configured)
      • Prettier for code formatting (if configured)
      • Circular dependency detection with madge
    • Performance:
      • Bundle size analysis (warns if >100KB)
    • Security:
      • npm audit for dependency vulnerabilities
    • Best Practices:
      • Checks for console.log() statements
      • Detects TODO/FIXME comments
      • Warns about uses of the any type (encourages type safety)
      • Validates package.json metadata (description, keywords, repository, license)
      • Validates README.md and LICENSE files exist
    • Package Validation:
      • Ensures no Python files leak into TypeScript package
      • Verifies dist/ directory contents
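
As an illustration of the "checks for print() statements" item in the Python canary, the sketch below finds print calls by walking the syntax tree. How the workflow actually performs this check is not stated here, so this is only one possible approach; find_print_calls is a hypothetical helper name.

```python
import ast


def find_print_calls(source: str) -> list:
    """Return the line numbers of calls to the bare name print in the given source."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "print"
    )


example = 'import logging\nprint("debug")\nlogging.info("ok")\ndef f():\n    print(1)\n'
print(find_print_calls(example))
```

Unlike a plain text search, walking the tree ignores the word "print" inside strings and comments.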

Publishing Workflows

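These workflows are triggered when a version tag (e.g., v0.0.1) is pushed:

  • .github/workflows/python_publish.yml: publishes the Python package to PyPI
  • .github/workflows/typescript_publish.yml: publishes the TypeScript package to npm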

Version Management

Single Source of Truth

The file version.txt contains the current version number (e.g., 0.0.1). This version must be kept in sync across all of the project's version files. The publish workflows automatically verify this consistency before publishing.
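
A consistency check of this kind could look roughly like the sketch below. It is an assumption that pyproject.toml and package.json both carry a literal version field that must match version.txt; the workflows' real check is not shown in this document, and find_version_mismatches is a hypothetical name.

```python
import json
import re
from typing import List


def find_version_mismatches(expected: str, pyproject_text: str, package_json_text: str) -> List[str]:
    """Compare the expected version against the two package manifests (assumed locations)."""
    mismatches = []

    # Assumes a literal `version = "x.y.z"` line in pyproject.toml.
    match = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, re.MULTILINE)
    py_version = match.group(1) if match else None
    if py_version != expected:
        mismatches.append(f"pyproject.toml: {py_version!r}")

    npm_version = json.loads(package_json_text).get("version")
    if npm_version != expected:
        mismatches.append(f"package.json: {npm_version!r}")

    return mismatches


pyproject = '[project]\nname = "kotogram"\nversion = "0.0.2"\n'
package_json = '{"name": "kotogram", "version": "0.0.1"}'
print(find_version_mismatches("0.0.2", pyproject, package_json))
```

In a real check the three inputs would be read from version.txt, pyproject.toml, and package.json, and a non-empty result would fail the job.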

Publishing a New Version

Use the publish.sh script to bump the version and trigger publication:

# Bump patch version (0.0.1 -> 0.0.2)
./publish.sh patch

# Bump minor version (0.0.1 -> 0.1.0)
./publish.sh minor

# Bump major version (0.0.1 -> 1.0.0)
./publish.sh major

The script will:

  1. Increment the version number
  2. Update all version files
  3. Commit the changes
  4. Create a git tag (e.g., v0.0.2)
  5. Push the commit and tag to GitHub

This triggers both python_publish.yml and typescript_publish.yml workflows.
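
The increment in step 1 follows the usual major/minor/patch convention shown in the examples above. A minimal Python sketch of that arithmetic is given below; publish.sh itself is a shell script whose contents are not reproduced here, and bump_version is a hypothetical name.

```python
def bump_version(version: str, part: str) -> str:
    """Increment one component of a MAJOR.MINOR.PATCH string and reset the lower ones."""
    major, minor, patch = (int(piece) for piece in version.strip().split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r} (expected 'major', 'minor', or 'patch')")


for part in ("patch", "minor", "major"):
    print(part, bump_version("0.0.1", part))
```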

Badges

The README includes status badges for build status, package versions, and license:

[![Python Canary](https://github.com/jomof/kotogram/actions/workflows/python_canary.yml/badge.svg?branch=main)](https://github.com/jomof/kotogram/actions/workflows/python_canary.yml)
[![TypeScript Canary](https://github.com/jomof/kotogram/actions/workflows/typescript_canary.yml/badge.svg?branch=main)](https://github.com/jomof/kotogram/actions/workflows/typescript_canary.yml)
[![PyPI Version](https://img.shields.io/pypi/v/kotogram.svg)](https://pypi.org/project/kotogram/)
[![npm Version](https://img.shields.io/npm/v/kotogram.svg)](https://www.npmjs.com/package/kotogram)
[![Python Support](https://img.shields.io/pypi/pyversions/kotogram.svg)](https://pypi.org/project/kotogram/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

Note: Update the username in badge URLs if you fork this to your own repository.

Configuration Requirements

PyPI Publishing

To publish to PyPI, configure trusted publishing:

  1. Go to PyPI → Your Account → Publishing
  2. Add a new publisher with:
    • Repository: jomof/kotogram
    • Workflow: python_publish.yml
    • Environment: pypi

npm Publishing

To publish to npm, you need an npm access token:

  1. Create an automation token on npmjs.com
  2. Add it as a GitHub secret named NPM_TOKEN
  3. Configure the npm environment in your repository settings

API Reference

JapaneseParser (Abstract Base Class)

Abstract interface for Japanese text parsing implementations.

from abc import ABC, abstractmethod

# Shape of the interface; the real class is importable with
#   from kotogram import JapaneseParser
class JapaneseParser(ABC):
    @abstractmethod
    def japanese_to_kotogram(self, text: str) -> str:
        """Convert Japanese text to kotogram compact representation."""
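
A new backend only has to implement japanese_to_kotogram. The toy example below shows the pattern with a local stand-in for the abstract class so that it runs without the package installed. WhitespaceParser is hypothetical, and emitting tokens that carry only a surface field is an assumption about what the format tolerates, not something this document confirms.

```python
from abc import ABC, abstractmethod


class JapaneseParser(ABC):
    """Local stand-in mirroring the interface above (the real one lives in kotogram)."""

    @abstractmethod
    def japanese_to_kotogram(self, text: str) -> str:
        """Convert Japanese text to kotogram compact representation."""


class WhitespaceParser(JapaneseParser):
    """Toy backend: one token per whitespace-separated chunk, surface field only."""

    def japanese_to_kotogram(self, text: str) -> str:
        return "".join(f"⌈ˢ{chunk}⌉" for chunk in text.split())


print(WhitespaceParser().japanese_to_kotogram("猫 を 食べる"))
```

Because the base class is abstract, instantiating JapaneseParser directly raises TypeError; only subclasses that define the method can be created.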

SudachiJapaneseParser

Sudachi-based implementation using SudachiPy with the full dictionary.

from kotogram import SudachiJapaneseParser

# Initialize with full dictionary (recommended)
parser = SudachiJapaneseParser(dict_type='full')

# Or use smaller dictionaries for faster loading
parser_small = SudachiJapaneseParser(dict_type='small')
parser_core = SudachiJapaneseParser(dict_type='core')

# Enable validation mode for debugging unmapped features
parser_strict = SudachiJapaneseParser(dict_type='full', validate=True)
# This raises a descriptive KeyError if any Sudachi feature
# is missing from the mapping dictionaries

# Parse Japanese text
kotogram = parser.japanese_to_kotogram("今日は良い天気です")

Parameters:

  • dict_type (default: 'full'): Dictionary type to use ('small', 'core', or 'full')
  • validate (default: False): When True, raises descriptive KeyError exceptions when encountering unmapped linguistic features. The error message includes:
    • The name of the mapping dictionary (e.g., POS_MAP, CONJUGATED_TYPE_MAP)
    • The unmapped key value

Validation Mode Example:

# With validate=True, unmapped features raise detailed errors
parser = SudachiJapaneseParser(dict_type='full', validate=True)
try:
    kotogram = parser.japanese_to_kotogram("未知の単語")
except KeyError as e:
    # Error message: "Missing mapping in POS_MAP: key='未知品詞' not found."
    print(f"Unmapped feature detected: {e}")
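
The behavior described for validate can be pictured as a guarded dictionary lookup. The sketch below reproduces the error message format quoted above; it is not the parser's actual code. lookup_feature and DEMO_POS_MAP are hypothetical, the single mapping entry is illustrative, and returning the key unchanged when validate is False is an assumption about the lenient path.

```python
from typing import Dict


def lookup_feature(mapping: Dict[str, str], map_name: str, key: str, validate: bool = False) -> str:
    """Look up a feature; in validate mode, report which map lacks which key."""
    if key in mapping:
        return mapping[key]
    if validate:
        raise KeyError(f"Missing mapping in {map_name}: key='{key}' not found.")
    return key  # assumed lenient fallback: pass the raw feature through


DEMO_POS_MAP = {"名詞": "n"}  # illustrative entry, not the library's table

print(lookup_feature(DEMO_POS_MAP, "POS_MAP", "名詞", validate=True))
try:
    lookup_feature(DEMO_POS_MAP, "POS_MAP", "未知品詞", validate=True)
except KeyError as err:
    print(err.args[0])
```

Note that str() on a KeyError wraps the message in quotes, so err.args[0] is used to get the plain text.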

Helper Functions

from kotogram import kotogram_to_japanese, split_kotogram

# Example input (taken from the Quick Start output)
kotogram_str = "⌈ˢ猫ᵖn:common_noun⌉⌈ˢをᵖprt:case_particle⌉"

# Convert kotogram back to Japanese
japanese = kotogram_to_japanese(kotogram_str)
japanese_with_spaces = kotogram_to_japanese(kotogram_str, spaces=True)

# Split kotogram into individual tokens
tokens = split_kotogram(kotogram_str)

Mapping Constants

Global mapping constants are available in the kotogram.japanese_parser module:

from kotogram.japanese_parser import (
    POS_MAP,              # Part-of-speech mappings
    POS1_MAP,             # POS detail level 1
    POS2_MAP,             # POS detail level 2
    CONJUGATED_TYPE_MAP,  # Conjugation type mappings
    CONJUGATED_FORM_MAP,  # Conjugation form mappings
    POS_TO_CHARS,         # POS to character mappings
    CHAR_TO_POS,          # Character to POS mappings
)

License

MIT

Contributing

This is a template project. Feel free to fork and adapt it for your own dual-language libraries!
