Kotogram
A dual Python/TypeScript library for Japanese text parsing and encoding using the kotogram compact format.
Overview
Kotogram provides tools for parsing Japanese text into a compact, linguistically-rich format that encodes part-of-speech, conjugation, and pronunciation information. The library features:
- Abstract parser interface (JapaneseParser) for multiple backend implementations
- MeCab implementation (MecabJapaneseParser) using the UniDic dictionary
- Kotogram format: a compact representation preserving linguistic features
- Bidirectional conversion between Japanese text and kotogram format
- Dual-language support - Python and TypeScript implementations (TypeScript coming soon)
- Production-quality CI/CD with comprehensive testing and publishing workflows
Project Structure
kotogram/
├── kotogram/ # Python package
│ ├── __init__.py # Package exports and version
│ ├── japanese_parser.py # Abstract JapaneseParser interface
│ └── mecab_japanese_parser.py # MeCab implementation
├── src/ # TypeScript source
│ ├── kotogram.ts # Kotogram conversion functions
│ └── index.ts # Package exports
├── tests-py/ # Python tests
│ └── test_japanese_parser.py # Japanese parser tests
├── tests-ts/ # TypeScript tests
│ └── kotogram.test.ts
├── .github/workflows/ # CI/CD workflows
│ ├── python_canary.yml # Python build & test
│ ├── typescript_canary.yml # TypeScript build & test
│ ├── python_publish.yml # Publish to PyPI
│ └── typescript_publish.yml # Publish to npm
├── version.txt # Single source of truth for version
├── publish.sh # Version bump and publish script
├── pyproject.toml # Python package configuration
├── package.json # TypeScript package configuration
└── tsconfig.json # TypeScript compiler configuration
Quick Start
Japanese Text Parsing
Parse Japanese text into kotogram format with full linguistic information:
Python:
from kotogram import MecabJapaneseParser, kotogram_to_japanese
# Initialize parser (requires MeCab and unidic)
parser = MecabJapaneseParser()
# Convert Japanese to kotogram
japanese = "猫を食べる"
kotogram = parser.japanese_to_kotogram(japanese)
# Result: ⌈ˢ猫ᵖn:common_noun⌉⌈ˢをᵖprt:case_particle⌉⌈ˢ食べるᵖv:general:e-ichidan-ba:terminal⌉
# Convert back to Japanese
reconstructed = kotogram_to_japanese(kotogram)
# Result: "猫を食べる"
# With spaces between tokens
spaced = kotogram_to_japanese(kotogram, spaces=True)
# Result: "猫 を 食べる"
# With furigana (IME-style readings in brackets)
with_furigana = kotogram_to_japanese(kotogram, furigana=True)
# Result: "猫[ねこ]を食べる[たべる]"
# Combine options
spaced_furigana = kotogram_to_japanese(kotogram, spaces=True, furigana=True)
# Result: "猫[ねこ] を 食べる[たべる]"
TypeScript:
import { kotogramToJapanese, splitKotogram } from 'kotogram';
// A kotogram string (converting Japanese to kotogram requires the Python parser)
const kotogram = "⌈ˢ猫ᵖn:common_noun⌉⌈ˢをᵖprt:case_particle⌉⌈ˢ食べるᵖv:general:e-ichidan-ba:terminal⌉";
// Convert back to Japanese
const reconstructed = kotogramToJapanese(kotogram);
// Result: "猫を食べる"
// With spaces between tokens
const spaced = kotogramToJapanese(kotogram, { spaces: true });
// Result: "猫 を 食べる"
// With furigana (IME-style readings in brackets)
const withFurigana = kotogramToJapanese(kotogram, { furigana: true });
// Result: "猫[ねこ]を食べる[たべる]"
// Split into tokens
const tokens = splitKotogram(kotogram);
// Result: ["⌈ˢ猫ᵖn:common_noun⌉", "⌈ˢをᵖprt:case_particle⌉", "⌈ˢ食べるᵖv:general:e-ichidan-ba:terminal⌉"]
Kotogram Format
The kotogram format encodes rich linguistic information in a compact representation:
⌈ˢ食べるᵖv:general:e-ichidan-ba:terminalᵇ食べるᵈ食べるʳタベル⌉
Reading the example from left to right:
- ⌈ and ⌉: token boundary markers
- ˢ食べる: surface form (ˢ)
- ᵖv: part-of-speech (ᵖ)
- general: POS detail
- e-ichidan-ba: conjugation type
- terminal: conjugation form
- ᵇ食べる: base form (ᵇ)
- ᵈ食べる: lemma (ᵈ)
- ʳタベル: pronunciation (ʳ)
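To make the field layout concrete, here is a minimal sketch that splits a single token into named fields. It is not part of the kotogram package: parse_token and FIELD_NAMES are hypothetical names, and the sketch assumes that field values never contain one of the marker characters.

```python
import re

# Field markers as they appear in the format example above.
FIELD_NAMES = {
    "ˢ": "surface",
    "ᵖ": "pos",
    "ᵇ": "base",
    "ᵈ": "lemma",
    "ʳ": "pronunciation",
}
_MARKERS = "".join(FIELD_NAMES)
# One marker character followed by everything up to the next marker.
_FIELD_RE = re.compile(f"([{_MARKERS}])([^{_MARKERS}]*)")


def parse_token(token: str) -> dict:
    """Split one ⌈...⌉ token into named fields (hypothetical helper)."""
    if not (token.startswith("⌈") and token.endswith("⌉")):
        raise ValueError(f"not a kotogram token: {token!r}")
    fields = {FIELD_NAMES[m]: value for m, value in _FIELD_RE.findall(token[1:-1])}
    # The ᵖ field packs several colon-separated values; keep them as a list.
    if "pos" in fields:
        fields["pos"] = fields["pos"].split(":")
    return fields


fields = parse_token("⌈ˢ食べるᵖv:general:e-ichidan-ba:terminalᵇ食べるᵈ食べるʳタベル⌉")
print(fields["surface"], fields["pos"], fields["pronunciation"])
```

Tokens that carry fewer fields, such as ⌈ˢ猫ᵖn:common_noun⌉, simply produce a smaller dictionary.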
Development
Python Development
# Install in development mode
pip install -e .
# Run tests
python -m pytest tests-py/
# Run type checking
mypy kotogram/
# Build package
python -m build
TypeScript Development
# Install dependencies
npm install
# Build
npm run build
# Run tests
npm test
# Type check
npx tsc --noEmit
Testing
Python Tests
Tests are located in tests-py/ and use the unittest framework. They are also compatible with pytest.
Run tests:
python -m unittest discover -s tests-py -p 'test_*.py' -v
# or
python -m pytest tests-py/ -v
TypeScript Tests
Tests are located in tests-ts/ and use the Node.js built-in test runner.
Run tests:
npm test
GitHub Workflows
Canary Builds
These workflows run on every push, pull request, and daily at 2 AM UTC:
- .github/workflows/python_canary.yml
  - Testing: Runs on Python 3.8, 3.9, 3.10, 3.11, 3.12 with unittest and pytest
  - Code Coverage: Tracks test coverage and uploads to Codecov
  - Code Quality:
    - Black for code formatting
    - isort for import sorting
    - flake8 for linting (complexity limit: 10)
    - pylint for advanced code quality (minimum score: 8.0)
    - mypy for strict type checking
  - Security:
    - bandit for security vulnerability scanning
    - safety for dependency vulnerability checks
  - Best Practices:
    - Checks for print() statements (should use logging)
    - Detects TODO/FIXME comments
    - Validates README.md and LICENSE files exist
  - Package Validation:
    - Ensures no TypeScript/JavaScript files leak into Python package
    - Verifies package contents and structure
- .github/workflows/typescript_canary.yml
  - Testing: Runs on Node.js 18, 20, 22
  - Type Checking: Strict TypeScript type checking with --noEmit
  - Code Quality:
    - ESLint for linting (if configured)
    - Prettier for code formatting (if configured)
    - Circular dependency detection with madge
  - Performance:
    - Bundle size analysis (warns if >100KB)
  - Security:
    - npm audit for dependency vulnerabilities
  - Best Practices:
    - Checks for console.log() statements
    - Detects TODO/FIXME comments
    - Warns about any types (encourages type safety)
    - Validates package.json metadata (description, keywords, repository, license)
    - Validates README.md and LICENSE files exist
  - Package Validation:
    - Ensures no Python files leak into TypeScript package
    - Verifies dist/ directory contents
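As an illustration of the print() check listed under Best Practices, the following sketch finds bare print(...) calls with Python's ast module. This is an assumption about how such a check could be written, not a description of what python_canary.yml actually runs; find_print_calls is a hypothetical name.

```python
import ast


def find_print_calls(source: str) -> list:
    """Return the line numbers of bare print(...) calls in Python source."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "print"
    )


example = "import logging\nprint('debug')\nlogging.info('ok')\n"
print(find_print_calls(example))
```

Walking the syntax tree avoids false positives from the word "print" inside strings or comments, which a plain text search would flag.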
Publishing Workflows
These workflows are triggered when a version tag (e.g., v0.0.1) is pushed:
- .github/workflows/python_publish.yml
  - Verifies version consistency across version.txt, kotogram/__init__.py, and pyproject.toml
  - Builds and publishes to PyPI using trusted publishing
  - Verifies installation from PyPI
- .github/workflows/typescript_publish.yml
  - Verifies version consistency across version.txt and package.json
  - Builds and publishes to npm with provenance
  - Verifies installation from npm
Version Management
Single Source of Truth
The file version.txt contains the current version number (e.g., 0.0.1). This version must be kept in sync across:
- version.txt
- kotogram/__init__.py (__version__ variable)
- pyproject.toml (version field)
- package.json (version field)
The publish workflows automatically verify this consistency before publishing.
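A consistency check of this kind could look roughly like the sketch below. It is not taken from the publish workflows: read_versions and versions_consistent are hypothetical names, and the regular expressions assume that kotogram/__init__.py contains a line of the form __version__ = "x.y.z" and that pyproject.toml contains a line of the form version = "x.y.z".

```python
import json
import re
from pathlib import Path


def read_versions(root: Path) -> dict:
    """Collect the version string declared in each of the four files."""
    init_text = (root / "kotogram" / "__init__.py").read_text(encoding="utf-8")
    pyproject_text = (root / "pyproject.toml").read_text(encoding="utf-8")
    package_json = json.loads((root / "package.json").read_text(encoding="utf-8"))
    # Assumed layouts: __version__ = "x.y.z" and version = "x.y.z" on their own lines.
    init_match = re.search(r'^__version__\s*=\s*["\']([^"\']+)["\']', init_text, re.M)
    toml_match = re.search(r'^version\s*=\s*["\']([^"\']+)["\']', pyproject_text, re.M)
    return {
        "version.txt": (root / "version.txt").read_text(encoding="utf-8").strip(),
        "kotogram/__init__.py": init_match.group(1) if init_match else None,
        "pyproject.toml": toml_match.group(1) if toml_match else None,
        "package.json": package_json.get("version"),
    }


def versions_consistent(root: Path) -> bool:
    """True when every file declares the same version string."""
    return len(set(read_versions(root).values())) == 1
```

Calling versions_consistent(Path(".")) from the repository root would report whether the four files agree; read_versions shows which file is out of step when they do not.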
Publishing a New Version
Use the publish.sh script to bump the version and trigger publication:
# Bump patch version (0.0.1 -> 0.0.2)
./publish.sh patch
# Bump minor version (0.0.1 -> 0.1.0)
./publish.sh minor
# Bump major version (0.0.1 -> 1.0.0)
./publish.sh major
The script will:
- Increment the version number
- Update all version files
- Commit the changes
- Create a git tag (e.g., v0.0.2)
- Push the commit and tag to GitHub
This triggers both python_publish.yml and typescript_publish.yml workflows.
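The increment step follows the usual major.minor.patch rules shown in the examples above. The sketch below expresses that arithmetic in Python; it is an illustration only, since publish.sh is a shell script whose contents are not shown here, and bump_version is a hypothetical name.

```python
def bump_version(version: str, part: str) -> str:
    """Increment one component of a major.minor.patch version string."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


for part in ("patch", "minor", "major"):
    print(part, bump_version("0.0.1", part))
```

Bumping a component resets every component to its right to zero, which is why a minor bump of 0.0.1 gives 0.1.0.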
Badges
The README includes status badges for build status, package versions, and license:
- Python build status: https://github.com/jomof/kotogram/actions/workflows/python_canary.yml
- TypeScript build status: https://github.com/jomof/kotogram/actions/workflows/typescript_canary.yml
- PyPI package: https://pypi.org/project/kotogram/
- npm package: https://www.npmjs.com/package/kotogram
- License: LICENSE
Note: Update the username in badge URLs if you fork this to your own repository.
Configuration Requirements
PyPI Publishing
To publish to PyPI, configure trusted publishing:
- Go to PyPI → Your Account → Publishing
- Add a new publisher with:
  - Repository: jomof/kotogram
  - Workflow: python_publish.yml
  - Environment: pypi
npm Publishing
To publish to npm, you need an npm access token:
- Create an automation token on npmjs.com
- Add it as a GitHub secret named NPM_TOKEN
- Configure the npm environment in your repository settings
API Reference
JapaneseParser (Abstract Base Class)
Abstract interface for Japanese text parsing implementations.
# Importable as: from kotogram import JapaneseParser
from abc import ABC, abstractmethod

class JapaneseParser(ABC):
    @abstractmethod
    def japanese_to_kotogram(self, text: str) -> str:
        """Convert Japanese text to kotogram compact representation."""
        pass
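A new backend only has to implement japanese_to_kotogram. The sketch below shows the shape of such a subclass. To keep it runnable without the package installed, it defines a stand-in JapaneseParser that mirrors the interface above; CharacterParser and its placeholder "unknown" tag are invented for illustration and are not part of kotogram.

```python
from abc import ABC, abstractmethod


class JapaneseParser(ABC):
    """Stand-in mirroring the abstract interface documented above."""

    @abstractmethod
    def japanese_to_kotogram(self, text: str) -> str:
        """Convert Japanese text to kotogram compact representation."""


class CharacterParser(JapaneseParser):
    """Toy backend: one token per character with a placeholder POS tag."""

    def japanese_to_kotogram(self, text: str) -> str:
        # "unknown" is a made-up tag; a real backend would emit real POS data.
        return "".join(f"⌈ˢ{ch}ᵖunknown⌉" for ch in text if not ch.isspace())


print(CharacterParser().japanese_to_kotogram("猫を"))
```

Because the base class is abstract, instantiating JapaneseParser directly raises TypeError, so every backend is forced to provide the conversion method.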
MecabJapaneseParser
MeCab-based implementation using the UniDic dictionary.
from kotogram import MecabJapaneseParser
# Initialize with default settings
parser = MecabJapaneseParser()
# Or provide your own MeCab tagger instance
import MeCab
tagger = MeCab.Tagger('-d /path/to/unidic')
parser = MecabJapaneseParser(mecab_tagger=tagger)
# Enable validation mode for debugging unmapped features
parser_strict = MecabJapaneseParser(validate=True)
# This will raise descriptive KeyError if any MeCab features
# are missing from the mapping dictionaries
# Parse Japanese text
kotogram = parser.japanese_to_kotogram("今日は良い天気です")
Parameters:
- mecab_tagger (optional): Pre-configured MeCab tagger instance
- validate (default: False): When True, raises descriptive KeyError exceptions when encountering unmapped linguistic features. The error message includes:
  - The name of the mapping dictionary (e.g., POS_MAP, CONJUGATED_TYPE_MAP)
  - The unmapped key value
  - The raw MeCab token line for context
Validation Mode Example:
# With validate=True, unmapped features raise detailed errors
parser = MecabJapaneseParser(validate=True)
try:
    kotogram = parser.japanese_to_kotogram("未知の単語")
except KeyError as e:
    # Error message: "Missing mapping in POS_MAP: key='未知品詞' not found.
    # Raw MeCab token: 未知の単語\t未知品詞,..."
    print(f"Unmapped feature detected: {e}")
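The behaviour described for validation mode can be pictured as a guarded dictionary lookup. The following sketch is an assumption about how such a lookup might be structured, not the library's code: lookup_feature and DEMO_POS_MAP are hypothetical, the demo entries are illustrative, and the lenient fallback of returning the raw key when validate is False is a guess.

```python
# Illustrative entries only; the real POS_MAP lives in kotogram.japanese_parser.
DEMO_POS_MAP = {"名詞": "n", "動詞": "v"}


def lookup_feature(mapping, map_name, key, raw_token, validate=False):
    """Look up a MeCab feature, raising a descriptive KeyError in strict mode."""
    try:
        return mapping[key]
    except KeyError:
        if validate:
            raise KeyError(
                f"Missing mapping in {map_name}: key={key!r} not found. "
                f"Raw MeCab token: {raw_token}"
            ) from None
        # Assumed lenient behaviour: pass the unmapped key through unchanged.
        return key


print(lookup_feature(DEMO_POS_MAP, "POS_MAP", "名詞", "猫\t名詞,..."))
try:
    lookup_feature(DEMO_POS_MAP, "POS_MAP", "未知品詞", "未知の単語\t未知品詞,...", validate=True)
except KeyError as err:
    print(f"Unmapped feature detected: {err}")
```

Including the dictionary name and the raw token in the message makes it possible to tell which table needs a new entry without re-running MeCab by hand.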
Helper Functions
from kotogram import kotogram_to_japanese, split_kotogram
# Convert kotogram back to Japanese
japanese = kotogram_to_japanese(kotogram_str)
japanese_with_spaces = kotogram_to_japanese(kotogram_str, spaces=True)
# Split kotogram into individual tokens
tokens = split_kotogram(kotogram_str)
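For readers curious about what these helpers do conceptually, here is a standalone sketch that splits a kotogram string on its ⌈ ⌉ boundaries and rebuilds the text from the ˢ surface fields. It is a simplified re-creation under the assumption that surface text never contains a marker character; split_tokens and surfaces_to_japanese are hypothetical names, and the library's own functions may work differently (for example, this sketch has no furigana support).

```python
import re

# A token is everything from ⌈ up to the matching ⌉.
_TOKEN_RE = re.compile(r"⌈[^⌉]*⌉")
# The surface form follows ˢ and ends at the next field marker or the closing ⌉.
_SURFACE_RE = re.compile(r"ˢ([^ᵖᵇᵈʳ⌉]*)")


def split_tokens(kotogram: str) -> list:
    """Return the individual ⌈...⌉ tokens in order."""
    return _TOKEN_RE.findall(kotogram)


def surfaces_to_japanese(kotogram: str, spaces: bool = False) -> str:
    """Rebuild the text by joining each token's surface form."""
    surfaces = []
    for token in split_tokens(kotogram):
        match = _SURFACE_RE.search(token)
        if match:
            surfaces.append(match.group(1))
    return (" " if spaces else "").join(surfaces)


sample = "⌈ˢ猫ᵖn:common_noun⌉⌈ˢをᵖprt:case_particle⌉⌈ˢ食べるᵖv:general:e-ichidan-ba:terminal⌉"
print(split_tokens(sample))
print(surfaces_to_japanese(sample))
print(surfaces_to_japanese(sample, spaces=True))
```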
Mapping Constants
Global mapping constants are available in the kotogram.japanese_parser module:
from kotogram.japanese_parser import (
POS_MAP, # Part-of-speech mappings
POS1_MAP, # POS detail level 1
POS2_MAP, # POS detail level 2
CONJUGATED_TYPE_MAP, # Conjugation type mappings
CONJUGATED_FORM_MAP, # Conjugation form mappings
POS_TO_CHARS, # POS to character mappings
CHAR_TO_POS, # Character to POS mappings
)
License
MIT
Contributing
This is a template project. Feel free to fork and adapt it for your own dual-language libraries!