Extract body text from Japanese business emails
yomail (読メール)
yomail extracts body text from Japanese business emails. It uses a CRF (Conditional Random Field) model to classify each line, then assembles the body from labeled lines.
Features
- Handles formal and informal Japanese business emails
- Detects and excludes signatures, greetings, closings, quoted content
- Works with forwarded emails, replies, and inline quotes
- Returns confidence scores for quality control
- Small model size (12 KB)
- Fast inference (~10-30 ms)
Installation
pip install yomail
Requires Python 3.13+.
Usage
from yomail import EmailBodyExtractor
extractor = EmailBodyExtractor()
# Raises on failure
body = extractor.extract(email_text)
# Returns None on failure
body = extractor.extract_safe(email_text)
# Full result with metadata
result = extractor.extract_with_metadata(email_text)
print(result.body)
print(result.confidence)
print(result.signature_detected)
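The confidence score can feed a simple review queue. The sketch below is not part of yomail: `triage` and `review_below` are hypothetical names, and `FakeResult` is a stand-in that mirrors only the `body` and `confidence` attributes shown above, so the snippet runs without the package installed. It assumes `confidence` is a float where higher means more reliable.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol, Tuple


class HasBodyAndConfidence(Protocol):
    """Anything exposing the two attributes used for triage."""
    body: str
    confidence: float


@dataclass
class FakeResult:
    """Stand-in for the object returned by extract_with_metadata()."""
    body: str
    confidence: float


def triage(
    results: Iterable[HasBodyAndConfidence],
    review_below: float = 0.8,  # hypothetical cut-off, tune for your data
) -> Tuple[List[HasBodyAndConfidence], List[HasBodyAndConfidence]]:
    """Split results into (accepted, needs_review) by confidence."""
    accepted: List[HasBodyAndConfidence] = []
    needs_review: List[HasBodyAndConfidence] = []
    for result in results:
        if result.confidence >= review_below:
            accepted.append(result)
        else:
            needs_review.append(result)
    return accepted, needs_review
```

In real use, the list passed to `triage` would hold the objects returned by `extract_with_metadata` rather than `FakeResult` instances.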
Example
Input:
株式会社サンプル
田中様
お世話になっております。
山田です。
先日ご依頼いただいた資料を添付いたします。
ご確認のほどよろしくお願いいたします。
以上
--
山田太郎
株式会社テスト
TEL: 03-1234-5678
Output:
お世話になっております。
山田です。
先日ご依頼いただいた資料を添付いたします。
ご確認のほどよろしくお願いいたします。
以上
How It Works
The extraction pipeline:
- Normalize — Line endings, neologdn normalization, NFKC
- Analyze structure — Quote depth, forward/reply headers, delimiters
- Extract features — Position, character ratios, pattern matches
- Label with CRF — GREETING, BODY, CLOSING, SIGNATURE, QUOTE, OTHER
- Assemble body — Find signature boundary, handle inline quotes, merge blocks
See ARCHITECTURE.md for details.
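To make the first three steps concrete, here is a minimal standard-library sketch. It is an illustration under stated assumptions, not yomail's implementation: the function names, the feature names, and the delimiter pattern are all hypothetical, and the neologdn step is left out so the snippet has no third-party dependencies.

```python
import re
import unicodedata
from typing import Dict, List


def normalize(text: str) -> List[str]:
    """Unify line endings, apply NFKC, and split into lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return unicodedata.normalize("NFKC", text).split("\n")


# One or more '>' markers at the start of a line, optionally space-separated.
_QUOTE_PREFIX = re.compile(r"^(?:\s*>)+")
# Assumed delimiter shape: a line made only of repeated -, =, _ or *.
_DELIMITER = re.compile(r"\s*[-=_*]{2,}\s*")


def quote_depth(line: str) -> int:
    """Count leading '>' markers, so '> > text' has depth 2."""
    match = _QUOTE_PREFIX.match(line)
    return match.group(0).count(">") if match else 0


def _is_japanese(ch: str) -> bool:
    # Hiragana and katakana (U+3040-U+30FF) plus the main CJK ideograph block.
    return "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"


def line_features(lines: List[str], i: int) -> Dict[str, float]:
    """Position, character-ratio, and pattern features for line i."""
    line = lines[i]
    length = max(len(line), 1)  # avoid division by zero on blank lines
    return {
        "rel_position": i / max(len(lines) - 1, 1),
        "quote_depth": float(quote_depth(line)),
        "japanese_ratio": sum(_is_japanese(c) for c in line) / length,
        "digit_ratio": sum(c.isdigit() for c in line) / length,
        "is_delimiter": float(bool(_DELIMITER.fullmatch(line))),
    }
```

Feature dictionaries of this shape, one per line, are the kind of input a CRF tagger consumes. The actual feature set is described in ARCHITECTURE.md.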
Label Scheme
| Label | Description |
|---|---|
| GREETING | Opening (お世話になっております) |
| BODY | Main content |
| CLOSING | Closing (よろしくお願いいたします) |
| SIGNATURE | Sender information |
| QUOTE | Quoted content |
| OTHER | Separators, blank lines |
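As a rough illustration of how labeled lines could be turned into a body, the sketch below keeps lines up to the first SIGNATURE line and filters by label. This is a simplification and not yomail's assembly logic: `assemble_body` and `DEFAULT_KEEP` are hypothetical names, the choice to keep GREETING and CLOSING lines is inferred from the Example output, and inline-quote handling is omitted.

```python
from enum import Enum
from typing import FrozenSet, List, Sequence


class Label(str, Enum):
    GREETING = "GREETING"
    BODY = "BODY"
    CLOSING = "CLOSING"
    SIGNATURE = "SIGNATURE"
    QUOTE = "QUOTE"
    OTHER = "OTHER"


# Assumption: these labels contribute to the extracted body.
DEFAULT_KEEP: FrozenSet[Label] = frozenset(
    {Label.GREETING, Label.BODY, Label.CLOSING}
)


def assemble_body(
    lines: Sequence[str],
    labels: Sequence[Label],
    keep: FrozenSet[Label] = DEFAULT_KEEP,
) -> str:
    """Join kept lines, stopping at the first SIGNATURE line."""
    if len(lines) != len(labels):
        raise ValueError("lines and labels must have the same length")
    kept: List[str] = []
    for line, label in zip(lines, labels):
        if label is Label.SIGNATURE:
            break  # assumed signature boundary: drop everything after it
        if label in keep:
            kept.append(line)
    return "\n".join(kept)
```

Passing a different `keep` set, for example `frozenset({Label.BODY})`, would return only the main content.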
Performance
Evaluated on 19,642 synthetic test emails:
| Metric | Value |
|---|---|
| Content match | 97.9% |
| Acceptable rate | 98.0% |
| Confident wrong | 0.14% |
See PERFORMANCE.md for details.
Exceptions
from yomail import (
    ExtractionError,       # Base class
    InvalidInputError,     # Empty or invalid input
    NoBodyDetectedError,   # No body found
    LowConfidenceError,    # Confidence below threshold
)
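One way to use these exceptions is to catch the specific classes first and fall back to the base class. The sketch below is hedged: `extract_or_reason` and its reason strings are hypothetical, the fallback classes exist only so the snippet runs when yomail is not installed, and the subclass relationships are assumed from the "Base class" comment above.

```python
from typing import Optional, Tuple

try:
    from yomail import (
        ExtractionError,
        InvalidInputError,
        LowConfidenceError,
        NoBodyDetectedError,
    )
except ImportError:
    # Stand-ins for illustration only; they mirror the names listed above.
    class ExtractionError(Exception):
        pass

    class InvalidInputError(ExtractionError):
        pass

    class NoBodyDetectedError(ExtractionError):
        pass

    class LowConfidenceError(ExtractionError):
        pass


def extract_or_reason(extractor, email_text: str) -> Tuple[Optional[str], str]:
    """Return (body, "ok") on success, or (None, reason) on failure."""
    try:
        return extractor.extract(email_text), "ok"
    except InvalidInputError:
        return None, "invalid_input"
    except NoBodyDetectedError:
        return None, "no_body"
    except LowConfidenceError:
        return None, "low_confidence"
    except ExtractionError:
        # Catch-all for any other error derived from the base class.
        return None, "extraction_error"
```

The specific handlers come before `ExtractionError` because Python picks the first matching `except` clause.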
Configuration
extractor = EmailBodyExtractor(
    model_path="path/to/model.crfsuite",  # Custom model
    confidence_threshold=0.5,             # Minimum confidence
)
Development
# Setup
uv sync
# Run tests
uv run pytest
# Type check
uv run ty check
# Lint
uv run ruff check .
Training
Training data is generated by the yasumail project.
# Train model
python scripts/train.py data/training.jsonl -o models/email_body.crfsuite
# Evaluate
python scripts/evaluate.py data/test.jsonl
Dependencies
- neologdn — Japanese text normalization
- python-crfsuite — CRF implementation
- PyYAML — Name data loading