Extract body text from Japanese business emails

yomail (読メール)

yomail extracts body text from Japanese business emails. It uses a CRF (Conditional Random Field) model to classify each line, then assembles the body from labeled lines.

Features

  • Handles formal and informal Japanese business emails
  • Detects and excludes signatures, greetings, closings, quoted content
  • Works with forwarded emails, replies, and inline quotes
  • Returns confidence scores for quality control
  • Small model size (12 KB)
  • Fast inference (~10-30 ms)

Installation

pip install yomail

Requires Python 3.13+.

Usage

from yomail import EmailBodyExtractor

extractor = EmailBodyExtractor()

# Raises on failure
body = extractor.extract(email_text)

# Returns None on failure
body = extractor.extract_safe(email_text)

# Full result with metadata
result = extractor.extract_with_metadata(email_text)
print(result.body)
print(result.confidence)
print(result.signature_detected)
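
The confidence score can drive a simple quality-control gate. The following is a minimal sketch, not part of yomail: `triage` and `review_threshold` are hypothetical names, and the only assumption about the result object is that it exposes the `body` and `confidence` attributes shown above.

```python
from types import SimpleNamespace


def triage(result, review_threshold=0.8):
    """Route an extraction result to 'accept', 'review', or 'reject'.

    `review_threshold` is an illustrative value, not a yomail default.
    """
    # No result at all, or an empty body: nothing usable was extracted.
    if result is None or not result.body:
        return "reject"
    # High-confidence results pass straight through.
    if result.confidence >= review_threshold:
        return "accept"
    # Everything else goes to a human for a second look.
    return "review"


# Stand-in objects with the same attributes as the metadata result.
print(triage(SimpleNamespace(body="本文", confidence=0.95)))
print(triage(SimpleNamespace(body="本文", confidence=0.55)))
```

In practice the threshold would be tuned against a labeled sample of your own mail.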

Example

Input:

株式会社サンプル
田中様

お世話になっております。
山田です。

先日ご依頼いただいた資料を添付いたします。
ご確認のほどよろしくお願いいたします。

以上

--
山田太郎
株式会社テスト
TEL: 03-1234-5678

Output:

お世話になっております。
山田です。

先日ご依頼いただいた資料を添付いたします。
ご確認のほどよろしくお願いいたします。

以上

How It Works

The extraction pipeline:

  1. Normalize — Line endings, neologdn normalization, NFKC
  2. Analyze structure — Quote depth, forward/reply headers, delimiters
  3. Extract features — Position, character ratios, pattern matches
  4. Label with CRF — GREETING, BODY, CLOSING, SIGNATURE, QUOTE, OTHER
  5. Assemble body — Find signature boundary, handle inline quotes, merge blocks

See ARCHITECTURE.md for details.
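
To make steps 1 to 3 concrete, here is a rough standard-library sketch of what normalization, quote-depth analysis, and per-line feature extraction might look like. It is an illustration under assumptions, not yomail's actual code: the function names, the feature names, and the delimiter pattern are all hypothetical, and the neologdn step is left out because it needs a third-party package.

```python
import re
import unicodedata

# Leading run of '>' markers, optionally separated by whitespace.
QUOTE_PREFIX = re.compile(r"^(?:\s*>)+")
# A line made only of repeated separator characters (assumed pattern).
DELIMITER = re.compile(r"\s*[-=_*]{2,}\s*")


def normalize(text: str) -> list[str]:
    """Unify line endings, apply NFKC, and split into lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = unicodedata.normalize("NFKC", text)
    return text.split("\n")


def quote_depth(line: str) -> int:
    """Count the '>' markers at the start of a line."""
    match = QUOTE_PREFIX.match(line)
    return match.group(0).count(">") if match else 0


def is_japanese(ch: str) -> bool:
    """True for hiragana, katakana, and CJK unified ideographs."""
    return "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"


def line_features(lines: list[str], i: int) -> dict:
    """Build a small feature dict for line i (feature names are illustrative)."""
    line = lines[i]
    chars = [c for c in line if not c.isspace()]
    n = len(chars) or 1  # avoid division by zero on blank lines
    return {
        "rel_position": i / max(len(lines) - 1, 1),
        "quote_depth": quote_depth(line),
        "is_blank": not chars,
        "jp_ratio": sum(is_japanese(c) for c in chars) / n,
        "digit_ratio": sum(c.isdigit() for c in chars) / n,
        "is_delimiter": bool(DELIMITER.fullmatch(line)),
    }


lines = normalize("お世話になっております。\r\n> > 先日の件\r\n--\r\nＴＥＬ：０３")
for i in range(len(lines)):
    print(lines[i], line_features(lines, i))
```

Feature dicts of this shape are the usual input to CRF toolkits, which is why the sketch returns one dict per line.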

Label Scheme

Label      Description
GREETING   Opening (お世話になっております)
BODY       Main content
CLOSING    Closing (よろしくお願いいたします)
SIGNATURE  Sender information
QUOTE      Quoted content
OTHER      Separators, blank lines
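
Given one label per line, body assembly can be sketched roughly as follows. This is a simplified illustration, not yomail's implementation: `assemble_body` and its `keep` parameter are hypothetical, inline-quote handling is omitted, and the default set of kept labels is chosen only so that the result matches the Example section above.

```python
def assemble_body(lines, labels, keep=frozenset({"GREETING", "BODY", "CLOSING"})):
    """Join the lines whose label is in `keep`, stopping at the signature."""
    # Everything from the first SIGNATURE line onward is discarded.
    end = labels.index("SIGNATURE") if "SIGNATURE" in labels else len(lines)
    kept = []
    for line, label in zip(lines[:end], labels[:end]):
        if label in keep:
            kept.append(line)
        elif label == "OTHER" and not line.strip() and kept and kept[-1] != "":
            # Keep a single blank line as a paragraph break between kept lines.
            kept.append("")
    # Drop a trailing blank line left over before the signature block.
    while kept and kept[-1] == "":
        kept.pop()
    return "\n".join(kept)


lines = ["田中様", "", "お世話になっております。", "", "資料を添付いたします。", "", "--", "山田太郎"]
labels = ["OTHER", "OTHER", "GREETING", "OTHER", "BODY", "OTHER", "OTHER", "SIGNATURE"]
print(assemble_body(lines, labels))
```

Which labels count as body text is a policy choice, so the sketch exposes it as a parameter rather than hard-coding it.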

Performance

Evaluated on 19,642 synthetic test emails:

Metric           Value
Content match    97.9%
Acceptable rate  98.0%
Confident wrong  0.14%

See PERFORMANCE.md for details.

Exceptions

from yomail import (
    ExtractionError,      # Base class
    InvalidInputError,    # Empty or invalid input
    NoBodyDetectedError,  # No body found
    LowConfidenceError,   # Confidence below threshold
)

Configuration

extractor = EmailBodyExtractor(
    model_path="path/to/model.crfsuite",  # Custom model
    confidence_threshold=0.5,              # Minimum confidence
)

Development

# Setup
uv sync

# Run tests
uv run pytest

# Type check
uv run ty check

# Lint
uv run ruff check .

Training

Training data is generated by the yasumail project.

# Train model
python scripts/train.py data/training.jsonl -o models/email_body.crfsuite

# Evaluate
python scripts/evaluate.py data/test.jsonl
