
Extract body text from Japanese business emails


yomail (読メール)

yomail extracts body text from Japanese business emails. It uses a CRF (Conditional Random Field) model to classify each line, then assembles the body from labeled lines.

Features

  • Handles formal and informal Japanese business emails
  • Detects and excludes signatures, greetings, closings, and quoted content
  • Works with forwarded emails, replies, and inline quotes
  • Returns confidence scores for quality control
  • Small model size (12 KB)
  • Fast inference (~10-30 ms)

Installation

pip install yomail

Requires Python 3.12+.

Usage

from yomail import EmailBodyExtractor

extractor = EmailBodyExtractor()

# email_text holds the email text to process (defined elsewhere)
# Raises on failure
body = extractor.extract(email_text)

# Returns None on failure
body = extractor.extract_safe(email_text)

# Full result with metadata
result = extractor.extract_with_metadata(email_text)
print(result.body)
print(result.confidence)
print(result.signature_detected)

Example

Input:

株式会社サンプル
田中様

お世話になっております。
山田です。

先日ご依頼いただいた資料を添付いたします。
ご確認のほどよろしくお願いいたします。

以上

--
山田太郎
株式会社テスト
TEL: 03-1234-5678

Output:

お世話になっております。
山田です。

先日ご依頼いただいた資料を添付いたします。
ご確認のほどよろしくお願いいたします。

以上

How It Works

The extraction pipeline:

  1. Normalize — Line endings, neologdn normalization, NFKC
  2. Analyze structure — Quote depth, forward/reply headers, delimiters
  3. Extract features — Position, character ratios, pattern matches
  4. Label with CRF — GREETING, BODY, CLOSING, SIGNATURE, QUOTE, OTHER
  5. Assemble body — Find signature boundary, handle inline quotes, merge blocks

See ARCHITECTURE.md for details.
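
As a rough illustration of steps 1 to 3, the sketch below uses only the standard library. It is not yomail's implementation: the function names, the feature names, and the delimiter set are assumptions made for this example, and the neologdn step is left out so the code has no third-party dependencies.

```python
import re
import unicodedata

# Leading '>' markers, optionally separated by whitespace (assumed quote style).
QUOTE_PREFIX = re.compile(r"^(?:\s*>)+")


def normalize(text: str) -> list[str]:
    """Unify line endings, apply NFKC, and split into lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = unicodedata.normalize("NFKC", text)
    return text.split("\n")


def quote_depth(line: str) -> int:
    """Count leading '>' markers, so '> > hi' has depth 2."""
    match = QUOTE_PREFIX.match(line)
    return match.group(0).count(">") if match else 0


def line_features(lines: list[str]) -> list[dict[str, float]]:
    """Build one small feature dict per line (hypothetical feature set)."""
    last_index = max(len(lines) - 1, 1)
    features = []
    for index, line in enumerate(lines):
        stripped = line.strip()
        digits = sum(ch.isdigit() for ch in stripped)
        features.append({
            "position": index / last_index,            # 0.0 = first line, 1.0 = last
            "quote_depth": float(quote_depth(line)),
            "digit_ratio": digits / len(stripped) if stripped else 0.0,
            "is_delimiter": float(stripped in {"--", "---"}),  # assumed delimiters
        })
    return features


lines = normalize("ＴＥＬ\r\n> > hi\r--")
print(lines)
print(line_features(lines))
```

Per-line feature dicts of this shape are the usual input format for a CRF tagger, which then assigns one label to each line.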

Label Scheme

Label      Description
GREETING   Opening (お世話になっております)
BODY       Main content
CLOSING    Closing (よろしくお願いいたします)
SIGNATURE  Sender information
QUOTE      Quoted content
OTHER      Separators, blank lines
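
To make the labels concrete, here is a minimal sketch of how a body could be assembled from lines that already carry labels. The rule is a simplification chosen for this example, not yomail's actual assembly logic: it keeps GREETING, BODY, and CLOSING lines (matching the Example section, where the greeting and closing appear in the output), keeps single blank lines between them, and stops at the first SIGNATURE line. The function name `assemble_body` is hypothetical.

```python
def assemble_body(labeled_lines: list[tuple[str, str]]) -> str:
    """Join kept lines into a body; stops at the first SIGNATURE line.

    Assumed rule for illustration only, not yomail's real behaviour.
    """
    kept: list[str] = []
    for text, label in labeled_lines:
        if label == "SIGNATURE":
            break
        if label in {"GREETING", "BODY", "CLOSING"}:
            kept.append(text)
        elif label == "OTHER" and text.strip() == "" and kept and kept[-1] != "":
            # Preserve one blank line as a paragraph break.
            kept.append("")
    # Drop a trailing blank line left before the signature.
    while kept and kept[-1] == "":
        kept.pop()
    return "\n".join(kept)


labeled = [
    ("お世話になっております。", "GREETING"),
    ("", "OTHER"),
    ("先日ご依頼いただいた資料を添付いたします。", "BODY"),
    ("", "OTHER"),
    ("> 前回のメール", "QUOTE"),
    ("--", "OTHER"),
    ("山田太郎", "SIGNATURE"),
]
print(assemble_body(labeled))
```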

Performance

Evaluated on 19,642 synthetic test emails:

Metric           Value
Content match    97.9%
Acceptable rate  98.0%
Confident wrong  0.14%

See PERFORMANCE.md for details.
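
The exact metric definitions live in PERFORMANCE.md. As a sketch under assumed definitions only, the code below treats "content match" as an exact string match between the extracted and expected body, and "confident wrong" as a mismatch whose confidence is at or above a cutoff. The `EvalRecord` type, the `summarize` function, and the 0.5 cutoff are hypothetical and may differ from how the reported numbers were computed.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    predicted: str      # body returned by the extractor
    expected: str       # reference body from the test set
    confidence: float   # confidence reported for this extraction


def summarize(records: list[EvalRecord], confident_at: float = 0.5) -> dict[str, float]:
    """Compute rates under the assumed definitions described above."""
    if not records:
        raise ValueError("no records to summarize")
    total = len(records)
    matches = sum(r.predicted == r.expected for r in records)
    confident_wrong = sum(
        r.predicted != r.expected and r.confidence >= confident_at
        for r in records
    )
    return {
        "content_match": matches / total,
        "confident_wrong": confident_wrong / total,
    }
```

Tracking confident mistakes separately is useful because they are the errors a confidence threshold cannot filter out.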

Exceptions

from yomail import (
    ExtractionError,      # Base class
    InvalidInputError,    # Empty or invalid input
    NoBodyDetectedError,  # No body found
    LowConfidenceError,   # Confidence below threshold
)

Configuration

extractor = EmailBodyExtractor(
    model_path="path/to/model.crfsuite",  # Custom model
    confidence_threshold=0.5,              # Minimum confidence
)
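
Beyond the built-in threshold, the confidence score can also drive a review workflow in calling code. The sketch below is one possible pattern, not part of yomail: `route` and the 0.9 cutoff are hypothetical, it assumes confidence is a float between 0 and 1, and `StubResult` stands in for the object returned by `extract_with_metadata` so the example runs without the package installed.

```python
from dataclasses import dataclass


@dataclass
class StubResult:
    """Stand-in with the two attributes shown in the Usage section."""
    body: str
    confidence: float


def route(result, auto_accept_at: float = 0.9) -> str:
    """Return 'accept' for high-confidence results, otherwise 'review'.

    Works with any object that has a `confidence` attribute.
    """
    return "accept" if result.confidence >= auto_accept_at else "review"


print(route(StubResult(body="本文", confidence=0.97)))
print(route(StubResult(body="本文", confidence=0.62)))
```

A second, stricter cutoff like this keeps borderline extractions out of automated flows while still returning a body for a human to check.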

Development

# Setup
uv sync

# Run tests
uv run pytest

# Type check
uv run ty check

# Lint
uv run ruff check .

Training

Training data is generated by the yasumail project.

# Train model
python scripts/train.py data/training.jsonl -o models/email_body.crfsuite

# Evaluate
python scripts/evaluate.py data/test.jsonl

Dependencies



