N-gram Punctuator

An N-gram based punctuation restoration tool that automatically adds punctuation marks to unpunctuated text.

Features

  • Restores punctuation marks to unpunctuated text using N-gram language models
  • Supports multiple punctuation marks, including ! ' , . ? and Chinese marks such as 。
  • Configurable N-gram order (3-gram to 6-gram)
  • Beam search over candidate punctuation placements
  • Command-line interface for easy usage
  • Support for both English and Chinese text

Installation

pip install ngram-punctuator

Usage

Command Line Interface

# Basic usage
ngram-punctuator "how are you"
# >>> how are you?

# Specify N-gram order (3, 4, 5, or 6)
ngram-punctuator --order 4 "你好吗"
# >>> 你好吗?

# Adjust beam size for better accuracy (higher values may improve results but slow down processing)
ngram-punctuator --beam-size 10 "Artificial intelligence is changing our daily lives in many ways from smart home devices to personalized recommendations it makes technology more convenient and efficient"
# >>> Artificial intelligence is changing our daily lives, in many ways, from smart home devices to personalized recommendations, it makes technology more convenient and efficient.

# Limit the maximum number of punctuation marks to add
ngram-punctuator --max-puncts 8 "中华文明有着五千年的悠久历史从夏商周到秦汉唐宋元明清每个朝代都留下了丰富的文化遗产长城故宫兵马俑敦煌莫高窟这些都是中华民族的宝贵财富值得我们好好保护和传承"
# >>> 中华文明有着五千年的悠久历史,从夏商周到秦汉唐宋元明清,每个朝代都留下了丰富的文化遗产长城故宫兵马俑,敦煌莫高窟,这些都是中华民族的宝贵财富,值得我们好好保护和传承。

# Adjust perplexity drop ratio for more conservative punctuation
ngram-punctuator --ppl-drop-ratio 0.1 "这个new feature的UI设计需要optimize一下user experience特别是mobile端的responsive design要考虑cross platform compatibility还有API的integration问题我们要做AB testing来validate hypothesis"
# >>> 这个 new feature 的 UI 设计需要 optimize 一下 user experience, 特别是 mobile 端的 responsive design, 要考虑 cross platform compatibility, 还有 API 的 integration 问题,我们要做 AB testing, 来 validate hypothesis.

Python API

from ngram_punctuator import Punctuator

# Initialize punctuator with default settings (3-gram model)
punctuator = Punctuator()

# Add punctuation to text
text = "The sun sets slowly over the calm blue ocean"
result = punctuator.predict(text)
print(result)  # Output: "The sun sets, slowly over the calm blue ocean."

# Initialize with specific N-gram order
punctuator = Punctuator(order=4)

# Advanced usage with parameters
result = punctuator.predict(
    text="人工智能技术正在深刻改变我们的生活方式从智能手机到自动驾驶汽车从医疗诊断到金融风控AI的应用已经渗透到各个领域",
    beam_size=10,
    max_puncts=5,
    ppl_drop_ratio=0.15
)
print(result)  # Output: "人工智能技术,正在深刻改变我们的生活方式。从智能手机到自动驾驶汽车从医疗诊断到金融风控 AI 的应用,已经渗透到各个领域。"
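
When many lines need punctuating, it usually makes sense to create one punctuator and reuse it. The helper below is a minimal sketch of that pattern. `punctuate_lines` and `SupportsPredict` are hypothetical names introduced here for illustration and are not part of the package; the only thing assumed about the punctuator is the `predict(text)` call shown above.

```python
from typing import Iterable, List, Protocol


class SupportsPredict(Protocol):
    """Any object with a predict(text) method, such as the Punctuator above."""

    def predict(self, text: str) -> str: ...


def punctuate_lines(punctuator: SupportsPredict, lines: Iterable[str]) -> List[str]:
    """Punctuate each non-empty line, reusing a single punctuator instance."""
    results = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines instead of calling predict on empty text
        results.append(punctuator.predict(stripped))
    return results
```

With the package installed, this could be called as `punctuate_lines(Punctuator(order=4), open("transcript.txt", encoding="utf-8"))`, where `transcript.txt` is a placeholder file name.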

How It Works

The N-gram Punctuator uses statistical language models to determine the most likely positions for punctuation marks in unpunctuated text:

  1. Text Preprocessing: The input text is tokenized using a BPE (Byte Pair Encoding) tokenizer
  2. N-gram Perplexity: N-gram language models calculate perplexity for different punctuation placement possibilities
  3. Beam Search: A beam search algorithm explores multiple punctuation placement options
  4. Optimization: The system selects the punctuation arrangement with the lowest perplexity
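
To make step 2 concrete, here is a toy perplexity calculation. It is a sketch under stated assumptions, not the package's model: it uses a bigram model with add-one smoothing trained on two sentences, whereas the package uses 3-gram to 6-gram models whose smoothing and storage format are not described here. `train_bigram` and `perplexity` are illustrative names.

```python
import math
from collections import Counter
from typing import List, Tuple


def train_bigram(corpus: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def perplexity(tokens: List[str], unigrams: Counter, bigrams: Counter) -> float:
    """Perplexity = exp(-mean log-probability) under add-one smoothing (assumed)."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(padded) - 1))


corpus = [["how", "are", "you", "?"], ["how", "are", "they", "?"]]
unigrams, bigrams = train_bigram(corpus)
with_mark = perplexity(["how", "are", "you", "?"], unigrams, bigrams)
without_mark = perplexity(["how", "are", "you"], unigrams, bigrams)
print(with_mark < without_mark)  # the punctuated sequence scores lower here
```

In this toy setup the punctuated sequence has lower perplexity because the training sentences always end with a question mark, which is the signal a punctuation restorer looks for.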

The models are trained on large text corpora and cover both English and Chinese text.
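
The beam search in step 3 can be sketched as follows. This is an illustration under assumptions, not the package's implementation: the scoring function is a hand-written cost table standing in for N-gram perplexity, and `beam_search_punctuate`, `toy_cost`, and `LIKELY` are hypothetical names. After each token, the search considers inserting no mark or one of the candidate marks, then keeps only the `beam_size` cheapest partial results.

```python
from typing import Callable, List, Sequence, Tuple


def beam_search_punctuate(
    tokens: Sequence[str],
    marks: Sequence[str],
    cost: Callable[[List[str]], float],
    beam_size: int = 5,
) -> List[str]:
    """Return the lowest-cost sequence found by inserting marks after tokens."""
    beam: List[Tuple[float, List[str]]] = [(0.0, [])]
    for token in tokens:
        candidates = []
        for _, prefix in beam:
            base = prefix + [token]
            # Either leave the gap after this token empty or fill it with one mark.
            for extension in [base] + [base + [mark] for mark in marks]:
                candidates.append((cost(extension), extension))
        # Prune: keep only the beam_size cheapest partial hypotheses.
        candidates.sort(key=lambda item: item[0])
        beam = candidates[:beam_size]
    return beam[0][1]


# Toy stand-in for a language model: listed pairs are cheap, all others expensive.
LIKELY = {("how", "are"), ("are", "you"), ("you", "?"), ("?", "</s>")}


def toy_cost(seq: List[str]) -> float:
    padded = seq + ["</s>"]  # score each hypothesis as if the text ended here
    return sum(0.1 if pair in LIKELY else 1.0 for pair in zip(padded, padded[1:]))


best = beam_search_punctuate(["how", "are", "you"], ["?", ","], toy_cost, beam_size=2)
print(best)  # ['how', 'are', 'you', '?']
```

A larger beam keeps more alternatives alive at each step, which is why a higher beam size can improve results at the price of more scoring calls.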

Parameters

  • order: N-gram order (3, 4, 5, or 6; the default is 3). Higher orders may capture more context but require more computational resources.
  • beam_size: Number of candidates to keep during beam search. Larger values may improve accuracy but slow down processing.
  • max_puncts: Maximum number of punctuation marks to insert. If not specified, it defaults to 1/4 of the text length.
  • ppl_drop_ratio: Minimum perplexity drop ratio (between 0.0 and 1.0). Higher values make the system more conservative in adding punctuation.
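
One plausible reading of `max_puncts` and `ppl_drop_ratio` is sketched below. Both functions are assumptions made for illustration: the descriptions above do not say whether text length is counted in characters or tokens, how the quarter is rounded, or exactly how the drop ratio is computed. Here a mark is accepted only when the relative fall in perplexity reaches the threshold.

```python
def default_max_puncts(text_length: int) -> int:
    """One quarter of the text length, rounded down (the rounding is assumed)."""
    return text_length // 4


def accept_insertion(ppl_before: float, ppl_after: float, ppl_drop_ratio: float) -> bool:
    """Accept a mark only if perplexity falls by at least the given fraction."""
    if not 0.0 <= ppl_drop_ratio <= 1.0:
        raise ValueError("ppl_drop_ratio must be between 0.0 and 1.0")
    relative_drop = (ppl_before - ppl_after) / ppl_before
    return relative_drop >= ppl_drop_ratio


print(default_max_puncts(40))               # 10
print(accept_insertion(100.0, 80.0, 0.15))  # True: a 20% drop clears the 15% bar
print(accept_insertion(100.0, 90.0, 0.15))  # False: a 10% drop does not
```

Under this reading, raising the ratio rejects more candidate marks, which matches the description of higher values being more conservative.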

License

LICENSE
