Georgian Language Hyphenation Library v2.2.7 - Preserves compound word hyphens

Georgian Hyphenation

Georgian Language Hyphenation Library - Fast, accurate syllabification for Georgian (ქართული) text with support for Python 3.7+.

Features

  • Accurate Georgian syllabification based on phonetic rules
  • Harmonic consonant cluster recognition (ბრ, გრ, კრ, etc.)
  • Gemination handling (double consonant splitting)
  • Exception dictionary for irregular words (148 words)
  • HTML-aware hyphenation - preserves tags and code blocks (new in v2.2.7)
  • 17+ utility functions for advanced text processing (new in v2.2.7)
  • Configurable settings - adjust margins and hyphen character (new in v2.2.7)
  • Method chaining support (new in v2.2.7)
  • Zero dependencies
  • Lightweight and fast
  • Type hints for better IDE support

Installation

pip install georgian-hyphenation

Quick Start

from georgian_hyphenation import GeorgianHyphenator

# Create hyphenator instance
hyphenator = GeorgianHyphenator()

# Basic hyphenation
result = hyphenator.hyphenate('საქართველო')
print(result)  # სა­ქარ­თვე­ლო

# Get syllables as a list
syllables = hyphenator.get_syllables('თბილისი')
print(syllables)  # ['თბი', 'ლი', 'სი']

# NEW in v2.2.7: Count syllables
count = hyphenator.count_syllables('გამარჯობა')
print(count)  # 4

# NEW in v2.2.7: Hyphenate HTML
html = '<p>ქართული ენა <code>console.log()</code></p>'
result = hyphenator.hyphenate_html(html)
# Code tags are preserved!

# NEW in v2.2.7: Method chaining
hyphenator = (GeorgianHyphenator()
              .set_left_min(3)
              .set_right_min(3)
              .set_hyphen_char('-'))

Core Methods

hyphenate(word: str) -> str

Hyphenate a single Georgian word.

hyphenator = GeorgianHyphenator()
result = hyphenator.hyphenate('კომპიუტერი')
print(result)  # კომ­პი­უ­ტე­რი

get_syllables(word: str) -> List[str]

Get syllables as a list.

syllables = hyphenator.get_syllables('განათლება')
print(syllables)  # ['გა', 'ნათ', 'ლე', 'ბა']

hyphenate_text(text: str) -> str

Hyphenate all Georgian words in text.

text = 'საქართველო არის ლამაზი ქვეყანა'
result = hyphenator.hyphenate_text(text)
print(result)  # სა­ქარ­თვე­ლო არის ლა­მა­ზი ქვე­ყა­ნა

New in v2.2.7: Utility Functions

count_syllables(word: str) -> int

Count the number of syllables in a word.

count = hyphenator.count_syllables('გამარჯობა')
print(count)  # 4

get_hyphenation_points(word: str) -> int

Get the number of hyphenation points (hyphens) in a word.

points = hyphenator.get_hyphenation_points('გამარჯობა')
print(points)  # 3 (four syllables = three hyphens)
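
The two counts above are related: assuming each syllable is built around exactly one vowel, the syllable count equals the vowel count, and the number of break points is one less. The sketch below illustrates that relationship on its own, without the library. `count_vowel_nuclei` and `break_point_count` are hypothetical names, not part of the package, and the real methods may also take the margin settings into account.

```python
# Illustrative sketch only; not the library's implementation.
GEORGIAN_VOWELS = set("აეიოუ")

def count_vowel_nuclei(word: str) -> int:
    """Count Georgian vowels; each one is assumed to anchor one syllable."""
    return sum(1 for ch in word if ch in GEORGIAN_VOWELS)

def break_point_count(word: str) -> int:
    """Boundaries between syllables: one fewer than the syllable count."""
    return max(count_vowel_nuclei(word) - 1, 0)

print(count_vowel_nuclei("გამარჯობა"))  # 4
print(break_point_count("გამარჯობა"))   # 3
```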

is_georgian(text: str) -> bool

Check if text contains only Georgian characters.

print(hyphenator.is_georgian('გამარჯობა'))  # True
print(hyphenator.is_georgian('hello'))       # False
print(hyphenator.is_georgian('გამარჯობა123'))  # False
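
A check like this can be approximated with a Unicode range test. The sketch below assumes "Georgian characters" means code points U+10D0 to U+10FF (the Mkhedruli part of the Unicode Georgian block). `looks_georgian` is a hypothetical helper, and how the library itself treats whitespace, punctuation, or empty input is not stated here.

```python
# Illustrative sketch only; the library's own rules may differ.
def looks_georgian(text: str) -> bool:
    """True if text is non-empty and every character is in U+10D0..U+10FF."""
    return bool(text) and all("\u10D0" <= ch <= "\u10FF" for ch in text)

print(looks_georgian("გამარჯობა"))     # True
print(looks_georgian("hello"))         # False
print(looks_georgian("გამარჯობა123"))  # False
```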

can_hyphenate(word: str) -> bool

Check if a word meets minimum length requirements.

print(hyphenator.can_hyphenate('გა'))     # False (too short)
print(hyphenator.can_hyphenate('გამარ'))  # True

unhyphenate(text: str) -> str

Remove all hyphenation from text.

hyphenated = hyphenator.hyphenate('გამარჯობა')
clean = hyphenator.unhyphenate(hyphenated)
print(clean)  # გამარჯობა
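
Since the default hyphen character is the soft hyphen (U+00AD), removing hyphenation can be pictured as deleting that one character. The sketch below shows the idea with a hypothetical `strip_soft_hyphens` helper. It leaves regular hyphens alone, in line with the v2.2.6 note about preserving compound-word hyphens, but whether the library's `unhyphenate` behaves exactly this way is an assumption.

```python
# Illustrative sketch only; not the library's implementation.
SOFT_HYPHEN = "\u00AD"

def strip_soft_hyphens(text: str, hyphen_char: str = SOFT_HYPHEN) -> str:
    """Remove every occurrence of the hyphenation character from text."""
    return text.replace(hyphen_char, "")

hyphenated = "გა\u00ADმარ\u00ADჯო\u00ADბა"
print(strip_soft_hyphens(hyphenated))  # გამარჯობა
```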

hyphenate_words(words: List[str]) -> List[str]

Hyphenate multiple words at once (batch processing).

words = ['ქართული', 'ენა', 'მშვენიერია']
result = hyphenator.hyphenate_words(words)
print(result)  # ['ქარ­თუ­ლი', 'ე­ნა', 'მშვე­ნი­ე­რია']

hyphenate_html(html: str) -> str ⭐ Most Useful!

Hyphenate HTML content while preserving tags and skipping code blocks.

html = '''
  <article>
    <h1>ქართული ენა</h1>
    <p>პროგრამირება და კომპიუტერული მეცნიერება</p>
    <code>console.log('skip me')</code>
    <pre>this won't be hyphenated</pre>
  </article>
'''

result = hyphenator.hyphenate_html(html)
# Text outside the skipped tags (here <h1> and <p>) gets hyphenated
# <code>, <pre>, <script>, <style>, <textarea> are preserved
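
To make the tag-skipping behavior concrete, here is a minimal sketch of one way to transform only the text outside `<code>`, `<pre>`, `<script>`, `<style>` and `<textarea>`, using the standard-library `html.parser`. It is not the library's implementation: `SkippingTransformer` and `transform_html` are hypothetical names, and the sketch drops doctype declarations and does not re-escape anything.

```python
# Illustrative sketch only; not how hyphenate_html is implemented.
from html.parser import HTMLParser
from typing import Callable, List

SKIP_TAGS = {"code", "pre", "script", "style", "textarea"}

class SkippingTransformer(HTMLParser):
    """Re-emit HTML, applying `transform` only to text outside SKIP_TAGS."""

    def __init__(self, transform: Callable[[str], str]) -> None:
        super().__init__(convert_charrefs=False)  # keep entities untouched
        self.transform = transform
        self.skip_depth = 0          # > 0 while inside a skipped element
        self.out: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data if self.skip_depth else self.transform(data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

def transform_html(html: str, transform: Callable[[str], str]) -> str:
    parser = SkippingTransformer(transform)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

# str.upper stands in for a real text transform so the effect is visible.
print(transform_html("<p>abc <code>xyz</code></p>", str.upper))
# <p>ABC <code>xyz</code></p>
```

In real use the `transform` argument could be a function such as `hyphenator.hyphenate_text`, so that only visible prose is touched.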

New in v2.2.7: Configuration Methods

All configuration methods support method chaining:

set_left_min(value: int) -> GeorgianHyphenator

Set minimum characters before the first hyphen (default: 2).

hyphenator.set_left_min(3)
# Now requires at least 3 characters before first hyphen

set_right_min(value: int) -> GeorgianHyphenator

Set minimum characters after the last hyphen (default: 2).

hyphenator.set_right_min(3)
# Now requires at least 3 characters after last hyphen
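
One way to picture the two margins is as a filter on break points: a break is kept only if enough characters lie before it and after it. The sketch below applies that filter to an already syllabified word. `join_with_margins` is a hypothetical helper, and the assumption that margins are counted in characters from the word edges follows the descriptions above rather than the library's source.

```python
# Illustrative sketch only; the library's exact margin rules may differ.
from typing import List

def join_with_margins(syllables: List[str], left_min: int = 2,
                      right_min: int = 2, hyphen: str = "-") -> str:
    """Join syllables, keeping only breaks that respect both margins."""
    if not syllables:
        return ""
    total = sum(len(s) for s in syllables)
    parts = [syllables[0]]
    offset = len(syllables[0])  # characters before the candidate break
    for syllable in syllables[1:]:
        if offset >= left_min and total - offset >= right_min:
            parts.append(hyphen)
        parts.append(syllable)
        offset += len(syllable)
    return "".join(parts)

syllables = ["გა", "მარ", "ჯო", "ბა"]
print(join_with_margins(syllables))        # გა-მარ-ჯო-ბა
print(join_with_margins(syllables, 3, 3))  # გამარ-ჯობა
```

With both margins set to 3, the first break (2 characters in) and the last break (2 characters from the end) are dropped, leaving a single break in the middle.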

set_hyphen_char(char: str) -> GeorgianHyphenator

Change the hyphen character.

# Use visible hyphen for debugging
hyphenator.set_hyphen_char('-')
print(hyphenator.hyphenate('გამარჯობა'))
# Output: გა-მარ-ჯო-ბა

# Use custom separator
hyphenator.set_hyphen_char('•')
print(hyphenator.hyphenate('საქართველო'))
# Output: სა•ქარ•თვე•ლო

Method Chaining

hyphenator = (GeorgianHyphenator()
              .set_left_min(3)
              .set_right_min(3)
              .set_hyphen_char('-'))

print(hyphenator.hyphenate('გამარჯობა'))
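
Chaining works when every setter returns the object it was called on. The sketch below shows the pattern with a hypothetical `ChainableSettings` class; it mirrors the setter names and defaults documented above but is not the library's class.

```python
# Illustrative sketch of the chaining pattern; not the library's class.
class ChainableSettings:
    def __init__(self) -> None:
        self.left_min = 2
        self.right_min = 2
        self.hyphen_char = "\u00AD"

    def set_left_min(self, value: int) -> "ChainableSettings":
        self.left_min = value
        return self  # returning self is what makes chaining possible

    def set_right_min(self, value: int) -> "ChainableSettings":
        self.right_min = value
        return self

    def set_hyphen_char(self, char: str) -> "ChainableSettings":
        self.hyphen_char = char
        return self

settings = ChainableSettings().set_left_min(3).set_right_min(3).set_hyphen_char("-")
print(settings.left_min, settings.right_min, settings.hyphen_char)  # 3 3 -
```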

New in v2.2.7: Dictionary Management

load_library(data: Dict[str, str]) -> None

Load custom exception dictionary.

custom_words = {
    'განათლება': 'გა-ნათ-ლე-ბა',
    'უნივერსიტეტი': 'უ-ნი-ვერ-სი-ტე-ტი'
}

hyphenator.load_library(custom_words)

load_default_library() -> None

Load the built-in exception dictionary (148 words).

hyphenator.load_default_library()
# Dictionary loaded with tech terms, places, political terms

add_exception(word: str, hyphenated: str) -> GeorgianHyphenator

Add a single custom hyphenation exception.

hyphenator.add_exception('ტესტი', 'ტეს-ტი')

print(hyphenator.hyphenate('ტესტი'))
# Returns: ტეს­ტი (uses your custom hyphenation)
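
The example above suggests a two-step flow: look the word up in the exception dictionary, and if it is there, swap the `-` markers for the configured hyphen character. The sketch below shows that flow with a hypothetical `hyphenate_with_exceptions` function and a caller-supplied `fallback`; the library's internal lookup order and normalization are assumptions.

```python
# Illustrative sketch only; not the library's implementation.
from typing import Callable, Dict

SOFT_HYPHEN = "\u00AD"

def hyphenate_with_exceptions(word: str,
                              exceptions: Dict[str, str],
                              fallback: Callable[[str], str],
                              hyphen_char: str = SOFT_HYPHEN) -> str:
    """Use the dictionary entry when present, otherwise call fallback."""
    entry = exceptions.get(word)
    if entry is not None:
        return entry.replace("-", hyphen_char)
    return fallback(word)

exceptions = {"ტესტი": "ტეს-ტი"}
# The identity function stands in for the rule-based algorithm here.
print(hyphenate_with_exceptions("ტესტი", exceptions, lambda w: w, "-"))  # ტეს-ტი
```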

remove_exception(word: str) -> bool

Remove an exception from the dictionary.

removed = hyphenator.remove_exception('ტესტი')
print(removed)  # True if word was removed

export_dictionary() -> Dict[str, str]

Export the entire dictionary as a dict.

dict_data = hyphenator.export_dictionary()
print(dict_data)
# {'გამარჯობა': 'გა-მარ-ჯო-ბა', ...}
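
Because the exported value is a plain `Dict[str, str]`, it can be saved to disk and later passed back to `load_library`. The sketch below shows one way to do that with JSON. `save_exceptions` and `read_exceptions` are hypothetical helpers, and the file name used with them is up to the caller.

```python
# Illustrative sketch; works on any word -> hyphenated-form mapping.
import json
from pathlib import Path
from typing import Dict

def save_exceptions(path: str, data: Dict[str, str]) -> None:
    """Write the mapping as UTF-8 JSON, keeping Georgian text readable."""
    text = json.dumps(data, ensure_ascii=False, indent=2, sort_keys=True)
    Path(path).write_text(text, encoding="utf-8")

def read_exceptions(path: str) -> Dict[str, str]:
    """Read the mapping back and check that it has the expected shape."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict) or not all(
        isinstance(k, str) and isinstance(v, str) for k, v in data.items()
    ):
        raise ValueError("expected a JSON object mapping words to hyphenated forms")
    return data
```

A round trip would then look like `save_exceptions('exceptions.json', hyphenator.export_dictionary())` followed by `hyphenator.load_library(read_exceptions('exceptions.json'))`.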

get_dictionary_size() -> int

Get the number of words in the dictionary.

hyphenator.load_default_library()
print(hyphenator.get_dictionary_size())
# Output: 148

New in v2.2.7: Advanced Features

Harmonic Cluster Management

For advanced users who need to customize consonant cluster recognition:

# Add a custom harmonic cluster
hyphenator.add_harmonic_cluster('ტვ')

# Remove a cluster
hyphenator.remove_harmonic_cluster('ტვ')

# Get all clusters
clusters = hyphenator.get_harmonic_clusters()
print(clusters)
# ['ბლ', 'ბრ', 'ბღ', ... (70+ clusters)]
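
The three cluster methods behave like operations on a set. The sketch below models them with a hypothetical `ClusterRegistry` class: adding is chainable, removing reports whether anything was removed, and listing returns a sorted list. The return types follow the API reference in this document; the storage details are an assumption.

```python
# Illustrative sketch only; not the library's internal data structure.
from typing import Iterable, List

class ClusterRegistry:
    def __init__(self, clusters: Iterable[str] = ()) -> None:
        self._clusters = set(clusters)

    def add(self, cluster: str) -> "ClusterRegistry":
        self._clusters.add(cluster)
        return self

    def remove(self, cluster: str) -> bool:
        """Return True if the cluster was present and has been removed."""
        if cluster in self._clusters:
            self._clusters.remove(cluster)
            return True
        return False

    def __contains__(self, cluster: str) -> bool:
        return cluster in self._clusters  # average O(1) membership test

    def as_list(self) -> List[str]:
        return sorted(self._clusters)

registry = ClusterRegistry(["ბრ", "გრ"]).add("ტვ")
print("ტვ" in registry)       # True
print(registry.remove("ტვ"))  # True
print(registry.remove("ტვ"))  # False
```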

Custom Hyphen Character

# Use visible hyphen instead of soft hyphen
hyphenator = GeorgianHyphenator(hyphen_char='-')
print(hyphenator.hyphenate('საქართველო'))
# Output: სა-ქარ-თვე-ლო

# Use custom separator
hyphenator = GeorgianHyphenator(hyphen_char='•')
print(hyphenator.hyphenate('საქართველო'))
# Output: სა•ქარ•თვე•ლო

Built-in Dictionary

The library ships with 148 pre-hyphenated words, including:

Tech Terms: კომპიუტერი, ფეისბუქი, იუთუბი, ინსტაგრამი
Places: საქართველო, თბილისი
Political: პარლამენტი, დემოკრატია, რესპუბლიკა
Compound Words: სახელმწიფო, გულმავიწყი, თავდადებული

hyphenator.load_default_library()
print(hyphenator.hyphenate('კომპიუტერი'))
# Uses dictionary: კომ­პიუ­ტე­რი

Convenience Functions

For quick one-off usage without creating an instance:

from georgian_hyphenation import hyphenate, get_syllables, hyphenate_text

# Quick hyphenation
print(hyphenate('საქართველო'))

# Quick syllable extraction
print(get_syllables('თბილისი'))

# Quick text hyphenation
print(hyphenate_text('ეს არის ტექსტი'))

Export Formats

TeX Pattern Format

from georgian_hyphenation import to_tex_pattern

pattern = to_tex_pattern('საქართველო')
print(pattern)  # .სა1ქარ1თვე1ლო.

Hunspell Format

from georgian_hyphenation import to_hunspell_format

hunspell = to_hunspell_format('საქართველო')
print(hunspell)  # სა=ქარ=თვე=ლო
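
Judging from the two outputs above, both formats are simple joins over the syllable list: TeX patterns use `1` between syllables with a `.` at each end, and the Hunspell form uses `=`. The sketch below reproduces those two strings from a syllable list. `syllables_to_tex` and `syllables_to_hunspell` are hypothetical helpers, not the library's functions.

```python
# Illustrative sketch based on the example outputs shown above.
from typing import List

def syllables_to_tex(syllables: List[str]) -> str:
    """Join with '1' and wrap in word-boundary dots."""
    return "." + "1".join(syllables) + "."

def syllables_to_hunspell(syllables: List[str]) -> str:
    """Join with '=' between syllables."""
    return "=".join(syllables)

syllables = ["სა", "ქარ", "თვე", "ლო"]
print(syllables_to_tex(syllables))       # .სა1ქარ1თვე1ლო.
print(syllables_to_hunspell(syllables))  # სა=ქარ=თვე=ლო
```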

Use Cases & Examples

E-book Generator

from georgian_hyphenation import GeorgianHyphenator

hyphenator = GeorgianHyphenator()
hyphenator.load_default_library()

def format_for_ebook(paragraphs):
    formatted = []
    for paragraph in paragraphs:
        formatted.append(hyphenator.hyphenate_text(paragraph))
    return '\n\n'.join(formatted)

Blog/CMS Integration

from georgian_hyphenation import GeorgianHyphenator

hyphenator = GeorgianHyphenator()
hyphenator.load_default_library()

def process_article(html_content):
    """Process article HTML for better typography"""
    return hyphenator.hyphenate_html(html_content)

Form Validation

from georgian_hyphenation import GeorgianHyphenator

hyphenator = GeorgianHyphenator()

def validate_georgian_input(text):
    if not hyphenator.is_georgian(text):
        # Message: "Please enter only Georgian text"
        raise ValueError('გთხოვთ შეიყვანოთ მხოლოდ ქართული ტექსტი')
    return True

Syllable Counter

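from georgian_hyphenation import GeorgianHyphenator

def count_syllables_in_text(text):
    """Sum the syllable counts of every word in a text."""
    hyphenator = GeorgianHyphenator()
    total = 0

    for word in text.split():
        # Drop punctuation and digits so only letters reach the hyphenator
        clean_word = ''.join(c for c in word if c.isalpha())
        if clean_word:
            total += hyphenator.count_syllables(clean_word)

    return total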

text = "საქართველო არის ლამაზი ქვეყანა"
print(f"Total syllables: {count_syllables_in_text(text)}")

Poetry Analyzer

from georgian_hyphenation import GeorgianHyphenator

def analyze_verse(line):
    """Analyze syllable structure of Georgian poetry"""
    hyphenator = GeorgianHyphenator('-')
    words = line.split()
    
    analysis = []
    for word in words:
        syllables = hyphenator.get_syllables(word)
        analysis.append({
            'word': word,
            'syllables': syllables,
            'count': len(syllables)
        })
    
    return analysis

verse = "მთვარე ანათებს ცისკარზე"
print(analyze_verse(verse))

Algorithm

The library uses a phonetic algorithm based on Georgian syllable structure:

Rules Applied:

  1. Vowel Detection: Identifies Georgian vowels (ა, ე, ი, ო, უ)
  2. Consonant Cluster Analysis: Recognizes 70+ harmonic clusters
  3. Gemination Rules: Splits double consonants (კკ → კ­კ)
  4. Orphan Prevention: Ensures minimum syllable length (2 characters by default)
  5. Dictionary Lookup: Checks exceptions first for accuracy

Supported Harmonic Clusters:

ბლ, ბრ, ბღ, ბზ, გდ, გლ, გმ, გნ, გვ, გზ, გრ, დრ, თლ, თრ, თღ, 
კლ, კმ, კნ, კრ, კვ, მტ, პლ, პრ, ჟღ, რგ, რლ, რმ, სწ, სხ, ტკ, 
ტპ, ტრ, ფლ, ფრ, ფქ, ფშ, ქლ, ქნ, ქვ, ქრ, ღლ, ღრ, ყლ, ყრ, შთ, 
შპ, ჩქ, ჩრ, ცლ, ცნ, ცრ, ცვ, ძგ, ძვ, ძღ, წლ, წრ, წნ, წკ, ჭკ, 
ჭრ, ჭყ, ხლ, ხმ, ხნ, ხვ, ჯგ

Syllable Patterns:

  • V-V: Split between vowels (გა­ა­ნა­თლე­ბა)
  • V-C-V: Split after first vowel (მა­მა)
  • V-CC-V: Split between consonants (გარ­გა­რი)
  • V-ტრ-V: Keep harmonic clusters together (ას­ტრო­ნო­მი­ა)
  • V-სს-V: Split gemination (კლას­სი)
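
The first three patterns can be sketched in a few lines. The function below is a deliberately simplified illustration, not the library's algorithm: it ignores harmonic clusters, the exception dictionary, and the margin settings, and for two or more consonants between vowels it always splits after the first consonant. `naive_syllables` is a hypothetical name.

```python
# Simplified illustration of the V-V, V-C-V and V-CC-V patterns only.
from typing import List

VOWELS = set("აეიოუ")

def naive_syllables(word: str) -> List[str]:
    vowel_positions = [i for i, ch in enumerate(word) if ch in VOWELS]
    if len(vowel_positions) < 2:
        return [word] if word else []

    breaks = []
    for left, right in zip(vowel_positions, vowel_positions[1:]):
        gap = right - left - 1          # consonants between the two vowels
        if gap == 0:
            breaks.append(right)        # V-V: split between the vowels
        elif gap == 1:
            breaks.append(left + 1)     # V-C-V: split after the first vowel
        else:
            breaks.append(left + 2)     # V-CC-V: split after first consonant

    pieces, start = [], 0
    for position in breaks:
        pieces.append(word[start:position])
        start = position
    pieces.append(word[start:])
    return pieces

print(naive_syllables("მამა"))        # ['მა', 'მა']
print(naive_syllables("გარგარი"))     # ['გარ', 'გა', 'რი']
print(naive_syllables("საქართველო"))  # ['სა', 'ქარ', 'თვე', 'ლო']
```

Cases where a harmonic cluster should stay together, or where a one-letter syllable at a word edge should be merged, are exactly where this sketch and the real rules would diverge.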

Performance

  • Speed: ~0.05ms per word on average
  • HTML Processing: ~2ms for 1000 words
  • Memory: ~100KB with dictionary loaded
  • Optimization: Uses a set for O(1) cluster lookups
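
Figures like these depend on hardware and Python version, so it can be worth measuring locally. The sketch below is a small timing harness that works with any single-word callable. `mean_ms_per_word` is a hypothetical helper, and the numbers it reports are not expected to match the ones listed above.

```python
# Illustrative timing harness; results vary by machine.
import time
from typing import Callable, Sequence

def mean_ms_per_word(func: Callable[[str], object],
                     words: Sequence[str],
                     repeats: int = 100) -> float:
    """Average wall-clock time per call, in milliseconds."""
    if not words or repeats < 1:
        raise ValueError("need at least one word and one repeat")
    start = time.perf_counter()
    for _ in range(repeats):
        for word in words:
            func(word)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / (repeats * len(words))

# str.upper stands in for a hyphenation call in this standalone example.
sample = ["საქართველო", "თბილისი", "გამარჯობა"]
print(f"{mean_ms_per_word(str.upper, sample):.5f} ms per word")
```

To time the library itself, `func` could be a bound method such as `hyphenator.hyphenate`.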

API Reference

Main Class

GeorgianHyphenator(hyphen_char: str = '\u00AD')

Parameters:

  • hyphen_char (str): Character to use for hyphenation. Default is soft hyphen (U+00AD)

Core Methods:

  • hyphenate(word: str) -> str
  • get_syllables(word: str) -> List[str]
  • hyphenate_text(text: str) -> str
  • apply_algorithm(word: str) -> str

New Utility Methods (v2.2.7):

  • count_syllables(word: str) -> int
  • get_hyphenation_points(word: str) -> int
  • is_georgian(text: str) -> bool
  • can_hyphenate(word: str) -> bool
  • unhyphenate(text: str) -> str
  • hyphenate_words(words: List[str]) -> List[str]
  • hyphenate_html(html: str) -> str

Configuration Methods (v2.2.7):

  • set_left_min(value: int) -> GeorgianHyphenator
  • set_right_min(value: int) -> GeorgianHyphenator
  • set_hyphen_char(char: str) -> GeorgianHyphenator

Dictionary Methods (v2.2.7):

  • load_library(data: Dict[str, str]) -> None
  • load_default_library() -> None
  • add_exception(word: str, hyphenated: str) -> GeorgianHyphenator
  • remove_exception(word: str) -> bool
  • export_dictionary() -> Dict[str, str]
  • get_dictionary_size() -> int

Advanced Methods (v2.2.7):

  • add_harmonic_cluster(cluster: str) -> GeorgianHyphenator
  • remove_harmonic_cluster(cluster: str) -> bool
  • get_harmonic_clusters() -> List[str]

Convenience Functions

hyphenate(word: str, hyphen_char: str = '\u00AD') -> str
get_syllables(word: str) -> List[str]
hyphenate_text(text: str, hyphen_char: str = '\u00AD') -> str
to_tex_pattern(word: str) -> str
to_hunspell_format(word: str) -> str

Changelog

v2.2.7 (2025-02-13) 🎉

New Features (17 functions added):

Utility Functions:

  • count_syllables(word) - Get syllable count
  • get_hyphenation_points(word) - Get hyphen count
  • is_georgian(text) - Validate Georgian text
  • can_hyphenate(word) - Check if word can be hyphenated
  • unhyphenate(text) - Remove all hyphens
  • hyphenate_words(words) - Batch processing
  • hyphenate_html(html) - HTML-aware hyphenation 🌟

Configuration (Chainable):

  • set_left_min(value) - Configure left margin
  • set_right_min(value) - Configure right margin
  • set_hyphen_char(char) - Change hyphen character

Dictionary Management:

  • add_exception(word, hyphenated) - Add custom word
  • remove_exception(word) - Remove exception
  • export_dictionary() - Export as dict
  • get_dictionary_size() - Get word count

Advanced:

  • add_harmonic_cluster(cluster) - Add custom cluster
  • remove_harmonic_cluster(cluster) - Remove cluster
  • get_harmonic_clusters() - List all clusters

Improvements:

  • 🔧 All configuration methods support method chaining
  • 📚 Comprehensive docstrings for all methods
  • ✅ 100% backwards compatible
  • 🎯 No breaking changes

v2.2.6 (2026-01-30)

  • ✨ Preserves regular hyphens in compound words
  • 🐛 Fixed hyphen stripping behavior
  • 📝 Improved documentation

v2.2.5

  • Dictionary support added
  • Performance optimizations

Testing

# Install the package
pip install georgian-hyphenation

# Test in Python
python -c "
from georgian_hyphenation import GeorgianHyphenator
h = GeorgianHyphenator()
h.load_default_library()
print('✅ Dictionary:', h.get_dictionary_size(), 'words')
print('✅ New functions work:', h.count_syllables('გამარჯობა'))
"

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

MIT © Guram Zhgamadze


Author

Guram Zhgamadze



Citation

If you use this library in academic work, please cite:

@software{georgian_hyphenation,
  author = {Zhgamadze, Guram},
  title = {Georgian Hyphenation Library},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/guramzhgamadze/georgian-hyphenation}
}

Made with ❤️ for the Georgian language community

For Georgian language collaboration
