Georgian Language Hyphenation Library v2.2.1 - Modernized & Optimized with Dictionary Support
Georgian Hyphenation - Python Library
Georgian Language Hyphenation Library v2.2.1 - ქართული ენის დამარცვლის ბიბლიოთეკა
Automatic hyphenation (syllabification) for Georgian text with hybrid engine: Algorithm + Dictionary.
Features
v2.2.1 (Latest)
- Hybrid Engine: Algorithm + Dictionary (150+ exception words)
- Optimized Performance: Set-based harmonic cluster lookup (O(1))
- Strip & Re-hyphenate: Removes existing hyphens from the input and re-hyphenates it
- ๐ต Harmonic Clusters: Preserves natural Georgian sound clusters (แแ, แแ, แแ , etc.)
- Gemination Handling: Splits double consonants correctly (rare in Georgian)
- Anti-Orphan Protection: Minimum 2 characters on each side of a break
- Pure Python: No external dependencies
- Unicode Support: Full Georgian script support
Core Algorithm
- Phonological distance analysis
- Vowel-based syllable detection
- Contextual consonant cluster handling
- Punctuation preservation
Installation
pip install georgian-hyphenation
Requirements
- Python 3.7+
- No external dependencies (uses only standard library)
Quick Start
Basic Usage
from georgian_hyphenation import GeorgianHyphenator
# Initialize with a visible hyphen
hyphenator = GeorgianHyphenator('-')
# Hyphenate a single word
print(hyphenator.hyphenate('საქართველო'))
# Output: სა-ქარ-თვე-ლო
# Hyphenate text
text = 'საქართველო არის ლამაზი ქვეყანა'
print(hyphenator.hyphenate_text(text))
# Output: სა-ქარ-თვე-ლო არის ლა-მა-ზი ქვე-ყა-ნა
# Get syllables as a list
syllables = hyphenator.get_syllables('დედაქალაქი')
print(syllables)
# Output: ['დე', 'და', 'ქა', 'ლა', 'ქი']
Using Dictionary (Recommended)
from georgian_hyphenation import GeorgianHyphenator
hyphenator = GeorgianHyphenator('-')
# Load the default dictionary (150+ exception words)
hyphenator.load_default_library()
# Hyphenation now checks the dictionary first, then falls back to the algorithm
print(hyphenator.hyphenate('კომპიუტერი'))
# Output: კომ-პიუ-ტე-რი (from dictionary)
Convenience Functions
from georgian_hyphenation import hyphenate, get_syllables, hyphenate_text
# Quick hyphenation with default settings
print(hyphenate('საქართველო'))
# Output: სა\u00adქარ\u00adთვე\u00adლო (soft hyphens, U+00AD, shown escaped here)
# Get syllables
print(get_syllables('მთავრობა'))
# Output: ['მთავ', 'რო', 'ბა']
# Hyphenate entire text
text = 'საქართველო არის ლამაზი ქვეყანა'
print(hyphenate_text(text))
Hyphen Character Options
Soft Hyphen (Invisible, default)
# Soft hyphen (U+00AD) - invisible, only appears at line breaks
hyphenator = GeorgianHyphenator('\u00AD')
print(hyphenator.hyphenate('საქართველო'))
# Output: საქართველო (the three soft hyphens stay invisible until the line wraps)
Visible Hyphen
# Regular hyphen - always visible
hyphenator = GeorgianHyphenator('-')
print(hyphenator.hyphenate('საქართველო'))
# Output: სა-ქარ-თვე-ლო
Middle Dot
# Middle dot - useful for visualization
hyphenator = GeorgianHyphenator('·')
print(hyphenator.hyphenate('საქართველო'))
# Output: სა·ქარ·თვე·ლო
Custom Character
# Any character you want
hyphenator = GeorgianHyphenator('|')
print(hyphenator.hyphenate('საქართველო'))
# Output: სა|ქარ|თვე|ლო
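Because soft hyphens are invisible, it can be hard to tell whether a string has already been hyphenated. The following is a minimal sketch, independent of the library, that counts, reveals and strips U+00AD characters. The helper names `strip_soft_hyphens` and `reveal_soft_hyphens` are hypothetical and are not part of the `georgian_hyphenation` API.

```python
SOFT_HYPHEN = "\u00ad"  # invisible unless a renderer breaks the line there


def strip_soft_hyphens(text: str) -> str:
    """Return the text with every soft hyphen removed."""
    return text.replace(SOFT_HYPHEN, "")


def reveal_soft_hyphens(text: str, marker: str = "·") -> str:
    """Swap soft hyphens for a visible marker (debugging aid only)."""
    return text.replace(SOFT_HYPHEN, marker)


# Build a soft-hyphenated word by hand so the sketch needs no dependencies.
hyphenated = SOFT_HYPHEN.join(["სა", "ქარ", "თვე", "ლო"])

print(len(hyphenated))                  # 13: ten letters plus three soft hyphens
print(reveal_soft_hyphens(hyphenated))  # სა·ქარ·თვე·ლო
print(strip_soft_hyphens(hyphenated) == "საქართველო")  # True
```

Stripping soft hyphens before comparing or searching strings avoids mismatches between hyphenated and plain text.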
Advanced Usage
Custom Dictionary
from georgian_hyphenation import GeorgianHyphenator
hyphenator = GeorgianHyphenator('-')
# Add your own exception words
custom_dict = {
    'კომპიუტერი': 'კომ-პიუ-ტე-რი',
    'პროგრამა': 'პროგ-რა-მა',
    'ინტერნეტი': 'ინ-ტერ-ნე-ტი'
}
hyphenator.load_library(custom_dict)
# These words now use your custom hyphenation
print(hyphenator.hyphenate('კომპიუტერი'))
# Output: კომ-პიუ-ტე-რი
Combining Default + Custom Dictionary
hyphenator = GeorgianHyphenator('-')
# Load the default dictionary first
hyphenator.load_default_library()
# Add your custom words
hyphenator.load_library({
    'სპეციალური': 'სპე-ცი-ა-ლუ-რი'
})
# The hyphenator now has both default and custom exceptions
Export Formats
from georgian_hyphenation import to_tex_pattern, to_hunspell_format
# TeX hyphenation pattern
print(to_tex_pattern('საქართველო'))
# Output: .სა1ქარ1თვე1ლო.
# Hunspell format
print(to_hunspell_format('საქართველო'))
# Output: სა=ქარ=თვე=ლო
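Judging from the outputs shown above, both export formats are plain joins over the syllable list: TeX uses `1` between syllables with a dot on each end, and Hunspell uses `=`. The sketch below reproduces that mapping without the library. The function names are hypothetical, and the library's own implementation may differ.

```python
from typing import List


def tex_pattern_from_syllables(syllables: List[str]) -> str:
    """Join syllables with '1' (an allowed break) and wrap in word-boundary dots."""
    return "." + "1".join(syllables) + "."


def hunspell_from_syllables(syllables: List[str]) -> str:
    """Join syllables with '=' as in the Hunspell-style output shown above."""
    return "=".join(syllables)


syllables = ["სა", "ქარ", "თვე", "ლო"]
print(tex_pattern_from_syllables(syllables))  # .სა1ქარ1თვე1ლო.
print(hunspell_from_syllables(syllables))     # სა=ქარ=თვე=ლო
```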
Processing Files
from georgian_hyphenation import GeorgianHyphenator
hyphenator = GeorgianHyphenator('\u00AD')
hyphenator.load_default_library()
# Read the input file
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# Hyphenate
hyphenated = hyphenator.hyphenate_text(text)
# Write the output file
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(hyphenated)
How It Works
v2.2.1 Hybrid Engine
- Sanitization: Strip existing hyphens from input
- Dictionary Lookup: Check exception words first (if loaded)
- Algorithm Fallback: Apply phonological rules if not in dictionary
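The three steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the library's code: `hybrid_hyphenate` and `strip_hyphens` are hypothetical names, the exception dictionary is assumed to store entries with `-` as in data/exceptions.json, and the algorithm is passed in as a callable so the sketch stays self-contained.

```python
from typing import Callable, Dict

SOFT_HYPHEN = "\u00ad"


def strip_hyphens(word: str, hyphen_char: str) -> str:
    """Step 1 (sanitization): drop the configured hyphen, soft hyphens and '-'."""
    for ch in {hyphen_char, SOFT_HYPHEN, "-"}:
        if ch:  # skip an empty hyphen character
            word = word.replace(ch, "")
    return word


def hybrid_hyphenate(
    word: str,
    exceptions: Dict[str, str],
    algorithm: Callable[[str], str],
    hyphen_char: str = "-",
) -> str:
    clean = strip_hyphens(word, hyphen_char)
    # Step 2 (dictionary lookup): exception entries win when present.
    entry = exceptions.get(clean)
    if entry is not None:
        return entry.replace("-", hyphen_char)
    # Step 3 (algorithm fallback): delegate to the rule-based hyphenator.
    return algorithm(clean)


exceptions = {"კომპიუტერი": "კომ-პიუ-ტე-რი"}
# A wrongly pre-hyphenated input is cleaned, then resolved from the dictionary.
print(hybrid_hyphenate("კომპ-იუტერი", exceptions, lambda w: w, "|"))  # კომ|პიუ|ტე|რი
```

Passing the algorithm as a parameter keeps the dictionary logic testable on its own.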
Algorithm Rules
1. Vowel Detection
საქართველო → vowels at positions (0-based): [1, 3, 7, 9]
2. Consonant Cluster Analysis
Between each vowel pair:
- 0 consonants (V-V): Split between vowels
'แแแแแแแ' โ 'แแ-แ-แแ-แแ'
- 1 consonant (V-C-V): Split after first vowel
'แแแแ' โ 'แแ-แแ'
- 2+ consonants (V-CC...C-V):
- Check for gemination (double consonants) - rare in Georgian
'แกแแแแ' โ 'แกแแ-แแ' # Split between double 'แ' (if exists)
- Check for harmonic clusters
'แแแแแ' โ 'แแแ-แแ' # Keep 'แแ' together
- Default: Split after first consonant
'แแแ แแแ แ' โ 'แแแ -แแ-แ แ'
3. Harmonic Clusters (62 clusters)
These consonant pairs stay together:
แแ, แแ , แแฆ, แแ, แแ, แแ, แแ, แแ, แแ, แแ, แแ , แแ , แแ, แแ , แแฆ,
แแ, แแ, แแ, แแ , แแ, แแข, แแ, แแ , แแฆ, แ แ, แ แ, แ แ, แกแฌ, แกแฎ, แขแ,
แขแ, แขแ , แคแ, แคแ , แคแฅ, แคแจ, แฅแ, แฅแ, แฅแ, แฅแ , แฆแ, แฆแ , แงแ, แงแ , แจแ,
แจแ, แฉแฅ, แฉแ , แชแ, แชแ, แชแ , แชแ, แซแ, แซแ, แซแฆ, แฌแ, แฌแ , แฌแ, แฌแ, แญแ,
แญแ , แญแง, แฎแ, แฎแ, แฎแ, แฎแ, แฏแ
4. Anti-Orphan Protection
Minimum 2 characters on each side:
'แแ แ' โ 'แแ แ' # Not split (would create 1-letter syllable)
'แแ แแ' โ 'แ-แ แ-แ' # OK to split
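The cluster rules above (rules 1 to 3) can be sketched as follows. This is a simplified illustration, not the library's implementation: it omits the dictionary and the anti-orphan check, `HARMONIC_CLUSTERS` is only a small illustrative subset of the full list, and it assumes that a harmonic cluster is detected by looking at the last two consonants before the next vowel. The names `split_points` and `syllabify` are hypothetical.

```python
from typing import List

VOWELS = set("აეიოუ")  # the five Georgian vowels
# Illustrative subset only; the library's real cluster list is longer.
HARMONIC_CLUSTERS = {"სწ", "სხ", "ფქ", "ფშ", "ჩქ", "ძღ", "ჭყ", "ტრ", "ფრ", "ქრ"}


def split_points(word: str) -> List[int]:
    """Return the indices at which the word may be broken."""
    vowels = [i for i, ch in enumerate(word) if ch in VOWELS]
    points = []
    for v1, v2 in zip(vowels, vowels[1:]):
        cluster = word[v1 + 1:v2]  # consonants between two neighbouring vowels
        if len(cluster) <= 1:
            # V-V or V-C-V: break right after the first vowel.
            points.append(v1 + 1)
            continue
        # Gemination: break between the two identical consonants.
        double = next(
            (j for j in range(len(cluster) - 1) if cluster[j] == cluster[j + 1]),
            None,
        )
        if double is not None:
            points.append(v1 + 1 + double + 1)
        elif cluster[-2:] in HARMONIC_CLUSTERS:
            # Keep the harmonic pair together by breaking just before it.
            points.append(v2 - 2)
        else:
            # Default: break after the first consonant of the cluster.
            points.append(v1 + 2)
    return points


def syllabify(word: str) -> List[str]:
    pieces, start = [], 0
    for point in split_points(word):
        pieces.append(word[start:point])
        start = point
    pieces.append(word[start:])
    return pieces


print("-".join(syllabify("საქართველო")))  # სა-ქარ-თვე-ლო
print("-".join(syllabify("ასტრონომია")))  # ას-ტრო-ნო-მი-ა
```

A production version would also filter out break points that leave fewer than two characters on either side, as rule 4 describes.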
Examples
Basic Words
hyphenate('საქართველო') # → სა-ქარ-თვე-ლო
hyphenate('მთავრობა') # → მთავ-რო-ბა
hyphenate('დედაქალაქი') # → დე-და-ქა-ლა-ქი
hyphenate('პარლამენტი') # → პარ-ლა-მენ-ტი
V-C-V Pattern (Single Consonant)
hyphenate('แแแแกแ') # โ แแแ-แกแ
hyphenate('แแแกแ') # โ แแ-แกแ
hyphenate('แแแแ') # โ แแ-แแ
hyphenate('แแแแ') # โ แแ-แแ
Harmonic Clusters
hyphenate('แแแแแ') # โ แแแ-แแ (keeps แแ)
hyphenate('แแ แแแ') # โ แแ แ-แแ (keeps แแ )
hyphenate('แแแแฎแ') # โ แแแ-แฎแ (keeps แแ)
hyphenate('แขแ แแแแแ') # โ แขแ แแ-แแ-แ (keeps แขแ )
hyphenate('แแ แแแ แแแ') # โ แแ แแ-แ แ-แแ (keeps แแ and แแ )
V-V Split
hyphenate('แแแแแแแ') # โ แแ-แ-แแ-แแ
hyphenate('แแแแแ แ') # โ แแ-แ-แ-แ แ
hyphenate('แแแจแแแ') # โ แ-แ-แจแ-แแ
hyphenate('แแแแแแแแแ') # โ แแ-แ-แแ-แแ-แแ
Complex Words
hyphenate('แแแแแ แแแ') # โ แแแแ-แ แ-แแ
hyphenate('แกแแแแแแ แแแ') # โ แกแแ-แแแ-แ แ-แแ
hyphenate('แแแ แแแ แ') # โ แแแ -แแ-แ แ
hyphenate('แแกแขแ แแแแแแ') # โ แแก-แขแ แ-แแ-แแ-แ
Text Processing
# Soft hyphens (U+00AD) in the results below are shown escaped as \u00ad
text = 'საქართველო არის ლამაზი ქვეყანა'
hyphenate_text(text)
# → 'სა\u00adქარ\u00adთვე\u00adლო არის ლა\u00adმა\u00adზი ქვე\u00adყა\u00adნა'
# Preserves punctuation
text = 'მთავრობა, პარლამენტი და სასამართლო.'
hyphenate_text(text)
# → 'მთავ\u00adრო\u00adბა, პარ\u00adლა\u00adმენ\u00adტი და სა\u00adსა\u00adმარ\u00adთლო.'
# Preserves numbers and Latin text
text = 'საქართველოში 2025 წელს'
hyphenate_text(text)
# → 'სა\u00adქარ\u00adთვე\u00adლო\u00adში 2025 წელს'
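One way to hyphenate running text while leaving punctuation, digits and Latin text untouched is to rewrite only the runs of Georgian letters. The sketch below shows that idea with the standard `re` module. It is an assumption about the approach, not the library's code, and `hyphenate_words` is a hypothetical helper that takes any word-level function.

```python
import re
from typing import Callable

# Georgian (Mkhedruli) letters in the U+10D0 block, skipping the punctuation
# mark U+10FB and the modifier letter U+10FC.
GEORGIAN_WORD = re.compile(r"[\u10D0-\u10FA\u10FD-\u10FF]+")


def hyphenate_words(text: str, hyphenate_word: Callable[[str], str]) -> str:
    """Apply hyphenate_word to each Georgian word; leave everything else as-is."""
    return GEORGIAN_WORD.sub(lambda match: hyphenate_word(match.group(0)), text)


# A stand-in word function that just brackets each word it receives.
print(hyphenate_words("საქართველოში 2025 წელს, ok!", lambda w: "<" + w + ">"))
# <საქართველოში> 2025 <წელს>, ok!
```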
Get Syllables
get_syllables('საქართველო') # → ['სა', 'ქარ', 'თვე', 'ლო']
get_syllables('დედაქალაქი') # → ['დე', 'და', 'ქა', 'ლა', 'ქი']
get_syllables('მთავრობა') # → ['მთავ', 'რო', 'ბა']
get_syllables('แแแแแ') # โ ['แแแ', 'แแ']
Dictionary
The library includes data/exceptions.json with 150+ Georgian words that require special hyphenation:
{
    "კომპიუტერი": "კომ-პიუ-ტე-რი",
    "ინტერნეტი": "ინ-ტერ-ნე-ტი",
    "საქართველო": "სა-ქარ-თვე-ლო",
    "პროგრამა": "პროგ-რა-მა",
    "მთავრობა": "მთავ-რო-ბა"
}
Load it with:
hyphenator.load_default_library()
API Reference
Class: GeorgianHyphenator
class GeorgianHyphenator:
    def __init__(self, hyphen_char: str = '\u00AD'): ...
Parameters:
hyphen_char (str): Character to use for hyphenation. Default: soft hyphen (\u00AD)
Methods
hyphenate(word: str) → str
Hyphenate a single Georgian word.
hyphenator = GeorgianHyphenator('-')
result = hyphenator.hyphenate('საქართველო')
# Returns: 'სა-ქარ-თვე-ლო'
hyphenate_text(text: str) → str
Hyphenate entire text (preserves punctuation and non-Georgian characters).
hyphenator = GeorgianHyphenator('-')
result = hyphenator.hyphenate_text('საქართველო არის ლამაზი')
# Returns: 'სა-ქარ-თვე-ლო არის ლა-მა-ზი'
get_syllables(word: str) → List[str]
Get syllables as a list.
hyphenator = GeorgianHyphenator('-')
syllables = hyphenator.get_syllables('საქართველო')
# Returns: ['სა', 'ქარ', 'თვე', 'ლო']
load_library(data: Dict[str, str]) → None
Load custom dictionary.
hyphenator.load_library({
'แกแแขแงแแ': 'แกแ-แขแงแแ',
'แแแแแแแแ': 'แแ-แแ-แแ-แแ'
})
load_default_library() → None
Load default exception dictionary from data/exceptions.json.
hyphenator.load_default_library()
Convenience Functions
hyphenate(word: str, hyphen_char: str = '\u00AD') → str
from georgian_hyphenation import hyphenate
result = hyphenate('საქართველო', '-')
get_syllables(word: str) → List[str]
from georgian_hyphenation import get_syllables
syllables = get_syllables('საქართველო')
hyphenate_text(text: str, hyphen_char: str = '\u00AD') → str
from georgian_hyphenation import hyphenate_text
result = hyphenate_text('საქართველო არის ლამაზი')
to_tex_pattern(word: str) → str
from georgian_hyphenation import to_tex_pattern
pattern = to_tex_pattern('საქართველო')
# Returns: '.სა1ქარ1თვე1ლო.'
to_hunspell_format(word: str) → str
from georgian_hyphenation import to_hunspell_format
hunspell = to_hunspell_format('საქართველო')
# Returns: 'სა=ქარ=თვე=ლო'
Testing
Run the test suite:
python test_python.py
Expected output:
๐งช Georgian Hyphenation v2.2.1 - Python Tests
๐ Basic Hyphenation Tests:
Test 1: საქართველო
Result: სა-ქარ-თვე-ლო
...
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ Test Results: 13 passed, 0 failed
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ All tests passed!
Project Structure
georgian-hyphenation/
├── data/
│   └── exceptions.json       # Dictionary (150+ words)
├── src/
│   └── georgian_hyphenation/
│       ├── __init__.py       # Package init
│       └── hyphenator.py     # Main code
├── test_python.py            # Test suite
├── pyproject.toml            # Package config
├── MANIFEST.in               # Data files manifest
├── README.md                 # This file
└── LICENSE.txt               # MIT License
Changelog
v2.2.1 (2025-01-27)
- Optimized: Set-based harmonic cluster lookup (O(1) instead of O(n))
- โจ Added 12 new harmonic clusters: แแ , แแ , แแ , แแฆ, แแข, แจแ, แฉแ , แฌแ, แญแง
- Strip & Re-hyphenate: Always removes old hyphens and reapplies them correctly
- Dictionary: 150+ exception words in data/exceptions.json
- Hybrid Engine: Dictionary-first, Algorithm fallback
- Improved documentation with detailed API reference
v2.0.0 (2024)
- Initial release
- Phonological algorithm
- Basic harmonic cluster handling
- TeX and Hunspell export formats
Contributing
Contributions are welcome! To contribute:
- Fork the repository: https://github.com/guramzhgamadze/georgian-hyphenation
- Create a feature branch: git checkout -b feature/new-feature
- Make your changes
- Run tests: python test_python.py
- Commit: git commit -m 'Add new feature'
- Push: git push origin feature/new-feature
- Open a Pull Request
Adding Exception Words
To add words to the dictionary:
- Edit data/exceptions.json
- Add your word in the format: "სიტყვა": "სი-ტყვა"
- Test: python test_python.py
- Submit a PR
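Before submitting new exception words, it can help to check that each entry is internally consistent: removing the hyphens from the value should give back the key, and no syllable should be empty. The sketch below is a hypothetical pre-submission check, not part of the library or its test suite. It parses an inline JSON sample instead of reading data/exceptions.json so that it runs on its own.

```python
import json
from typing import Dict, List


def find_invalid_entries(exceptions: Dict[str, str]) -> List[str]:
    """Return the words whose hyphenated form does not match the key."""
    invalid = []
    for word, hyphenated in exceptions.items():
        parts = hyphenated.split("-")
        # Every part must be non-empty and the parts must rebuild the word.
        if "" in parts or "".join(parts) != word:
            invalid.append(word)
    return invalid


# Inline sample; the second entry is deliberately missing its last letter.
sample = json.loads('{"სიტყვა": "სი-ტყვა", "კომპიუტერი": "კომ-პიუ-ტერ"}')
print(find_invalid_entries(sample))  # ['კომპიუტერი']
```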
Bug Reports
Found a bug? Please open an issue: https://github.com/guramzhgamadze/georgian-hyphenation/issues
Include:
- Python version
- Code snippet that reproduces the issue
- Expected vs actual output
License
MIT License
Copyright (c) 2025 Guram Zhgamadze
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Author
Guram Zhgamadze
- GitHub: @guramzhgamadze
- Email: guramzhgamadze@gmail.com
- PyPI: georgian-hyphenation
Acknowledgments
- Georgian linguistic research on syllabification
- TeX hyphenation algorithm inspiration
- Python community for excellent packaging tools
Related Projects
- Hyphen - Generic hyphenation library
- PyHyphen - Python wrapper for Hyphen
- TeX hyphenation patterns
Support
If you find this library useful, please:
- Star the repository on GitHub
- Share with others
- Report bugs
- Suggest improvements
Made with ❤️ for the Georgian language community
ქართული ენის ციფრული განვითარებისთვის (For the digital development of the Georgian language)