Python module/script to map Nepali ASCII font faces like Preeti, Sagarmatha, and more to devanagari unicode with direct mapping support for docx and txt files.
npTTF2UTF
Python module/script to map Nepali ASCII font faces like Preeti, Sagarmatha, and more to devanagari unicode
This is mainly a Python module that helps map various Nepali ASCII font faces to their Unicode counterparts, and Unicode (or any Unicode-mappable font) back to the Preeti font face. It currently supports directly mapping passed strings (fontmapper.py), font auto-detection and component selection when mapping docx files (docxhandler.py), and mapping plain text files (txthandler.py).
Requirements
- python3
Installation
1) Using pip - Pre packaged
$ pip install npttf2utf
or
$ pip3 install npttf2utf
Or you can package it locally and install it, which is useful if you want to change something before installing.
2) Using pip - Packaged manually
$ pip install build
or
$ pip3 install build
$ git clone https://github.com/trippygeese/npttf2utf.git
$ cd npttf2utf
$ python3 -m build
This will create a .whl file for installation in the 'dist' directory. To install it:
$ cd dist
$ pip install *.whl
or
$ pip3 install *.whl
Usage
1) As CLI application
$ npttf2utf -m [mode] -if [origin font] -of [output font] -dc [docx components] -kf [extra known unicode fonts] -i [input file/string] -o [output file] -mf [mapping definition]
Parameters (Same for both) :
Parameter | Help/Usage |
---|---|
-h* | Shows help and information about the program |
-v* | Shows version information |
-m | Usage mode. "string" to pass input string and output on console window, "docx" for working on docx files and "plain" for working with plaintext files |
-if | The font face which was used for the string or creating the file. In "docx" mode you can use "auto" to autodetect used fonts and map them |
-of* | The font to which the string or file will be mapped. Currently supports 'unicode' and 'Preeti'. Defaults to 'Preeti' if unspecified |
-dc* | The components of docx which will be processed during mapping. Components are separated by a comma ',' (Defaults to all supported components 'body_paragraph,table,shape') |
-kf* | While converting docx to Preeti, text in known Nepali Unicode fonts is mapped to Preeti and other fonts are ignored. Specify fonts separated by commas to add more fonts to the default list "Kalimati,Mangal,Noto Sans Devanagari" |
-i | Input string or path to input file |
-o* | Path to output file. Not required for "string" mode |
-mf* | Path to mapping definition file. If not passed it will look for "map.json" in current script directory |
Note: The parameters marked with * are optional
Example usage:
- To pass string in Preeti in terminal and map it to unicode ("string" mode)
$ npttf2utf -m string -if Preeti -of unicode -i "asdfghjk"
It will map "asdfghjk" to unicode following the mapping for Preeti and output "बकमानजवप".
- To pass string in unicode in terminal and map it to preeti ("string" mode)
$ npttf2utf -m string -if unicode -of Preeti -i "सबिन आचार्य"
It will map "सबिन आचार्य" to Preeti and output ";lag cfrf/\o". (You can also convert other non-unicode fonts like Kantipur, Sagarmatha etc. to Preeti.)
- To convert docx or txt file ("plain"/"docx" mode) to unicode
$ npttf2utf -m docx -if auto -of unicode -i "document_with_ASCII_font_faces.docx" -o "document_mapped_to_unicode.docx"
It will map the content of document to unicode and save it as "document_mapped_to_unicode.docx" ("auto" as font is available for "docx" mode only)
- To convert docx or txt file ("plain"/"docx" mode) to Preeti
$ npttf2utf -m docx -if auto -of Preeti -i "document_with_ASCII_font_faces_or_unicode.docx" -o "document_mapped_to_preeti.docx" -dc "body_paragraph,table,shape" -kf "some,extra,unicode,fonts"
It will map the content of the document to Preeti and save it as "document_mapped_to_preeti.docx" ("auto" as font is available for "docx" mode only)
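When the CLI is driven from another script, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of npttf2utf: the helper name `build_npttf2utf_command` and its parameter names are hypothetical, and it only uses the flags listed in the parameter table above.

```python
def build_npttf2utf_command(mode, input_value, from_font, to_font=None,
                            output_path=None, components=None,
                            known_fonts=None, mapping_file=None):
    """Return the argument list for one npttf2utf call (hypothetical helper)."""
    if mode not in ("string", "docx", "plain"):
        raise ValueError(f"unsupported mode: {mode!r}")
    if from_font == "auto" and mode != "docx":
        # The README states that "auto" is available for "docx" mode only.
        raise ValueError('"auto" font detection is only available in "docx" mode')

    command = ["npttf2utf", "-m", mode, "-if", from_font]
    if to_font is not None:
        command += ["-of", to_font]
    if components:
        command += ["-dc", ",".join(components)]  # comma-separated components
    if known_fonts:
        command += ["-kf", ",".join(known_fonts)]  # comma-separated font names
    command += ["-i", input_value]
    if output_path is not None:
        command += ["-o", output_path]
    if mapping_file is not None:
        command += ["-mf", mapping_file]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting problems with Devanagari input strings.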
2) As python module
$ python
>> import npttf2utf
Class: npttf2utf.FontMapper
"npttf2utf.FontMapper" class can be used to map the fonts to their unicode counterpart. It is also the base for other document converters
Method: __init__
This method initializes the FontMapper class
def __init__(self, map_json):
Returns: None
Argument | Description | Optional |
---|---|---|
map_json | Path to mapping definition file (Must be readable by current user) | False |
Method: map_to_unicode
This method maps the passed string from the defined origin font to unicode using the mapping definition
def map_to_unicode(self, string, from_font="Preeti", unescape_html_input=False, escape_html_output=False):
Returns: String
Argument | Description | Optional |
---|---|---|
string | String to map | False |
from_font | The origin font in which string was written. Defaults to "Preeti" if not passed | True |
unescape_html_input | Unescape HTML entities in input string before mapping them (Bool) (Defaults to False) | True |
escape_html_output | Escape HTML entities in mapped string before returning them (Bool) (Defaults to False) | True |
Example usage:
>> import npttf2utf
>> mapper = npttf2utf.FontMapper("npttf2utf/map.json")
>> mapper.map_to_unicode("asdfghjk", from_font="Preeti", unescape_html_input=False, escape_html_output=False)
बकमानजवप
>>
Method: map_to_preeti
This method maps the passed Devanagari unicode string to preeti using the preetimapper.py
def map_to_preeti(self, string, from_font="Preeti", unescape_html_input=False, escape_html_output=False):
Returns: String
Argument | Description | Optional |
---|---|---|
string | String to map | False |
from_font | The origin font in which string was written. Defaults to "Preeti" if not passed | True |
unescape_html_input | Unescape HTML entities in input string before mapping them (Bool) (Defaults to False) | True |
escape_html_output | Escape HTML entities in mapped string before returning them (Bool) (Defaults to False) | True |
Example usage:
>> import npttf2utf
>> mapper = npttf2utf.FontMapper("npttf2utf/map.json")
>> mapper.map_to_preeti("सबिन आचार्य", from_font="unicode", unescape_html_input=False, escape_html_output=False)
;lag cfrf/\o
>>
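Text that mixes several font faces can be converted piece by piece. The sketch below assumes only that the mapper object exposes `map_to_unicode(string, from_font=...)` as documented above; the helper name `segments_to_unicode` and the `(text, font_name)` segment format are made up for illustration.

```python
def segments_to_unicode(mapper, segments):
    """Map (text, font_name) pairs to one Unicode string.

    `mapper` is assumed to behave like npttf2utf.FontMapper, i.e. to offer
    map_to_unicode(string, from_font=...).
    """
    converted = []
    for text, font_name in segments:
        if font_name.lower() == "unicode":
            converted.append(text)  # already Unicode, nothing to map
        else:
            converted.append(mapper.map_to_unicode(text, from_font=font_name))
    return "".join(converted)
```

Passing the mapper in as an argument keeps the helper independent of where map.json lives, and lets it be tried out with a stand-in object.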
Class: npttf2utf.DocxHandler
"npttf2utf.DocxHandler" class can be used to map docx files to unicode and save them
Method: __init__
This method initializes the DocxHandler class which can be used to map docx files
def __init__(self, rules_file, default_unicode_font_name="Kalimati"):
Returns: None
Argument | Description | Optional |
---|---|---|
rules_file | Path to mapping definition file (Must be readable by current user) | False |
default_unicode_font_name | The name of font which will be set for a converted segment of docx files. (Defaults to "Kalimati") | True |
Method: detect_used_fonts
This method returns list of fonts supported by mapping definition which are used in the docx file
def detect_used_fonts(self, docx_file_path):
Returns: List
Argument | Description | Optional |
---|---|---|
docx_file_path | Path to docx file whose fonts are to be detected | False |
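How DocxHandler detects fonts internally is not described here. As a rough illustration of the idea only, the sketch below reads the run-level font declarations from a docx file using the standard library. The function name `fonts_in_docx` is hypothetical, and the sketch looks only at `word/document.xml`, so fonts set through styles, themes, headers, or footers are not seen.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by docx documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def fonts_in_docx(docx_path, supported_fonts):
    """Return the supported font names referenced by runs in the main document part."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    used = set()
    for r_fonts in root.iter(f"{{{W_NS}}}rFonts"):
        # A run can declare different fonts for ASCII, high-ANSI and complex scripts.
        for attr in ("ascii", "hAnsi", "cs"):
            name = r_fonts.get(f"{{{W_NS}}}{attr}")
            if name:
                used.add(name)

    # Compare case-insensitively, but report the spelling from supported_fonts.
    wanted = {font.lower(): font for font in supported_fonts}
    return sorted({wanted[name.lower()] for name in used if name.lower() in wanted})
```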
Method: map_fonts
This method maps the font in docx file and creates new docx file with mapping applied
def map_fonts(self, orginal_file_path, output_file_path="mapped.docx", from_font="auto", to_font="unicode", components=["body_paragraph", "table", "shape"], known_unicode_fonts=[]):
Returns: None
Argument | Description | Optional |
---|---|---|
original_file_path | Path to docx file whose fonts are to be mapped | False |
output_file_path | Path where the mapped docx file is to be saved (Defaults to "mapped.docx") | True |
from_font | The origin font in which string was written. (Defaults to "auto"). "auto" can be passed to detect used font automatically and map them accordingly and leave english characters untouched | True |
to_font | Target for font conversion. (Defaults to "unicode"). Only "unicode" is supported as of now | True |
components | [List] List of components of docx file which will be looked up for text contents. (Defaults to: ["body_paragraph", "table", "shape"]). "body_paragraph", "table" and "shape" are supported as of now | True |
known_unicode_fonts | [List] List of extra nepali unicode font that when detected will be mapped (Only used while mapping to Preeti) | True |
Example usage:
>> import npttf2utf
>> converter = npttf2utf.DocxHandler("npttf2utf/map.json", default_unicode_font_name="Kalimati")
>> converter.detect_used_fonts("document_with_ASCII_font_faces.docx")
["Preeti", "Sagarmatha"]
>> converter.map_fonts("document_with_ASCII_font_faces.docx", output_file_path="mapped_document.docx", from_font="auto", to_font="unicode", components=["body_paragraph", "table"], known_unicode_fonts=["some", "extra", "nepali", "font"])
>>
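To convert a whole folder of documents, the `map_fonts` call can be wrapped in a loop. This is a minimal sketch under the assumption that the handler behaves like the DocxHandler described above; `convert_folder` is a hypothetical helper, and the input path is passed positionally so the sketch does not depend on the exact name of the first parameter.

```python
from pathlib import Path


def convert_folder(handler, source_dir, target_dir, from_font="auto", to_font="unicode"):
    """Call handler.map_fonts for every .docx file in source_dir (hypothetical helper)."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for docx_path in sorted(source_dir.glob("*.docx")):
        if docx_path.name.startswith("~$"):
            continue  # skip the lock files Word creates next to open documents
        out_path = target_dir / docx_path.name
        handler.map_fonts(str(docx_path), output_file_path=str(out_path),
                          from_font=from_font, to_font=to_font)
        written.append(out_path)
    return written
```

Writing the results to a separate target folder leaves the original documents untouched, which makes it easy to compare input and output afterwards.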
Class: npttf2utf.TxtHandler
"npttf2utf.TxtHandler" class can be used to map plain text files to unicode and save them
Method: __init__
This method initializes the TxtHandler class which can be used to map txt files
def __init__(self, rules_file):
Returns: None
Argument | Description | Optional |
---|---|---|
rules_file | Path to mapping definition file (Must be readable by current user) | False |
Method: map_fonts
This method maps the font in txt file and creates new txt file with mapping applied
def map_fonts(self, orginal_file_path, output_file_path="mapped.txt", from_font="Preeti", to_font="unicode", components=[], known_unicode_fonts=[]):
Returns: None
Argument | Description | Optional |
---|---|---|
original_file_path | Path to txt file whose fonts are to be mapped | False |
output_file_path | Path where the mapped txt file is to be saved (Defaults to "mapped.txt") | True |
from_font | The origin font in which string was written. (Defaults to "Preeti"). | True |
to_font | Target for font conversion. (Defaults to "unicode"). Only "unicode" is supported as of now | True |
components | Serves no purpose, just there to match the method call of DocxHandler | True |
known_unicode_fonts | Serves no purpose, just there to match the method call of DocxHandler | True |
Example usage:
>> import npttf2utf
>> converter = npttf2utf.TxtHandler("npttf2utf/map.json")
>> converter.map_fonts("txt_with_ASCII_font_faces.txt", output_file_path="mapped_txt.txt", from_font="Preeti", to_font="unicode", components=[], known_unicode_fonts=[])
>>
Supported docx components
- Text content in Text boxes/Shapes
- General paragraphs
- Text content in table
Supported ASCII font faces
- Preeti
- Sagarmatha
- Kantipur
- FONTASY_HIMALI_TT
- PCS NEPALI
Supported Output fonts
- Devanagari Unicode
Todos
- Add support for headers/footers
- Optimize the code
- Ability to unify fonts to Preeti as well
Adding support for new file type
Create a new file to handle the file type (you can copy docxhandler.py and modify it as needed). The class constructor should take the mapping definition as its first parameter, and the handler class should provide a "map_fonts" method that takes the original file path, the output path for the converted file, the origin font, the target font, and a list of components as arguments. You can map a string to Unicode with FontMapper.map_to_unicode if a mapping for the origin font exists in the definition. (map_to_unicode takes the unescape_html_input and escape_html_output arguments, which can be used to unescape an HTML string before processing and to escape it before returning.)
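As an illustration of that contract, here is a minimal sketch of a handler for CSV files. `CsvHandler` does not exist in npttf2utf; the class name, the optional `mapper` argument (added so the class can be tried with a stand-in object), and the cell-by-cell strategy are all assumptions.

```python
import csv


class CsvHandler:
    """Hypothetical handler for .csv files following the handler contract above."""

    def __init__(self, rules_file, mapper=None):
        if mapper is None:
            # Imported lazily so the class can be exercised without the package.
            import npttf2utf
            mapper = npttf2utf.FontMapper(rules_file)
        self.mapper = mapper

    def map_fonts(self, original_file_path, output_file_path="mapped.csv",
                  from_font="Preeti", to_font="unicode", components=None):
        # `components` is accepted only to match the other handlers' call shape.
        if to_font != "unicode":
            raise ValueError("this sketch only maps to unicode")
        with open(original_file_path, newline="", encoding="utf-8") as src, \
                open(output_file_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.writer(dst)
            for row in csv.reader(src):
                # Map every cell; delimiters and quoting are handled by the csv module.
                writer.writerow(
                    [self.mapper.map_to_unicode(cell, from_font=from_font) for cell in row]
                )
```

Mapping cell by cell rather than line by line keeps commas and quotes out of the text handed to the mapper, so they cannot be mistaken for Preeti characters.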
Adding mapping for a new font
Open "src/npttf2utf/map.json" and add a JSON key with this structure
"font_name":{
"version": "v1",
"rules": {
"character-map": {
"character-in-origin-font": "unicode-equivalent-character"
},
"pre-rules": [
["regex-string-search", "regex-string-replace"]
],
"post-rules": [
["regex-string-search", "regex-string-replace"]
]
}
}
- pre-rules - Regex find and replace to apply before mapping characters to Unicode
- character-map - Directly mappable characters from the source font to Unicode. For Preeti: a <-> ब
- post-rules - The words may not be as expected directly after mapping, so this contains regexes to find them and replace them with corrections (regex find and replace to apply after mapping characters to Unicode)
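The order in which the three parts of a font definition are used can be sketched as follows. This is not the library's implementation: `apply_font_rules` and `TOY_DEFINITION` are made up for illustration, the sketch assumes single-character keys in the character map, and the toy post-rule uses Python's `\1` backreference syntax, which may differ from what the real map.json uses.

```python
import re


def apply_font_rules(text, font_definition):
    """Apply pre-rules, then the character map, then post-rules (illustrative only)."""
    rules = font_definition["rules"]
    for pattern, replacement in rules.get("pre-rules", []):
        text = re.sub(pattern, replacement, text)

    # Replace each character that has a mapping; leave everything else untouched.
    char_map = rules.get("character-map", {})
    text = "".join(char_map.get(ch, ch) for ch in text)

    for pattern, replacement in rules.get("post-rules", []):
        text = re.sub(pattern, replacement, text)
    return text


# Toy definition with four characters taken from the Preeti examples above.
TOY_DEFINITION = {
    "version": "v1",
    "rules": {
        "character-map": {";": "स", "l": "ि", "a": "ब", "g": "न"},
        "pre-rules": [],
        # In Preeti the short-i sign is typed before its consonant; Unicode
        # stores it after the consonant, so the post-rule swaps the two.
        "post-rules": [["ि(.)", "\\1ि"]],
    },
}
```

With this toy definition, `apply_font_rules(";lag", TOY_DEFINITION)` first yields the characters स, ि, ब, न in typed order, and the post-rule then moves the vowel sign behind ब, giving "सबिन".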
Feel free to use this project for any purpose as long as you comply with the license. Any contribution to the project is highly appreciated. If you find any bugs, please report them.