Skip to main content

Python library for UTF to WX conversion and vice-versa for Indian languages.

Project description

Build Status Coverage Status CircleCI

Python library for UTF to WX conversion and vice-versa for Indian languages.

Installation

pip install wxconv

Examples

1. work with files:

wxconv --f ssf --t intra --n --l hin --s utf --i hin-utf.ssf --o hin-wx.ssf

-l , --language     select language (3 letter ISO-639 code)
                                    Hindi       : hin
                                    Telugu      : tel
                                    Tamil       : tam
                                    Malayalam   : mal
                                    Kannada     : kan
                                    Bengali     : ben
                                    Oriya       : ori
                                    Punjabi     : pan
                                    Marathi     : mar
                                    Nepali      : nep
                                    Gujarati    : guj
                                    Bodo        : bod
                                    Konkani     : kok
                                    Assamese    : asm
                                    Urdu        : urd
-s , --source-enc   {utf, wx} select input-file encoding
-f , --format       {text, ssf, conll, bio, tnt} select input-file format
-t , --ssf-type     {inter, intra} specify ssf-type if file format (-f) is ssf
-n, --nested        set this flag for nested ssf
-m, --no-mask       set this flag to keep off masking of roman strings in Indic text
-i , --input        <input-file>
-o , --output       <output-file>
-z, --normalize     set this flag for utf normalizations without WX-Conversion

2. utf to wx (plain text):

>>> from wxconv import WXC
>>>
>>> # class WXC(order="utf2wx", format_="text", lang="hin")
... # Parameters: order:str, (default="utf2wx"), source2target encoding [wx2utf|utf2wx]
... #             format_:str, (default="text"), output format [text|conll|bio|tnt]
... #             lang:str, (default="hin"), language parameter [hin|tel|...] (3 letter ISO-639 code)

>>> con = WXC(order='utf2wx')  # here default language is hindi and default format is text
>>>
>>> hin = u"""
... 1   देश के कई हिस्सों में सूखे के आसार उत्पन्न हो गए हैं
... 2   अब तक मौसम विभाग सामान्य बारिश होने की अपनी भविष्यवाणी पर अड़ा हुआ था लेकिन अब यह दावा पूरी तरह से खारिज हो गया है
... 3   देश भर में अब तक हुई बारिश औसत से छह फीसदी कम है जबकि विभाग का दावा था कि इसमें ५ फीसदी से ज्यादा कमी नहीं होगी
... 4   इसके चलते उत्तर प्रदेश पंजाब हरियाणा राजस्थान बिहार झारखंड आदि राज्य लगभग सूखे की चपेट में हैं
... 5   लेकिन तकनीकी कारणों से इन्हें अभी सूखाग्रस्त घोषित नहीं किया गया है
... """
>>>
>>> print(con.convert(hin))

1   xeSa ke kaI hissoM meM sUKe ke AsAra uwpanna ho gae hEM
2   aba waka mOsama viBAga sAmAnya bAriSa hone kI apanI BaviRyavANI para adZA huA WA lekina aba yaha xAvA pUrI waraha se KArija ho gayA hE
3   xeSa Bara meM aba waka huI bAriSa Osawa se Caha PIsaxI kama hE jabaki viBAga kA xAvA WA ki isameM 5 PIsaxI se jyAxA kamI nahIM hogI
4   isake calawe uwwara praxeSa paMjAba hariyANA rAjasWAna bihAra JAraKaMda Axi rAjya lagaBaga sUKe kI capeta meM hEM
5   lekina wakanIkI kAraNoM se inheM aBI sUKAgraswa GoRiwa nahIM kiyA gayA hE

>>> tel = u"""
... 1   తమ 35 ఏళ్ల పెళ్లి సందర్భంలోనూ - అనుక్షణం శత్రువులకు మల్లే కాట్లాడుకోవటం బంటీకి నచ్చదు.
... 2   2007లో అజ్మీర్‌లోని ఖాజా మొయినుద్దీన్ చిష్తీ దర్గాలో జరిగిన పేలుడులో ముగ్గురు చనిపోగా, మరో 15 మంది గాయపడిన విషయం తెలిసిందే.
... 3   గణితం లాంటి విషయం గురించి బ్లాగులు, వీకేల ద్వారా చర్చలు జరగవచ్చునని అందరికీ అర్థమయింది.
... 4   ఇలా ఇంత పోటీలో, యాంటీ బ్రిటిష్ వాతావరణంలోను 155 నిముషాల నిడివిగల ‘హామ్లెట్’ నిలిచి గెలిచిందంటే అది దాని చక్కదనానికి నిదర్శనమే!
... 5   అవకాశం వచ్చిన వారికి ఎక్స్‌పోజర్‌కూడా వస్తుంది.
... """
>>>
>>> con = WXC(order='utf2wx', lang='tel')
>>>
>>> print(con.convert(tel))

1   wama 35 elYla peVlYli saMxarBaMlonU - anukRaNaM Sawruvulaku malle kAtlAdukovataM baMtIki naccaxu.
2   2007lo ajmIrloni KAjA moVyinuxxIn ciRwI xargAlo jarigina peludulo mugguru canipogA, maro 15 maMxi gAyapadina viRayaM weVlisiMxe.
3   gaNiwaM lAMti viRayaM guriMci blAgulu, vIkela xvArA carcalu jaragavaccunani aMxarikI arWamayiMxi.
4   ilA iMwa potIlo, yAMtI britiR vAwAvaraNaMlonu 155 nimuRAla nidivigala hAmleVt nilici geVliciMxaMte axi xAni cakkaxanAniki nixarSaname!
5   avakASaM vaccina vAriki eVkspojarkUdA vaswuMxi.

>>>

3. wx to utf:

>>> con = WXC(order='wx2utf', lang='hin')
>>>
>>> hin = """
... 1   xeSa ke kaI hissoM meM sUKe ke AsAra uwpanna ho gae hEM
... 2   aba waka mOsama viBAga sAmAnya bAriSa hone kI apanI BaviRyavANI para adZA huA WA lekina aba yaha xAvA pUrI waraha se KArija ho gayA hE
... 3   xeSa Bara meM aba waka huI bAriSa Osawa se Caha PIsaxI kama hE jabaki viBAga kA xAvA WA ki isameM 5 PIsaxI se jyAxA kamI nahIM hogI
... 4   isake calawe uwwara praxeSa paMjAba hariyANA rAjasWAna bihAra JAraKaMda Axi rAjya lagaBaga sUKe kI capeta meM hEM
... 5   lekina wakanIkI kAraNoM se inheM aBI sUKAgraswa GoRiwa nahIM kiyA gayA hE
... """
>>>
>>> print(con.convert(hin))

1   देश के कई हिस्सों में सूखे के आसार उत्पन्न हो गए हैं
2   अब तक मौसम विभाग सामान्य बारिश होने की अपनी भविष्यवाणी पर अड़ा हुआ था लेकिन अब यह दावा पूरी तरह से खारिज हो गया है
3   देश भर में अब तक हुई बारिश औसत से छह फीसदी कम है जबकि विभाग का दावा था कि इसमें 5 फीसदी से ज्यादा कमी नहीं होगी
4   इसके चलते उत्तर प्रदेश पंजाब हरियाणा राजस्थान बिहार झारखंड आदि राज्य लगभग सूखे की चपेट में हैं
5   लेकिन तकनीकी कारणों से इन्हें अभी सूखाग्रस्त घोषित नहीं किया गया है

>>>

4. work with conll:

>>> con = WXC(order='utf2wx', lang='hin', format_='conll')
>>>
>>> conll = u"""
... 1       इसकी     यह      pn      PRP     cat-pn|gen-f|num-sg|pers-3|case-o|vib-का|tam-kA|chunkId-NP|chunkType-head|stype-|voicetype-      2     r6      _       _
... 2       ऊँचाई     ऊँचाई     n       NN      cat-n|gen-f|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP2|chunkType-head|stype-|voicetype-       6     k1      _       _
... 3       केवल     केवल     avy     RP      cat-avy|gen-|num-|pers-|case-|vib-|tam-|chunkId-NP3|chunkType-child|stype-|voicetype-   4       lwg__rp _       _
... 4       1982    1982    num     QC      cat-num|gen-any|num-any|pers-|case-any|vib-|tam-|chunkId-NP3|chunkType-child|stype-|voicetype-  5       nmod__adj       _       _
... 5       मीटर     मीटर     n       NN      cat-n|gen-m|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP3|chunkType-head|stype-|voicetype-       6     k1s     _       _
... 6       है       है       v       VM      cat-v|gen-any|num-sg|pers-3|case-|vib-है|tam-hE|chunkId-VGF|chunkType-head|stype-declarative|voicetype-active    0       root    _       _
... 7       ।       ।       punc    SYM     cat-punc|gen-|num-|pers-|case-|vib-|tam-|chunkId-BLK|chunkType-head|stype-|voicetype-   6       rsym    _       _"""
>>>
>>> print(con.convert(conll))

1   isakI   yaha    pn      PRP cat-pn|gen-f|num-sg|pers-3|case-o|vib-kA|tam-kA|chunkId-NP|chunkType-head|stype-|voicetype- 2   r6  _   _
2   UzcAI   UzcAI   n       NN  cat-n|gen-f|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP2|chunkType-head|stype-|voicetype-   6   k1  _   _
3   kevala  kevala  avy     RP  cat-avy|gen-|num-|pers-|case-|vib-|tam-|chunkId-NP3|chunkType-child|stype-|voicetype-   4   lwg__rp_    _
4   1982    1982    num     QC  cat-num|gen-any|num-any|pers-|case-any|vib-|tam-|chunkId-NP3|chunkType-child|stype-|voicetype-  5   nmod__adj   _   _
5   mItara  mItara  n       NN  cat-n|gen-m|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP3|chunkType-head|stype-|voicetype-   6   k1s _   _
6   hE      hE  v   VM  cat-v|gen-any|num-sg|pers-3|case-|vib-hE|tam-hE|chunkId-VGF|chunkType-head|stype-declarative|voicetype-active   0   root    _   _
7   .       .   punc    SYM cat-punc|gen-|num-|pers-|case-|vib-|tam-|chunkId-BLK|chunkType-head|stype-|voicetype-   6   rsym    __
>>>

5. work with tnt:

>>> tnt = u"""
... यों       RB
... सिंगल     JJ
... स्क्रीन    NNC
... थिएटर    NNP
... के       PSP
... दर्शकों    NN
... को       PSP
... अग्निपथ   NNP
... अधिक     QF
... नहीं      NEG
... भा       VM
... सकी      VAUX
... ।       SYM
... """
>>>
>>> con = WXC(order='utf2wx', lang='hin', format_='tnt')
>>>
>>> print(con.convert(tnt))

yoM RB
siMgala     JJ
skrIna      NNC
Wietara     NNP
ke  PSP
xarSakoM    NN
ko  PSP
agnipaWa    NNP
aXika       QF
nahIM       NEG
BA  VM
sakI        VAUX
.   SYM

>>>

6. work with bio:

same as tnt or conll

7. work with ssf:

implemented for ssf files (intra, inter and nested)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wxconv-1.0.0.0.tar.gz (553.8 kB view details)

Uploaded Source

Built Distribution

wxconv-1.0.0.0-py2.py3-none-any.whl (5.8 MB view details)

Uploaded Python 2Python 3

File details

Details for the file wxconv-1.0.0.0.tar.gz.

File metadata

  • Download URL: wxconv-1.0.0.0.tar.gz
  • Upload date:
  • Size: 553.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/0.12.11 CPython/2.7.12 Linux/4.15.0-43-generic

File hashes

Hashes for wxconv-1.0.0.0.tar.gz
Algorithm Hash digest
SHA256 9e18cf42e7bc8145c3c42d54a57b45329f18fd407a3d3aeba4d00b0131f8a715
MD5 a06ad9751ec7575e2b01d54202614889
BLAKE2b-256 8042d29a46a49ab6c674adc2156cc2a1792ffe8feb70b122d3cc9adfad9c715a

See more details on using hashes here.

File details

Details for the file wxconv-1.0.0.0-py2.py3-none-any.whl.

File metadata

  • Download URL: wxconv-1.0.0.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 5.8 MB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/0.12.11 CPython/2.7.12 Linux/4.15.0-43-generic

File hashes

Hashes for wxconv-1.0.0.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 ecaf68ba1b07d9453b819a602e4e8e648b46240f6fc1cec9a3b8f6ca9281aa19
MD5 4fae3bcc715d4ec607d0bc4d9f9684a8
BLAKE2b-256 810048af3164406d379f2edc95414b58d0250f814e6bc23c554696691ae0e1a0

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page