Normalizer for Bengali / English text.

normalizer

This Python module is an easy-to-use port of the text normalization used in the paper "Not low-resource anymore: Aligner ensembling, batch filtering, and new datasets for Bengali-English machine translation". It is intended for normalizing and cleaning Bengali and English text.

Installation

$ pip install git+https://github.com/csebuetnlp/normalizer

Usage

from normalizer import normalize
input_text = """your input text"""
normalized_text = normalize(
    input_text,
    unicode_norm="NFKC",          # Unicode normalization form to apply (default "NFKC")
    punct_replacement=None,       # optional string or callable used to replace punctuation (default `None`, i.e. no replacement)
    url_replacement=None,         # optional string or callable used to replace URLs (default `None`, i.e. no replacement)
    emoji_replacement=None,       # optional string or callable used to replace emojis (default `None`, i.e. no replacement)
    apply_unicode_norm_last=True  # if True, apply Unicode normalization after the rule-based replacements; if False, before them (default True)
)
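
To make the parameters above more concrete, here is a minimal sketch using only the Python standard library. It is not the module's implementation: `sketch_normalize` and `URL_PATTERN` are hypothetical names, and the URL pattern is a deliberately simple assumption. It shows what a Unicode normalization form does to Bengali text, how a replacement given as a string or as a callable could be applied, and where `apply_unicode_norm_last` would change the order of the steps.

```python
import re
import unicodedata

# Hypothetical, deliberately simple URL pattern (an assumption for this sketch);
# the module's real matching rules may differ.
URL_PATTERN = re.compile(r"https?://\S+")


def sketch_normalize(text, unicode_norm="NFKC", url_replacement=None,
                     apply_unicode_norm_last=True):
    """Illustrative stand-in for normalize(); not the module's implementation."""
    if not apply_unicode_norm_last:
        # Normalize first, then run the rule-based replacement
        text = unicodedata.normalize(unicode_norm, text)
    if url_replacement is not None:
        # re.sub accepts either a replacement string or a callable
        # that receives the match object and returns a string
        text = URL_PATTERN.sub(url_replacement, text)
    if apply_unicode_norm_last:
        # Run the rule-based replacement first, then normalize
        text = unicodedata.normalize(unicode_norm, text)
    return text


# Two vowel signs (U+09C7, U+09BE) compose into the single sign U+09CB
composed = sketch_normalize("\u0995\u09c7\u09be")

# Replacement given as a plain string
tagged = sketch_normalize("see https://example.com now", url_replacement="<URL>")

# Replacement given as a callable that inspects the matched URL
masked = sketch_normalize(
    "see https://example.com now",
    url_replacement=lambda m: "[" + str(len(m.group(0))) + " chars]",
)
```

A callable is useful when the replacement depends on the matched text itself, while a string is enough for a fixed placeholder token.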

License

Contents of this repository are restricted to non-commercial research purposes only under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Citation

If you use this module in your work, please cite the following paper:

@inproceedings{hasan-etal-2020-low,
    title = "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for {B}engali-{E}nglish Machine Translation",
    author = "Hasan, Tahmid  and
      Bhattacharjee, Abhik  and
      Samin, Kazi  and
      Hasan, Masum  and
      Basak, Madhusudan  and
      Rahman, M. Sohel  and
      Shahriyar, Rifat",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.207",
    doi = "10.18653/v1/2020.emnlp-main.207",
    pages = "2612--2623",
    abstract = "Despite being the seventh most widely spoken language in the world, Bengali has received much less attention in machine translation literature due to being low in resources. Most publicly available parallel corpora for Bengali are not large enough; and have rather poor quality, mostly because of incorrect sentence alignments resulting from erroneous sentence segmentation, and also because of a high volume of noise present in them. In this work, we build a customized sentence segmenter for Bengali and propose two novel methods for parallel corpus creation on low-resource setups: aligner ensembling and batch filtering. With the segmenter and the two methods combined, we compile a high-quality Bengali-English parallel corpus comprising of 2.75 million sentence pairs, more than 2 million of which were not available before. Training on neural models, we achieve an improvement of more than 9 BLEU score over previous approaches to Bengali-English machine translation. We also evaluate on a new test set of 1000 pairs made with extensive quality control. We release the segmenter, parallel corpus, and the evaluation set, thus elevating Bengali from its low-resource status. To the best of our knowledge, this is the first ever large scale study on Bengali-English machine translation. We believe our study will pave the way for future research on Bengali-English machine translation as well as other low-resource languages. Our data and code are available at https://github.com/csebuetnlp/banglanmt.",
}

Download files

Source Distribution

csebuetnlp_normalizer-1.0.0.tar.gz (7.6 kB view details)

Uploaded Source

Built Distribution

csebuetnlp_normalizer-1.0.0-py3-none-any.whl (7.0 kB view details)

Uploaded Python 3

File details

Details for the file csebuetnlp_normalizer-1.0.0.tar.gz.

File metadata

  • Download URL: csebuetnlp_normalizer-1.0.0.tar.gz
  • Upload date:
  • Size: 7.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for csebuetnlp_normalizer-1.0.0.tar.gz
Algorithm Hash digest
SHA256 26d7bfbca19dbc55eb9ad55275daf2a2a146adc4db0287769934dfafd662ac29
MD5 5ff04a6ea7031885fec037d66f5d110e
BLAKE2b-256 14f932501ccab047587a2b8ea6cec9de38409d1d346cdc60843e786735faaf13

File details

Details for the file csebuetnlp_normalizer-1.0.0-py3-none-any.whl.

File hashes

Hashes for csebuetnlp_normalizer-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9e5eee65ea449573dc9f8a703fe97e5a08ab2f4f937315f83e070e91809bce0c
MD5 fa1fdcc8444416e9b25da7994e6e2148
BLAKE2b-256 72fd25b818fdb824923941d145e9f3a310223bed0a3ce40d5d5011c54b4929d7
