nlcodec is a collection of encoding schemes for natural language sequences. nlcodec.db is an efficient storage and retrieval layer for integer sequences of varying lengths.
NLCodec
NOTE: The docs are available at https://isi-nlp.github.io/nlcodec
A set of (low-level) Natural Language Encoder-Decoders (codecs) that are useful in the preprocessing stage of an NLP pipeline. These codecs encode sequences at one of the following levels:
- Character
- Word
- BPE based subword
- Class
It provides Python APIs (embed it in your app) and CLI APIs (use it as a standalone tool).
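To build intuition for the granularity of these schemes, here is an illustrative plain-Python sketch (this is not nlcodec's API; the example tokens for the BPE and class cases are hypothetical):

```python
sentence = "hello world"

# Character level: every character (including spaces) is a token
chars = list(sentence)       # 11 tokens for this sentence

# Word level: whitespace-separated tokens
words = sentence.split()     # ['hello', 'world']

# BPE subword level (conceptually): frequent pieces become single tokens,
# while rare words break into smaller pieces, e.g. ['hell', 'o', 'wor', 'ld']

# Class level (conceptually): the whole sequence maps to a single label,
# e.g. a topic or language id
```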
There are many BPE implementations available already, but this one differs:
- Pure Python implementation that is easy to modify to try new ideas (other implementations require C++/Rust expertise to modify the core).
- An easily shareable and inspectable model file. It is a simple text file that can be inspected with `less` or `cut`. It includes information such as which pieces were put together and at what frequencies.
- Reasonably faster than the other pure Python implementations. Under the hood, it uses tries, doubly linked lists, max-heaps, hash maps, and other data structures to boost performance.
- PySpark backend for extracting term frequencies from large datasets.
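For readers unfamiliar with BPE, the core idea is to repeatedly merge the most frequent adjacent pair of symbols. A minimal, unoptimized sketch in plain Python follows; this is illustrative only and is not nlcodec's actual implementation, which uses the data structures mentioned above for speed:

```python
from collections import Counter

def learn_bpe(word_freqs, n_merges):
    """Toy BPE learner. word_freqs maps word -> corpus frequency.
    Returns the learned merge operations in order."""
    # Start with each word as a sequence of characters
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(n_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for syms, freq in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with its merged symbol
        new_vocab = {}
        for syms, freq in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

merges = learn_bpe({'low': 5, 'lower': 2, 'lowest': 2}, 3)
# 'l'+'o' is most frequent, then 'lo'+'w', then 'low'+'e'
```

This naive version recounts all pairs on every merge; the pure-Python speedups in nlcodec come from avoiding exactly that kind of rescanning.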
Installation
Please run only one of these:

# Install from pypi (preferred)
$ pip install nlcodec --ignore-installed

# Clone repo for development mode
$ git clone https://github.com/isi-nlp/nlcodec
$ cd nlcodec
$ pip install --editable .
The pip installer registers these CLI tools in your PATH:
- `nlcodec` -- CLI for learn, encode, decode. Same as `python -m nlcodec`
- `nlcodec-learn` -- CLI for learning BPE with the PySpark backend. Same as `python -m nlcodec.learn`
- `nlcodec-db` -- CLI for bitextdb. Same as `python -m nlcodec.bitextdb`
- `nlcodec-freq` -- CLI for extracting word and char frequencies using the Spark backend.
Docs are available at
- HTML format: https://isi-nlp.github.io/nlcodec (recommended)
- Locally at docs/intro.adoc
Citation
Refer to https://arxiv.org/abs/2104.00290 (ACL 2021 Demos):
@article{DBLP:journals/corr/abs-2104-00290,
author = {Thamme Gowda and
Zhao Zhang and
Chris A. Mattmann and
Jonathan May},
title = {Many-to-English Machine Translation Tools, Data, and Pretrained Models},
journal = {CoRR},
volume = {abs/2104.00290},
year = {2021},
url = {https://arxiv.org/abs/2104.00290},
archivePrefix = {arXiv},
eprint = {2104.00290},
timestamp = {Mon, 12 Apr 2021 16:14:56 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2104-00290.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Authors