Vectorize some text
vecxx
Fast C++ implementations of vectorizers for machine learning models, with bindings in Python and JS/TS.
This includes a straightforward, cross-platform C++ implementation of common approaches to vectorization to integers, which most types of DNNs require in order to convert sentences into tensors. It also supports native subword BPE based on fastBPE, with additional support for preprocessing transforms of strings during decode, either as native functors or from the bound languages. Extra (special) tokens can also be passed through.
For BPE, the library supports execution of models that were trained either using subword-nmt or fastBPE (or any other system which shares the file format).
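To make the role of the codes file concrete, here is a minimal pure-Python sketch of how a list of learned merges can be applied to a single word. It is not the library's implementation. It assumes each codes line starts with the two symbols to merge, listed in priority order, that a leading `#version` header may be present, and that `</w>` marks the end of a word and `@@` marks a non-final piece. The names `load_ranks` and `apply_bpe` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

END_OF_WORD = "</w>"  # assumed end-of-word marker in the codes file
CONTINUATION = "@@"   # assumed marker appended to non-final pieces


def load_ranks(lines: Iterable[str]) -> Dict[Tuple[str, str], int]:
    """Map each merge pair to its priority (0 = applied first)."""
    ranks: Dict[Tuple[str, str], int] = {}
    for line in lines:
        fields = line.split()
        # Skip an optional version header and malformed lines.
        if line.startswith("#version") or len(fields) < 2:
            continue
        # Only the first two fields are used; a trailing count is ignored.
        ranks.setdefault((fields[0], fields[1]), len(ranks))
    return ranks


def apply_bpe(word: str, ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Greedily merge adjacent symbols, best-ranked pair first."""
    if not word:
        return []
    symbols = list(word[:-1]) + [word[-1] + END_OF_WORD]
    while len(symbols) > 1:
        # Rank every adjacent pair; unknown pairs get an infinite rank.
        pairs = [(ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no learned merge applies any more
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    # The last symbol always carries the end-of-word marker; strip it.
    symbols[-1] = symbols[-1][:-len(END_OF_WORD)]
    return [s + CONTINUATION for s in symbols[:-1]] + [symbols[-1]]


ranks = load_ranks(["#version: 0.2", "l o", "lo w</w>", "e r</w>"])
print(apply_bpe("low", ranks))    # ['low']
print(apply_bpe("lower", ranks))  # ['lo@@', 'w@@', 'er']
```

With these three toy merges, "low" collapses to a single piece, while "lower" stops at three pieces because no merge joins "lo" with a non-final "w".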
To support fast load times, vecxx vocabs can be compiled to a perfect hash and persisted to disk. When the objects are constructed from the compiled directory, loading uses memory mapping. For a BPE vocab, a non-compiled load that takes 24 milliseconds may take only 200 microseconds after compilation.
Handling of prefix and suffix tokens
The API supports passing in one vector of tokens to prefix the data and another to suffix it. These are passed as a list ([]) in Python and as a std::vector<std::string> in C++. The prefix tokens will always be present. However, when converting to integer ids, if the output buffer is presized and is too short to hold all of the tokens, the library currently will not overwrite the last values in the buffer with the end tokens. The rationale is that an end suffix typically marks an end-of-sentence or end-of-utterance, and since the result is truncated, the end was not actually reached. Future versions of the library may change this.
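The truncation behavior described above can be sketched in a few lines of pure Python. This is an illustration of the semantics only, not the library's code. The function name `to_padded_ids`, the dictionary vocab, and the `pad_id` and `unk_id` defaults are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def to_padded_ids(
    tokens: Sequence[str],
    vocab: Dict[str, int],
    max_len: int,
    begin: Sequence[str] = (),
    end: Sequence[str] = (),
    pad_id: int = 0,  # assumed padding id
    unk_id: int = 3,  # assumed id for unknown tokens
) -> Tuple[List[int], int]:
    """Return (ids padded to max_len, number of real ids)."""
    pieces = list(begin) + list(tokens) + list(end)
    # Truncate from the right: the prefix survives, and a suffix that
    # does not fit is dropped rather than written over the last slots.
    ids = [vocab.get(p, unk_id) for p in pieces][:max_len]
    length = len(ids)
    return ids + [pad_id] * (max_len - length), length


vocab = {"<GO>": 1, "<EOS>": 2, "a": 4, "b": 5}
print(to_padded_ids(["a", "b"], vocab, 6, ["<GO>"], ["<EOS>"]))  # ([1, 4, 5, 2, 0, 0], 4)
print(to_padded_ids(["a", "b"], vocab, 3, ["<GO>"], ["<EOS>"]))  # ([1, 4, 5], 3)
```

In the second call the buffer is too short, so the `<EOS>` id is absent from the output, which signals that the sequence was cut before its end.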
Handling of multiple input sentences
The library supports converting multiple sentences into a "stack" of tokens with a fixed output size. This may be useful for batching or using N-best lists as input. The results from the ID stack are contiguous 1D arrays. They will typically be reshaped in user code:
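import numpy as np

# `text` is a list of sentences, `vec` a VocabVectorizer, `args` the parsed CLI options
nbests = [t.split() for t in text]
# One contiguous buffer with args.mxlen slots per sentence
ids, lengths = vec.convert_to_ids_stack(nbests, len(nbests)*args.mxlen)
ids = np.array(ids).reshape(1, len(nbests), -1)
example = {args.feature: ids, 'lengths': np.array(lengths).reshape(1, -1)}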
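For readers without NumPy at hand, the layout of the contiguous buffer can be shown in plain Python. The sketch below assumes that every sentence occupies a fixed-size slot of `row_len` ids and that `lengths[i]` is the number of real (unpadded) ids in slot `i`. The helper name `unstack` is hypothetical.

```python
from typing import List, Sequence


def unstack(flat_ids: Sequence[int], lengths: Sequence[int], row_len: int) -> List[List[int]]:
    """Split a flat ID stack into one list of real ids per sentence."""
    if len(flat_ids) != len(lengths) * row_len:
        raise ValueError("buffer size must equal len(lengths) * row_len")
    rows = [list(flat_ids[i * row_len:(i + 1) * row_len]) for i in range(len(lengths))]
    # Drop the padding at the end of each row using the reported length.
    return [row[:n] for row, n in zip(rows, lengths)]


flat = [1, 4, 2, 0,
        1, 5, 6, 2]
print(unstack(flat, [3, 4], row_len=4))  # [[1, 4, 2], [1, 5, 6, 2]]
```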
C++
#include <iostream>
#include <string>
#include <tuple>
#include <vector>
#include "vecxx/vecxx.h"

std::vector<std::string> SENTENCE = {
    "My", "name", "is", "Dan", ".", "I", "am", "from", "Ann", "Arbor",
    ",", "Michigan", ",", "in", "Washtenaw", "County"
};

int main(int argc, char** argv) {
    if (argc < 3) {
        std::cerr << "usage: " << argv[0] << " <vocab_file> <codes_file>" << std::endl;
        return 1;
    }
    std::string vocab_file(argv[1]);
    std::string codes_file(argv[2]);
    auto v = new BPEVocab(vocab_file, codes_file);
    VocabVectorizer vec(v);
    // Print the subword pieces
    for (auto p : vec.convert_to_pieces(SENTENCE)) {
        std::cout << p << std::endl;
    }
    // Print the integer ids; l receives the unpadded length
    std::vector<int> ids;
    int l;
    std::tie(ids, l) = vec.convert_to_ids(SENTENCE);
    for (auto p : ids) {
        std::cout << p << std::endl;
    }
    delete v;
    return 0;
}
Vocab compilation
The initial load of the vocab is typically relatively fast, as it just reads in text files. However, if we need much lower latency, we can optionally compile the internal data structures to memory-mapped perfect hashes.
Once compiled, pass the compiled target folder in as both the vocab_file and codes_file.
auto v = new BPEVocab(vocab_file, codes_file);
v->compile_vocab(compiled_dir);
delete v;
return 0;
Python bindings
The Python bindings are written with pybind11.
You can install the Python bindings using pip:
pip install vecxx
Using BPE vectorizer from Python
This example converts a sentence to lower-case subword BPE tokens, represented as integers from the vocabulary.
Note that a native Python string transform can be used to transform each token prior to subword tokenization.
Tokens from either the BPE vocab or special tokens (like <GO> and <EOS>) can be applied to the beginning and end of the sequence.
If a second argument is provided to convert_to_ids, it indicates the padded length required for the tensor.
import os
from vecxx import *

# TEST_DATA is the directory that holds the vocab and codes files
bpe = BPEVocab(
    vocab_file=os.path.join(TEST_DATA, "vocab.30k"),
    codes_file=os.path.join(TEST_DATA, "codes.30k")
)
vec = VocabVectorizer(bpe, transform=str.lower, emit_begin_tok=["<GO>"], emit_end_tok=["<EOS>"])
padd_vec, unpadded_len = vec.convert_to_ids("My name is Dan . I am from Ann Arbor , Michigan , in Washtenaw County".split(), 256)
The result of this will be:
[1, 30, 265, 14, 2566, 5, 8, 158, 63, 10940, 525, 18637, 7, 3685, 7, 18, 14242, 1685, 2997, 4719, 2, 0, ..., 0]
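Because the call returns both a padded vector and the unpadded length, downstream code can tell real tokens from padding without inspecting the ids. A minimal sketch of building a 0/1 mask from those two values follows. The helper name `length_mask` is hypothetical and not part of vecxx.

```python
from typing import List


def length_mask(padded_len: int, unpadded_len: int) -> List[int]:
    """Return 1 for positions holding real tokens and 0 for padding."""
    # Clamp so that an out-of-range length cannot produce a bad mask.
    n = max(0, min(unpadded_len, padded_len))
    return [1] * n + [0] * (padded_len - n)


print(length_mask(6, 4))  # [1, 1, 1, 1, 0, 0]
```

Applied to the example above, `length_mask(len(padd_vec), unpadded_len)` would mark every position up to and including the `<EOS>` id.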
Vocab compilation
As in the C++ example, we can optionally compile the internal data structures to memory-mapped perfect hashes. Once compiled they can be read in the same way.
>>> import vecxx
>>> b = vecxx.BPEVocab('/data/reddit/vocab.30k', '/data/reddit/codes.30k', extra_tokens=["[CLS]", "[MASK]"])
>>> b.compile_vocab('blah')
>>> b2 = vecxx.BPEVocab('blah', 'blah')
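The compile-once, load-many pattern can be wrapped in a small helper. The sketch below uses only the two calls shown above, the two-argument constructor and `compile_vocab`. The vocab class is passed in as a parameter so the helper does not import vecxx itself. The name `load_vocab` is hypothetical, and treating any non-empty directory as a finished compilation is an assumption about the on-disk layout.

```python
import os
from typing import Any, Callable


def load_vocab(vocab_cls: Callable[[str, str], Any], vocab_file: str,
               codes_file: str, compiled_dir: str) -> Any:
    """Load from compiled_dir if it exists, else load text files and compile."""
    # Assumption: a non-empty directory means a previous compile finished.
    if os.path.isdir(compiled_dir) and os.listdir(compiled_dir):
        # The compiled folder is passed as both arguments.
        return vocab_cls(compiled_dir, compiled_dir)
    vocab = vocab_cls(vocab_file, codes_file)
    # Persist the compiled form so the next start can take the fast path.
    vocab.compile_vocab(compiled_dir)
    return vocab
```

A call might look like `load_vocab(vecxx.BPEVocab, '/data/reddit/vocab.30k', '/data/reddit/codes.30k', 'blah')`. The first run pays the text-loading cost, and later runs take the memory-mapped path.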
JS/TS bindings
The JavaScript bindings are built on Node-API.
A thin TypeScript wrapper provides a typed API that closely matches the underlying (and Python) APIs.
Using BPE vectorizer from TypeScript
import { BPEVocab, VocabVectorizer } from 'vecxx';
import { join } from 'path';
const testDir = join(__dirname, 'test_data');
const bpe = new BPEVocab(join(testDir, 'vocab.30k'), join(testDir, 'codes.30k'));
const vectorizer = new VocabVectorizer(bpe, {
    transform: (s: string) => s.toLowerCase(),
    emitBeginToken: ['<GO>'],
    emitEndToken: ['<EOS>']
});
const sentence = `My name is Dan . I am from Ann Arbor , Michigan , in Washtenaw County`;
const { ids, size } = vectorizer.convertToIds(sentence.split(/\s+/), 256);
Docker
Sample Dockerfiles are provided that can be used for sandbox development/testing.
docker build -t vecxx-python -f py.Dockerfile .
docker run -it vecxx-python
docker build -t vecxx-node -f node.Dockerfile .
docker run -it vecxx-node
# ...
var vecxx = require('dist/index.js')