Vectorize some text

vecxx

Fast C++ implementations of vectorizers for machine learning models, with bindings in Python and JS/TS.

This includes a straightforward, cross-platform C++ implementation of common approaches to vectorization to integers, which most types of DNNs require in order to convert sentences into tensors. It also supports native subword BPE based on fastBPE, with additional support for preprocessing transforms of strings during decode, either as native functors or from the bound languages. Extra (special) tokens can also be passed through.

For BPE, the library supports execution of models that were trained either using subword-nmt or fastBPE (or any other system which shares the file format).
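To make the idea of "executing" a trained BPE model concrete, here is a minimal pure-Python sketch of how ranked merge rules are applied to a single word. It is an illustration only, not the vecxx implementation: the function name apply_bpe and the toy merge list are hypothetical, and it assumes the subword-nmt/fastBPE conventions of an end-of-word marker (</w>) and "@@" continuation markers on non-final pieces.

```python
def apply_bpe(word, merges):
    """Greedily merge the adjacent symbol pair with the best (lowest) rank.

    merges is an ordered list of (left, right) pairs, highest priority first,
    as would be read from a codes file. Simplified sketch, not vecxx code.
    """
    if not word:
        return []
    ranks = {pair: i for i, pair in enumerate(merges)}
    # Split into characters and tag the last one with the end-of-word marker.
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    while len(symbols) > 1:
        # Rank every adjacent pair; unknown pairs get an infinite rank.
        candidates = [
            (ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no applicable merge rule is left
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    # Non-final pieces carry "@@"; the end-of-word marker is stripped.
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1].replace("</w>", "")]


# Hypothetical toy merge table, in priority order.
toy_merges = [("l", "o"), ("lo", "w</w>"), ("e", "r</w>")]
print(apply_bpe("low", toy_merges))
print(apply_bpe("lower", toy_merges))
```

With these toy rules, a word that is fully covered by the merges comes out as one piece, while a word that is only partly covered is split into several "@@"-marked pieces.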

To support fast load times, vecxx vocabs can be compiled to a perfect hash and persisted to disk. When the objects are constructed from the compiled directory, loading uses memory mapping. For a BPE vocab, a non-compiled load that takes 24 milliseconds may take only 200 microseconds after compilation.

Handling of prefix and suffix tokens

The API supports passing in a vector of tokens to prefix the data and a vector of tokens to suffix it. These are passed as a list ([]) in Python and as a std::vector<std::string> in C++. The prefix tokens will always be present. However, when converting to integer ids, if the output buffer is presized and too short to hold all of the tokens, the library currently does not overwrite the last values in the buffer with the end tokens. The rationale is that an end suffix typically marks the end of a sentence or utterance, and since the result is truncated, the end was not actually reached. Future versions of the library may change this.
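The truncation rule described above can be sketched in a few lines of pure Python. This is a stand-in for illustration, not vecxx code: the helper name emit_ids, the pad value of 0, and the way the returned length is computed are all assumptions.

```python
def emit_ids(ids, begin, end, max_len=None, pad=0):
    """Mimic the described behavior: prefix always kept, suffix lost on truncation.

    Hypothetical helper for illustration only; not part of the vecxx API.
    """
    out = list(begin) + list(ids) + list(end)
    if max_len is None:
        return out, len(out)
    if len(out) > max_len:
        # The buffer is too short: the tail is cut off and the end tokens
        # are NOT written over the last positions.
        return out[:max_len], max_len
    length = len(out)
    # The buffer is long enough: pad out to the requested size.
    return out + [pad] * (max_len - length), length


GO, EOS = 1, 2  # toy ids for the begin and end tokens
print(emit_ids([5, 6, 7], [GO], [EOS], max_len=8))
print(emit_ids([5, 6, 7], [GO], [EOS], max_len=3))
```

In the second call the buffer holds only three values, so the result starts with the begin token but contains no end token.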

Handling of multiple input sentences

The library supports converting multiple sentences into a "stack" of tokens with a fixed output size. This may be useful for batching or using N-best lists as input. The results from the ID stack are contiguous 1D arrays. They will typically be reshaped in user code:

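import numpy as np

# text: list of N-best strings; vec: a VocabVectorizer; args.mxlen: max tokens per sentence
nbests = [t.split() for t in text]
ids, lengths = vec.convert_to_ids_stack(nbests, len(nbests)*args.mxlen)
# ids comes back as a flat 1D sequence; reshape it to (1, number of sentences, width)
ids = np.array(ids).reshape(1, len(nbests), -1)
example = {args.feature: ids, 'lengths': np.array(lengths).reshape(1, -1)}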
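The layout of the contiguous buffer can also be pictured without NumPy. The sketch below assumes, as the reshape above suggests, that each sentence occupies an equal-width slot in the flat buffer and that one length is reported per sentence. The helper name unstack and the toy values are hypothetical.

```python
def unstack(flat_ids, lengths, n_rows):
    """Split a flat id buffer into n_rows equal-width rows, then trim padding.

    Assumes equal-width slots per sentence; illustration only, not vecxx code.
    """
    if n_rows <= 0 or len(flat_ids) % n_rows != 0:
        raise ValueError("buffer size must be a positive multiple of the row count")
    width = len(flat_ids) // n_rows
    rows = [flat_ids[i * width:(i + 1) * width] for i in range(n_rows)]
    # Keep only the unpadded part of each row.
    return [row[:n] for row, n in zip(rows, lengths)]


# Two toy sentences packed into a buffer of 2 * 6 ids, zero-padded.
flat = [1, 5, 6, 2, 0, 0,
        1, 7, 2, 0, 0, 0]
print(unstack(flat, lengths=[4, 3], n_rows=2))
```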

C++

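#include <iostream>
#include <string>
#include <tuple>
#include <vector>

#include "vecxx/vecxx.h"

std::vector<std::string> SENTENCE = {
    "My", "name", "is", "Dan", ".", "I", "am", "from", "Ann", "Arbor",
    ",", "Michigan", ",", "in", "Washtenaw", "County"
};

int main(int argc, char** argv) {
    // Expect the vocab file and the codes file on the command line.
    if (argc < 3) {
        std::cerr << "usage: " << argv[0] << " <vocab_file> <codes_file>" << std::endl;
        return 1;
    }
    std::string vocab_file(argv[1]);
    std::string codes_file(argv[2]);
    auto v = new BPEVocab(vocab_file, codes_file);
    VocabVectorizer vec(v);
    // Print the subword pieces for the sentence.
    for (auto p : vec.convert_to_pieces(SENTENCE)) {
        std::cout << p << std::endl;
    }

    // Convert the sentence to integer ids; l receives the returned length.
    std::vector<int> ids;
    int l;
    std::tie(ids, l) = vec.convert_to_ids(SENTENCE);
    for (auto p : ids) {
        std::cout << p << std::endl;
    }

    delete v;
    return 0;
}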

Vocab compilation

The initial load of the vocab is typically relatively fast, as it just reads in text files. However, if we need much lower latency, we can optionally compile the internal data structures to memory-mapped perfect hashes.

Once compiled, pass the compiled target folder in as both the vocab_file and codes_file.

  auto v = new BPEVocab(vocab_file, codes_file);
  v->compile_vocab(compiled_dir);
  delete v;
  return 0;

Python bindings

The Python bindings are written with pybind11.

You can install the Python bindings using pip:

pip install vecxx

Using BPE vectorizer from Python

The example below converts a sentence to lower-case subword BPE tokens, represented as integers from the vocabulary. A native Python string transform can be applied to each token prior to subword tokenization. Tokens from either the BPE vocab or the special tokens (like <GO> and <EOS>) can be added to the beginning and end of the sequence. If a second argument is provided to convert_to_ids, it indicates the padded length required for the tensor.

    import os
    from vecxx import *

    # TEST_DATA is the directory containing the vocab and codes files
    bpe = BPEVocab(
        vocab_file=os.path.join(TEST_DATA, "vocab.30k"),
        codes_file=os.path.join(TEST_DATA, "codes.30k")
    )
    vec = VocabVectorizer(bpe, transform=str.lower, emit_begin_tok=["<GO>"], emit_end_tok=["<EOS>"])
    padd_vec, unpadded_len = vec.convert_to_ids("My name is Dan . I am from Ann Arbor , Michigan , in Washtenaw County".split(), 256)

The result of this will be:

[1, 30, 265, 14, 2566, 5, 8, 158, 63, 10940, 525, 18637, 7, 3685, 7, 18, 14242, 1685, 2997, 4719, 2, 0, ..., 0]
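As a rough illustration of how the padded vector and the unpadded length might be consumed downstream, the sketch below rebuilds a vector shaped like the one above and derives a padding mask from it. It rests on two assumptions that the example does not state explicitly: that the elided part of the result is zeros out to the requested length of 256, and that the returned length counts the begin and end tokens.

```python
# The non-zero ids shown in the result above.
ids = [1, 30, 265, 14, 2566, 5, 8, 158, 63, 10940, 525, 18637,
       7, 3685, 7, 18, 14242, 1685, 2997, 4719, 2]

padded_len = 256
# Assumption: the remainder of the buffer is zero padding.
padd_vec = ids + [0] * (padded_len - len(ids))
# Assumption: the reported length includes the begin and end tokens.
unpadded_len = len(ids)

# 1 marks real tokens, 0 marks padding.
mask = [1 if i < unpadded_len else 0 for i in range(padded_len)]
# Drop the padding again when only the real tokens are needed.
trimmed = padd_vec[:unpadded_len]

print(len(padd_vec), sum(mask), trimmed[0], trimmed[-1])
```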

Vocab compilation

As in the C++ example, we can optionally compile the internal data structures to memory-mapped perfect hashes. Once compiled they can be read in the same way.

>>> import vecxx
>>> b = vecxx.BPEVocab('/data/reddit/vocab.30k', '/data/reddit/codes.30k', extra_tokens=["[CLS]", "[MASK]"])
>>> b.compile_vocab('blah')
>>> b2 = vecxx.BPEVocab('blah', 'blah')

JS/TS bindings

The JavaScript bindings are provided through Node-API.

A thin TypeScript wrapper provides a typed API that closely matches the underlying (and Python) APIs.

Using BPE vectorizer from TypeScript

import { BPEVocab, VocabVectorizer } from 'vecxx';
import { join } from 'path';

const testDir = join(__dirname, 'test_data');
const bpe = new BPEVocab(join(testDir, 'vocab.30k'), join(testDir, 'codes.30k'));
const vectorizer = new VocabVectorizer(bpe, {
    transform: (s: string) => s.toLowerCase(),
    emitBeginToken: ['<GO>'],
    emitEndToken: ['<EOS>']
});
const sentence = `My name is Dan . I am from Ann Arbor , Michigan , in Washtenaw County`;
const { ids, size } = vectorizer.convertToIds(sentence.split(/\s+/), 256);

Docker

Sample Dockerfiles are provided that can be used for sandbox development/testing.

docker build -t vecxx-python -f py.Dockerfile .
docker run -it vecxx-python
docker build -t vecxx-node -f node.Dockerfile .
docker run -it vecxx-node
# ...
var vecxx = require('dist/index.js')
