vectorize some text
vecxx
C++ implementations of vectorizers to convert strings to integers. There are bindings available in Python and JS/TS.
This includes a straightforward C++ implementation of common approaches to vectorizing text as integers, a step most types of DNNs require to convert sentences into tensors. It also supports native subword BPE based on fastBPE, with additional support for preprocessing transforms of strings during decode, either as native functors or from the bound languages. Extra (special) tokens can also be passed through.
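To make the subword step concrete, here is a minimal pure-Python sketch of how ranked BPE merge rules split a single word. It is an illustration only, not the vecxx or fastBPE code: the function name apply_bpe and the merge table TOY_MERGES are hypothetical, and the "</w>" end-of-word and "@@" continuation markers are assumed to follow the fastBPE convention.

```python
from typing import Dict, List, Tuple

# Hypothetical merge table: lower rank = learned earlier = applied first.
TOY_MERGES: Dict[Tuple[str, str], int] = {
    ("l", "o"): 0,
    ("lo", "w"): 1,
    ("e", "r</w>"): 2,
}


def apply_bpe(word: str, merge_ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Split one word into subword units using ranked merge rules (toy sketch)."""
    if not word:
        return []
    # Start from single characters; tag the last one as end-of-word.
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    while len(symbols) > 1:
        # Collect every adjacent pair that has a learned merge, with its rank.
        ranked = [
            (merge_ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in merge_ranks
        ]
        if not ranked:
            break  # no learned merge applies any more
        _, i = min(ranked)  # best rank wins; ties go to the leftmost pair
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    # Drop the end-of-word tag and mark every non-final unit with "@@".
    units = [s.replace("</w>", "") for s in symbols]
    return [u + "@@" for u in units[:-1]] + units[-1:]


print(apply_bpe("lower", TOY_MERGES))  # ['low@@', 'er']
```

Each resulting unit is then looked up in the vocabulary to obtain its integer ID. A production implementation would also cache per-word results, which this sketch omits.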
Python bindings
The Python bindings are written with pybind11.
Using BPE vectorizer from Python
The example below converts a sentence to lower-case subword BPE tokens and maps them to integers from the vocabulary.
Note that a native Python string transform can be used to transform each token prior to subword tokenization.
Tokens from either the BPE vocab or special tokens (like <GO> and <EOS>) can be applied to the beginning and end of the sequence.
If a second argument is provided to convert_to_ids, it indicates the padded length required for the tensor.
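import os
from vecxx import *

# Assumed location of vocab.30k and codes.30k; adjust to your checkout.
TEST_DATA = "test_data"
bpe = BPEVocab(
    vocab_file=os.path.join(TEST_DATA, "vocab.30k"),
    codes_file=os.path.join(TEST_DATA, "codes.30k")
)
# Lower-case each token before BPE and wrap the sequence in <GO> ... <EOS>.
vec = VocabVectorizer(bpe, transform=str.lower, emit_begin_tok=["<GO>"], emit_end_tok=["<EOS>"])
padd_vec, unpadded_len = vec.convert_to_ids("My name is Dan . I am from Ann Arbor , Michigan , in Washtenaw County".split(), 256)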
The result of this will be:
[1, 30, 265, 14, 2566, 5, 8, 158, 63, 10940, 525, 18637, 7, 3685, 7, 18, 14242, 1685, 2997, 4719, 2, 0, ..., 0]
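The shape of that result (a begin token, the token IDs, an end token, then zero padding) can be reproduced with a small pure-Python sketch. This is not how vecxx is implemented and it skips BPE entirely. The function convert_to_ids_sketch, the vocabulary TOY_VOCAB, the <UNK> fallback, and the truncation of over-long input are all assumptions made for illustration. Only the IDs 1, 2 and 0 for <GO>, <EOS> and padding mirror the output above.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy vocabulary; real IDs come from the BPE vocab file.
TOY_VOCAB: Dict[str, int] = {
    "<PAD>": 0, "<GO>": 1, "<EOS>": 2, "<UNK>": 3,
    "my": 4, "name": 5, "is": 6,
}


def convert_to_ids_sketch(
    tokens: Sequence[str],
    vocab: Dict[str, int],
    max_len: int,
    transform: Callable[[str], str] = str.lower,
    begin: Sequence[str] = ("<GO>",),
    end: Sequence[str] = ("<EOS>",),
) -> Tuple[List[int], int]:
    """Return (padded IDs, unpadded length) for a whitespace-split sentence."""
    # Special tokens bypass the transform so "<GO>" is not lower-cased.
    pieces = list(begin) + [transform(t) for t in tokens] + list(end)
    # Assumption: unknown pieces map to <UNK>, over-long input is truncated.
    ids = [vocab.get(p, vocab["<UNK>"]) for p in pieces][:max_len]
    unpadded_len = len(ids)
    ids += [vocab["<PAD>"]] * (max_len - unpadded_len)
    return ids, unpadded_len


ids, n = convert_to_ids_sketch("My name is Dan".split(), TOY_VOCAB, 8)
print(ids, n)  # [1, 4, 5, 6, 3, 2, 0, 0] 6

# The unpadded length is enough to build a padding mask for a model.
mask = [1] * n + [0] * (len(ids) - n)
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Returning the unpadded length alongside the padded list lets downstream code build a mask like the one above without scanning for pad IDs.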
JS/TS bindings
The JavaScript bindings are built on Node-API.
A thin TypeScript wrapper provides a typed API that closely matches the underlying (and Python) APIs.
Using BPE vectorizer from TypeScript
import { BPEVocab, VocabVectorizer } from 'vecxx';
import { join } from 'path';
const testDir = join(__dirname, 'test_data');
const bpe = new BPEVocab(join(testDir, 'vocab.30k'), join(testDir, 'codes.30k'));
const vectorizer = new VocabVectorizer(bpe, {
    transform: (s: string) => s.toLowerCase(),
    emitBeginToken: ['<GO>'],
    emitEndToken: ['<EOS>']
});
const sentence = `My name is Dan . I am from Ann Arbor , Michigan , in Washtenaw County`;
const { ids, size } = vectorizer.convertToIds(sentence.split(/\s+/), 256);
Docker
Sample Dockerfiles are provided that can be used for sandbox development/testing.
docker build -t vecxx-python -f py.Dockerfile .
docker run -it vecxx-python
docker build -t vecxx-node -f node.Dockerfile .
docker run -it vecxx-node
# ...
var vecxx = require('dist/index.js')