hash tables for vectorizing text-based documents
If you have documents with class labels, and you want to create numeric representations of those documents that maximize inter-class differences, then this package is for you. This package provides vectorizing hash tables that quickly transform your text, optimizing for the maximum distance between document-class vectors.
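The general idea behind vectorizing hash tables is feature hashing: each token is hashed to an index of a fixed-length vector, so any document maps to a numeric array without building a vocabulary first. The sketch below is a toy illustration of that idea in plain Python, not this package's actual (C++) implementation; the function name and feature count are invented for the example.

```python
import hashlib

def hash_vectorize(doc: str, n_features: int = 8) -> list[float]:
    """Toy feature hashing: map a document to a fixed-length count vector."""
    vec = [0.0] * n_features
    for token in doc.lower().split():
        # Stable hash: reduce the token's md5 digest modulo the vector length
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % n_features
        vec[idx] += 1.0
    return vec

docs = ['hi, my name is Mike', 'hi, my name is George']
vectors = [hash_vectorize(d) for d in docs]
```

Every document, regardless of length, lands in the same `n_features`-dimensional space; vhash additionally uses the class labels during `fit` to shape that space for class separation.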
This project has a C++ backend with a Python interface, allowing for maximum speed and interoperability.
Installation
To install, use pip:

pip install vhash
Minimum Working Example
This package follows the conventions of scikit-learn to provide an intuitive and familiar interface.
from typing import Any
from nptyping import NDArray
from vhash import VHash
# sample documents: each item in the list is its own document
docs = [
'hi, my name is Mike',
'hi, my name is George',
'hello, my name is Mike',
]
# class labels for each document
labels = [1, 0, 1]
# create & train model
vhash = VHash().fit(docs, labels)
# create numeric representation (2D float array)
numeric: NDArray[(Any, Any), float] = vhash.transform(docs)
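Because transform returns a 2D float array, its output can feed directly into any scikit-learn estimator. The sketch below uses a stand-in array in place of the transformed documents (the numbers are invented for illustration, not real vhash output):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for vhash.transform(docs): one row per document
numeric = np.array([[0.1, 2.0], [1.5, 0.2], [0.0, 2.2]])
labels = [1, 0, 1]

# Any scikit-learn classifier accepts this 2D float array directly
clf = LogisticRegression().fit(numeric, labels)
preds = clf.predict(numeric)
```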
Metrics
To see how this text transformer compares to BERT, check out the sample notebook, where we show how vhash outperforms sBERT on a sentiment analysis task.
Documentation
To get started, check out the docs!
If you will be contributing to this repo, check out the developer guide.