Skip to main content

A suite of vectorizers for various data types.

Project description

Vectorizers Logo

Travis AppVeyor Codecov CircleCI ReadTheDocs

Vectorizers

There are a large number of machine learning tools for effectively exploring and working with data that is given as vectors (ideally with a defined notion of distance as well). There is also a large volume of data that does not come neatly packaged as vectors. It could be text data, variable length sequence data (either numeric or categorical), dataframes of mixed data types, sets of point clouds, or more. Usually, one way or another, such data can be wrangled into vectors in a way that preserves some relevant properties of the original data. This library seeks to provide a suite of a wide variety of general purpose techniques for such wrangling, making it easier and faster for users to get various kinds of unstructured sequence data into vector formats for exploration and machine learning.

Why use Vectorizers?

Data wrangling can be tedious, error-prone, and fragile when trying to integrate it into production pipelines. The vectorizers library aims to provide a set of easy to use tools for turning various kinds of unstructured sequence data into vectors. By following the scikit-learn transformer API we ensure that any of the vectorizer classes can be trivially integrated into existing sklearn workflows or pipelines. By keeping the vectorization approaches as general as possible (as opposed to specialising on very specific data types), we aim to ensure that a very broad range of data can be handled efficiently. Finally we aim to provide robust techniques with sound mathematical foundations over potentially more powerful but black-box approaches for greater transparency in data processing and transformation.

How to use Vectorizers

Quick start examples to be added soon …

For further examples on using this library for text we recommend checking out the documentation written up in the EasyData reproducible data science framework by some of our colleagues over at: https://github.com/hackalog/vectorizers_playground

Installing

Vectorizers is designed to be easy to install being a pure python module with relatively light requirements:

  • numpy

  • scipy

  • scikit-learn >= 0.22

  • numba >= 0.51

In the near future the package should be pip installable – check back for updates:

pip install vectorizers

To manually install this package:

wget https://github.com/TutteInstitute/vectorizers/archive/master.zip
unzip master.zip
rm master.zip
cd vectorizers-master
python setup.py install

Help and Support

This project is still young. The documentation is still growing. In the meantime please open an issue and we will try to provide any help and guidance that we can. Please also check the docstrings on the code, which provide some descriptions of the parameters.

Contributing

Contributions are more than welcome! There are lots of opportunities for potential projects, so please get in touch if you would like to help out. Everything from code to notebooks to examples and documentation are all equally valuable so please don’t feel you can’t contribute. We would greatly appreciate the contribution of tutorial notebooks applying vectorizer tools to diverse or interesting datasets. If you find vectorizers useful for your data please consider contributing an example showing how it can apply to the kind of data you work with!

To contribute please fork the project make your changes and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

License

The vectorizers package is 3-clause BSD licensed.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vectorizers-0.1.0.tar.gz (93.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

vectorizers-0.1.0-py3-none-any.whl (111.3 kB view details)

Uploaded Python 3

File details

Details for the file vectorizers-0.1.0.tar.gz.

File metadata

  • Download URL: vectorizers-0.1.0.tar.gz
  • Upload date:
  • Size: 93.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.8

File hashes

Hashes for vectorizers-0.1.0.tar.gz
Algorithm Hash digest
SHA256 c5636347248b6a20823bc1c3e933c786decfb4469531494ecc5855f3513da3e0
MD5 3134e49613834c01cc8734198d132814
BLAKE2b-256 a6daeaeeaa47a75931f99aa9b68749d011fc8235cff8b622edf6e973a79d2d7e

See more details on using hashes here.

File details

Details for the file vectorizers-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: vectorizers-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 111.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.8

File hashes

Hashes for vectorizers-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3e33213656cf362f8a075d4597ffaeb9db909499b6d9dd5156a2b7dcec12fd93
MD5 6e6b2ea5bb9f309024acfca5da390895
BLAKE2b-256 4da9ece04c13e26e9b1f6c69941c29dc00cae67dee741294df3c18771bd55d57

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page