
A simple and effective tokenizer.

Project description

TokenEase

TokenEase is a versatile and efficient tokenizer designed to streamline the conversion of text into bag-of-words (BoW) vectors for natural language processing tasks. With its customizable options, it offers a smooth experience for developers and researchers alike. It is built on top of spaCy and scikit-learn's CountVectorizer, and is designed to be easy to use and to integrate into existing projects. It currently supports English only. If you are interested in contributing, please open a pull request or an issue.

Installation

Installation using pip:

pip install tokenease

Installation from source (easy to do if you want to contribute or modify the code!):

Ideally, create a virtual environment for Python 3.10+ and install Poetry. Then, from the repository root, install tokenease with Poetry:

poetry install

Usage

Here's a simple usage guide for the Pipe class.

The Pipe class is used to preprocess text data for natural language processing tasks. It provides a pipeline that can perform various transformations on the text data, such as removing accents, converting to lowercase, removing stop words, and tokenizing the text.

Here's a basic example of how to use the Pipe class:

from tokenease import Pipe

# Initialize the pipeline with the desired options
pipe = Pipe(strip_accents=True, lowercase=True, remove_stop_words=True)

# Fit the pipeline to your data and transform it into a bag-of-words representation
bow, docs = pipe.fit_transform(my_data)

# Transform new data using the fitted pipeline
new_bow, new_docs = pipe.transform(new_data)

# bow is a numpy array of shape (n_samples, n_features).
# docs is a list of strings, one per document, each in tokenized form with tokens joined by a separator.
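
As a rough illustration, here is a self-contained toy run; the corpus and the commented output below are hypothetical and only meant to show the expected interface:

from tokenease import Pipe

# A tiny hypothetical corpus (any list of raw strings should work)
my_data = [
    "The quick brown fox jumps over the lazy dog.",
    "A lazy dog sleeps all day.",
]

pipe = Pipe(strip_accents=True, lowercase=True, remove_stop_words=True)
bow, docs = pipe.fit_transform(my_data)

print(bow.shape)  # (2, n_features) -- one row per input document
print(docs[0])    # the first document as a separator-joined token string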

Saving and Loading the Pipeline

You can save the state of the pipeline to a file and load it later. This is useful if you want to reuse the same pipeline across multiple sessions or scripts.

Here's how you can do it:

# Save the pipeline
pipe.save('my_pipeline.joblib')

# Load the pipeline
loaded_pipe = Pipe.from_pretrained('my_pipeline.joblib')
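
Assuming from_pretrained restores the fitted state, the loaded pipeline can transform new data directly, using the vocabulary learned before saving (the input below is a hypothetical example):

# Transform new data with the loaded pipeline
more_bow, more_docs = loaded_pipe.transform(["another unseen document"])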

Accessing the Vocabulary

After fitting the pipeline to your data, you can access the vocabulary (i.e., the mapping from each unique token to its feature index, {token: index}) like this:

# Get the vocabulary
vocab = pipe.vocabulary
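
Since the bag-of-words columns line up with these indices, one common use is mapping feature indices back to tokens. A minimal sketch, assuming vocabulary is a plain {token: index} dict and bow is the array returned by fit_transform above:

import numpy as np

# Invert the {token: index} mapping to look up tokens by column index
index_to_token = {idx: tok for tok, idx in vocab.items()}

# Example: find the most frequent token across the corpus
counts = np.asarray(bow).sum(axis=0)
print(index_to_token[int(counts.argmax())])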

Please note that this is a basic guide; you may need to adjust it to the specific features and requirements of your project.

Download files

Download the file for your platform.

Source Distribution

tokenease-0.1.5.tar.gz (16.2 kB)

Uploaded Source

Built Distribution

tokenease-0.1.5-py3-none-any.whl (16.9 kB)

Uploaded Python 3

File details

Details for the file tokenease-0.1.5.tar.gz.

File metadata

  • Download URL: tokenease-0.1.5.tar.gz
  • Upload date:
  • Size: 16.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.10.13 Darwin/23.3.0

File hashes

Hashes for tokenease-0.1.5.tar.gz

  • SHA256: bddae34fce5f2f579a887bba828cb5932ef581551f8561cb54646fcccf962023
  • MD5: 2d13d7b2de22ee0cb5793a405946b52d
  • BLAKE2b-256: 1fa4d469b12e3742b89109386bcd498b5b09ed00b094f1700091f6a137cfcf47

File details

Details for the file tokenease-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: tokenease-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 16.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.10.13 Darwin/23.3.0

File hashes

Hashes for tokenease-0.1.5-py3-none-any.whl

  • SHA256: 52ccf2e3d05f4366ed62622206d3109680fc08e665dde51147da21991df8fdac
  • MD5: 3f104a6887856a8432e91a4a070ca87d
  • BLAKE2b-256: 6df2289439e4ba23083ca35d529d98cab6e9cbfbd19071135dd086d8a65a80d0
