TokenEase
A simple and effective tokenizer.
Project description
TokenEase is a versatile and efficient tokenizer designed to streamline converting text into bag-of-words (BoW) vectors for natural language processing tasks. With its customizable options, TokenEase provides a smooth experience for developers and researchers alike. It is built on top of spaCy and scikit-learn's CountVectorizer, and is designed to be easy to integrate into existing projects. It currently supports only English. If you are interested in contributing, please open an issue or make a pull request.
Installation
Installation using pip:
pip install tokenease
Installation from source (recommended if you want to contribute or modify the code):
Create a virtual environment for Python 3.10+ and install Poetry. Then, from the repository root, install tokenease with Poetry:
poetry install
Usage
Here's a simple usage guide for the Pipe class.
The Pipe class is used to preprocess text data for natural language processing tasks. It provides a pipeline that can perform various transformations on the text data, such as removing accents, converting to lowercase, removing stop words, and tokenizing the text.
Here's a basic example of how to use the Pipe class:
from tokenease import Pipe
# Initialize the pipeline with the desired options
pipe = Pipe(strip_accents=True, lowercase=True, remove_stop_words=True)
# Fit the pipeline to your data and transform it into a bag of words representation
bow, docs = pipe.fit_transform(my_data)
# Transform new data using the fitted pipeline
new_bow, new_docs = pipe.transform(new_data)
# bow is a numpy array of shape (n_samples, n_features).
# docs is a list of strings, one per document, with the tokens joined by a separator.
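To make the output shapes concrete, here is a minimal, self-contained sketch of a bag-of-words matrix built by hand. This is illustrative only, not the TokenEase implementation, and the example documents are made up:

```python
# Illustrative sketch: building a bag-of-words matrix by hand to show
# the (n_samples, n_features) shape described above.
docs = ["the cat sat", "the dog sat down"]

# Build the vocabulary: a mapping {token: column index}, sorted for determinism.
tokens = sorted({tok for doc in docs for tok in doc.split()})
vocab = {tok: i for i, tok in enumerate(tokens)}

# One row per document, one column per vocabulary token; entries are counts.
bow = [[doc.split().count(tok) for tok in tokens] for doc in docs]

print(vocab)  # {'cat': 0, 'dog': 1, 'down': 2, 'sat': 3, 'the': 4}
print(bow)    # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]] -> 2 samples, 5 features
```

TokenEase returns the same kind of matrix, but with spaCy-based preprocessing and scikit-learn's CountVectorizer doing the counting.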
Saving and Loading the Pipeline
You can save the state of the pipeline to a file and load it later. This is useful if you want to reuse the same pipeline across multiple sessions or scripts.
Here's how you can do it:
# Save the pipeline
pipe.save('my_pipeline.joblib')
# Load the pipeline
loaded_pipe = Pipe.from_pretrained('my_pipeline.joblib')
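The save/load round trip follows the usual pattern for persisting a fitted object. As a hedged sketch of that pattern (the `.joblib` extension suggests TokenEase persists with joblib, but this example uses the standard library's pickle, and `ToyPipe` is a hypothetical stand-in, not the real class):

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted pipeline; not the real Pipe class.
class ToyPipe:
    def __init__(self, lowercase=True):
        self.lowercase = lowercase

pipe = ToyPipe(lowercase=True)

# Save the fitted object's state to disk, then load it back later,
# e.g. in another session or script.
path = os.path.join(tempfile.mkdtemp(), "my_pipeline.pkl")
with open(path, "wb") as f:
    pickle.dump(pipe, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)

# The loaded object carries the same configuration as the original.
assert loaded.lowercase == pipe.lowercase
```

With the real class, `pipe.save(...)` and `Pipe.from_pretrained(...)` wrap this round trip for you, so the loaded pipeline can transform new data without refitting.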
Accessing the Vocabulary
After fitting the pipeline to your data, you can access the vocabulary (a mapping from each unique token to its column index, i.e. {token: index}) like this:
# Get the vocabulary
vocab = pipe.vocabulary
Please note that this is a basic guide and you might need to adjust it based on the specific features and requirements of your project.
Project details
Download files
File details
Details for the file tokenease-0.1.5.tar.gz.
File metadata
- Download URL: tokenease-0.1.5.tar.gz
- Upload date:
- Size: 16.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.6.1 CPython/3.10.13 Darwin/23.3.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bddae34fce5f2f579a887bba828cb5932ef581551f8561cb54646fcccf962023 |
| MD5 | 2d13d7b2de22ee0cb5793a405946b52d |
| BLAKE2b-256 | 1fa4d469b12e3742b89109386bcd498b5b09ed00b094f1700091f6a137cfcf47 |
File details
Details for the file tokenease-0.1.5-py3-none-any.whl.
File metadata
- Download URL: tokenease-0.1.5-py3-none-any.whl
- Upload date:
- Size: 16.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.6.1 CPython/3.10.13 Darwin/23.3.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 52ccf2e3d05f4366ed62622206d3109680fc08e665dde51147da21991df8fdac |
| MD5 | 3f104a6887856a8432e91a4a070ca87d |
| BLAKE2b-256 | 6df2289439e4ba23083ca35d529d98cab6e9cbfbd19071135dd086d8a65a80d0 |