TokenEase
A simple and effective tokenizer.
Project description
TokenEase is a versatile and efficient tokenizer designed to streamline converting text into bag-of-words (BoW) vectors for natural language processing tasks. With its customizable options, TokenEase provides a smooth experience for developers and researchers alike. It is built on top of spaCy and scikit-learn's CountVectorizer, and is designed to be easy to integrate into existing projects. It currently supports only English. If you are interested in contributing, please open an issue or make a pull request.
Installation
Installation using pip:
pip install tokenease
Installation from source (recommended if you want to contribute or modify the code):
Create a virtual environment for Python 3.10+ and install Poetry. Then, from the repository root, install tokenease with Poetry:
poetry install
Usage
Here's a simple usage guide for the Pipe class.
The Pipe class is used to preprocess text data for natural language processing tasks. It provides a pipeline that can perform various transformations on the text data, such as removing accents, converting to lowercase, removing stop words, and tokenizing the text.
Here's a basic example of how to use the Pipe class:
from tokenease import Pipe
# Initialize the pipeline with the desired options
pipe = Pipe(strip_accents=True, lowercase=True, remove_stop_words=True)
# Fit the pipeline to your data and transform it into a bag of words representation
bow, docs = pipe.fit_transform(my_data)
# Transform new data using the fitted pipeline
new_bow, new_docs = pipe.transform(new_data)
# bow is a numpy array of shape (n_samples, n_features).
# docs is a list of strings, one per document, with the tokens joined by a separator.
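To make the output shapes concrete, here is a minimal, self-contained sketch of a bag-of-words matrix built by hand. This is illustrative only, not the TokenEase implementation, and the example documents are made up:

```python
# Illustrative sketch: building a bag-of-words matrix by hand to show
# the (n_samples, n_features) shape described above.
docs = ["the cat sat", "the dog sat down"]

# Build the vocabulary: a mapping {token: column index}, sorted for determinism.
tokens = sorted({tok for doc in docs for tok in doc.split()})
vocab = {tok: i for i, tok in enumerate(tokens)}

# One row per document, one column per vocabulary token; entries are counts.
bow = [[doc.split().count(tok) for tok in tokens] for doc in docs]

print(vocab)  # {'cat': 0, 'dog': 1, 'down': 2, 'sat': 3, 'the': 4}
print(bow)    # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]] -> 2 samples, 5 features
```

TokenEase returns the same kind of matrix, but with spaCy-based preprocessing and scikit-learn's CountVectorizer doing the counting.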
Saving and Loading the Pipeline
You can save the state of the pipeline to a file and load it later. This is useful if you want to reuse the same pipeline across multiple sessions or scripts.
Here's how you can do it:
# Save the pipeline
pipe.save('my_pipeline.joblib')
# Load the pipeline
loaded_pipe = Pipe.from_pretrained('my_pipeline.joblib')
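The save/load round trip follows the usual pattern for persisting a fitted object. As a hedged sketch of that pattern (the `.joblib` extension suggests TokenEase persists with joblib, but this example uses the standard library's pickle, and `ToyPipe` is a hypothetical stand-in, not the real class):

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted pipeline; not the real Pipe class.
class ToyPipe:
    def __init__(self, lowercase=True):
        self.lowercase = lowercase

pipe = ToyPipe(lowercase=True)

# Save the fitted object's state to disk, then load it back later,
# e.g. in another session or script.
path = os.path.join(tempfile.mkdtemp(), "my_pipeline.pkl")
with open(path, "wb") as f:
    pickle.dump(pipe, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)

# The loaded object carries the same configuration as the original.
assert loaded.lowercase == pipe.lowercase
```

With the real class, `pipe.save(...)` and `Pipe.from_pretrained(...)` wrap this round trip for you, so the loaded pipeline can transform new data without refitting.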
Accessing the Vocabulary
After fitting the pipeline to your data, you can access the vocabulary (a mapping from each unique token to its column index, i.e. {token: index}) like this:
# Get the vocabulary
vocab = pipe.vocabulary
Please note that this is a basic guide and you might need to adjust it based on the specific features and requirements of your project.
Project details
Download files
File details
Details for the file tokenease-0.1.5.tar.gz.
File metadata
- Download URL: tokenease-0.1.5.tar.gz
- Upload date:
- Size: 16.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.6.1 CPython/3.10.13 Darwin/23.3.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bddae34fce5f2f579a887bba828cb5932ef581551f8561cb54646fcccf962023 |
| MD5 | 2d13d7b2de22ee0cb5793a405946b52d |
| BLAKE2b-256 | 1fa4d469b12e3742b89109386bcd498b5b09ed00b094f1700091f6a137cfcf47 |
File details
Details for the file tokenease-0.1.5-py3-none-any.whl.
File metadata
- Download URL: tokenease-0.1.5-py3-none-any.whl
- Upload date:
- Size: 16.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.6.1 CPython/3.10.13 Darwin/23.3.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 52ccf2e3d05f4366ed62622206d3109680fc08e665dde51147da21991df8fdac |
| MD5 | 3f104a6887856a8432e91a4a070ca87d |
| BLAKE2b-256 | 6df2289439e4ba23083ca35d529d98cab6e9cbfbd19071135dd086d8a65a80d0 |