A basic character-level tokenizer

CharTokenizer

Python >= 3.6 | License | PyPI: chartokenizer

Documentation | PyPI | Author

Chartokenizer is a Python package for basic character-level tokenization. It provides functionality to generate a character-to-index mapping for tokenizing strings at the character level. This can be useful in various natural language processing (NLP) tasks where text data needs to be preprocessed for analysis or modeling.

Author: Shashank Kanna R


🚀 Benefits

  • Generates a character-to-index mapping for tokenizing strings.

    text = "This is a Demo Text."

    # When tokenized using chartokenizer
    {' ': 0, '.': 1, 'D': 2, 'T': 3, 'a': 4, 'e': 5, 'h': 6, 'i': 7, 'm': 8, 'o': 9, 's': 10, 't': 11, 'x': 12}

  • Supports both custom character sets and a predefined classic character set.

    # Predefined_classic_character_set
    r" !#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ^_`abcdefghijklmnopqrstuvwxyz{|}~"

  • Provides tokenization and detokenization functions.

    # For the predefined character set
    "hello" => tokenize => [68, 65, 72, 72, 75]
    [68, 65, 72, 72, 75] => detokenize => "hello"

  • Allows saving and loading the character-to-index mapping dictionary to/from a file.

  • Supports padding or truncating tokenized sequences to a fixed length.

    # Padding "hello" to a length of 10 with the value 0
    "hello" => tokenize => [68, 65, 72, 72, 75] => pad_sequence => [68, 65, 72, 72, 75, 0, 0, 0, 0, 0]
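The behaviour listed above can be reproduced with a few lines of plain Python. The sketch below is only an illustration of the idea and is not the chartokenizer implementation: the helper names (`build_mapping`, `charset_mapping`, `tokenize`, `detokenize`, `pad_sequence`) are hypothetical. It assumes the text-derived mapping assigns indices to the distinct characters in sorted order, and that the classic set maps each character to its position in the string, which is consistent with the example outputs shown above.

```python
# Illustrative sketch only: these helpers are hypothetical and are NOT
# the chartokenizer API. They mirror the examples shown in this README.

# The predefined classic character set, copied from the list above.
CLASSIC_CHARSET = r" !#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ^_`abcdefghijklmnopqrstuvwxyz{|}~"


def build_mapping(text):
    """Map each distinct character of `text` to an index, in sorted order."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def charset_mapping(charset):
    """Map each character of a fixed character set to its position."""
    return {ch: i for i, ch in enumerate(charset)}


def tokenize(mapping, text):
    """Convert a string to a list of indices (KeyError on unknown characters)."""
    return [mapping[ch] for ch in text]


def detokenize(mapping, tokens):
    """Convert a list of indices back to a string."""
    inverse = {i: ch for ch, i in mapping.items()}
    return "".join(inverse[t] for t in tokens)


def pad_sequence(tokens, length, pad_value=0):
    """Pad with `pad_value` or truncate so the result has exactly `length` items."""
    return (tokens + [pad_value] * length)[:length]


classic = charset_mapping(CLASSIC_CHARSET)
tokens = tokenize(classic, "hello")
print(tokens)                       # [68, 65, 72, 72, 75]
print(detokenize(classic, tokens))  # hello
print(pad_sequence(tokens, 10))     # [68, 65, 72, 72, 75, 0, 0, 0, 0, 0]
```

Note that in the classic set index 0 is also the space character, so a sequence padded with 0 would detokenize with trailing spaces under this sketch.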

⬇️ Installation

chartokenizer is available as a package on PyPI.

You can install via pip:

pip install chartokenizer

✅ Usage

View the documentation.

from chartokenizer import Tokenizer

# Initialize the tokenizer
tokenizer = Tokenizer()

# Generate character-to-index mapping dictionary
dictionary = tokenizer.initialize(string="your_text_here")

# Tokenize a string
tokens = tokenizer.tokenize(dictionary, "your_text_here")

# Detokenize tokens back to string
text = tokenizer.detokenize(dictionary, tokens)

For more detailed usage and options, refer to the documentation.
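The package lists saving and loading the mapping dictionary among its features, but the function names for this are not shown here. As a package-independent sketch, the dictionary can be persisted with the standard `json` module. The helpers `save_mapping` and `load_mapping` below are hypothetical names and are not part of the chartokenizer API; the file name is likewise only an example.

```python
import json
import os
import tempfile


def save_mapping(mapping, path):
    """Write a character-to-index dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        json.dump(mapping, f, ensure_ascii=False, indent=2)


def load_mapping(path):
    """Read a character-to-index dictionary back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Round trip through a temporary directory (hypothetical file name).
mapping = {" ": 0, ".": 1, "D": 2, "T": 3}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mapping.json")
    save_mapping(mapping, path)
    restored = load_mapping(path)

print(restored == mapping)  # True
```

JSON works here because the keys are single-character strings and the values are integers, so the dictionary survives the round trip unchanged.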


Contributing

Contributions are welcome! If you encounter any issues or have suggestions for improvements, please feel free to open an issue or submit a pull request on GitHub.


License

Released under the Apache License by @MrTechyWorker.


Acknowledgments

  • This package was inspired by the need for a simple and efficient character-level tokenizer in natural language processing tasks.

Learn, Build, Develop !! 😉



Download files

Source Distribution

chartokenizer-1.0.0.tar.gz (8.9 kB view details)

Uploaded Source

Built Distribution

chartokenizer-1.0.0-py3-none-any.whl (9.0 kB view details)

Uploaded Python 3

File details

Details for the file chartokenizer-1.0.0.tar.gz.

File metadata

  • Download URL: chartokenizer-1.0.0.tar.gz
  • Upload date:
  • Size: 8.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.2

File hashes

Hashes for chartokenizer-1.0.0.tar.gz

Algorithm    Hash digest
SHA256       dc8853fd5ae58d27030abfa2692146cdfd6b023132371be55dd874084da53703
MD5          2fc658dac59d83e4a6e3a83935f4525f
BLAKE2b-256  bcfb0d84c8ee34fa4b3c4e02392557f24c8074286055aa74e08d86fa46c151ed

File details

Details for the file chartokenizer-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: chartokenizer-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 9.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.2

File hashes

Hashes for chartokenizer-1.0.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       a80b449b4fcb34822c7b8b51c69c1d61fe5b042c0e50a6955bfc5d5fd5e48aa7
MD5          266df555dd0a26fc8995534050c62724
BLAKE2b-256  157679ad02c6e42384f9d47568adae363579744be7dcbad91304574cf71f8324
