Fast BPE tokenizer used for experimental models

These details have not been verified by PyPI

Project links

Homepage

Project description

Shredword

A fast and efficient tokenizer library for natural language processing tasks, built with Python and an optimized C backend.

Features

High Performance: Fast tokenization powered by optimized C libraries
Multiple Encodings: Support for various tokenization models and vocabularies
Flexible API: Easy-to-use Python interface with comprehensive functionality
Special Tokens: Built-in support for special tokens and custom vocabularies
Fallback Mechanisms: Robust error handling with fallback tokenization
BPE Support: Byte Pair Encoding implementation for subword tokenization
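
To make the BPE feature concrete, here is a minimal byte-level sketch of Byte Pair Encoding in pure Python. It is an illustration of the general technique only, not Shredword's implementation; the function names (train_bpe, apply_merge, encode, decode) are hypothetical and are not part of the library's API.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges would not compress
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges


def encode(text, merges):
    """Split text into bytes, then apply the learned merges in order."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


def decode(tokens):
    """Concatenate byte tokens and decode them back to a string."""
    return b"".join(tokens).decode("utf-8")


sample = "aaabdaaabac"
merges = train_bpe(sample, num_merges=3)
pieces = encode(sample, merges)
print(len(sample), "bytes ->", len(pieces), "tokens")
print(decode(pieces) == sample)
```

A production tokenizer maps each merged byte sequence to an integer ID and usually pre-splits text before merging; this sketch leaves both out to keep the core merge loop visible.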

Installation

pip install shredword

Quick Start

from shred import load_encoding

# Load a tokenizer
tokenizer = load_encoding("pre_16k")

# Encode text to tokens
tokens = tokenizer.encode("Hello, world!")
print(tokens)  # [10478, 10408, 10416, 10416, ...

# Decode tokens back to text
text = tokenizer.decode(tokens)
print(text)  # "Hello, world!"

# Get vocabulary information
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Special tokens: {tokenizer.special_tokens}")
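
When trying out an encoding, it can be useful to check that decode(encode(text)) returns the original text for a set of samples. The sketch below relies only on the encode and decode methods shown in the Quick Start. The ByteTokenizer class is a hypothetical stand-in so that the example runs without Shredword installed, and the helper name roundtrip_failures is an assumption, not a library function. Whether every Shredword encoding round-trips exactly for all inputs is not stated here, so treat failures as something to investigate rather than as a bug.

```python
class ByteTokenizer:
    """Hypothetical stand-in with the same encode/decode shape as the Quick Start object."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        return bytes(tokens).decode("utf-8")


def roundtrip_failures(tokenizer, samples):
    """Return the samples for which decode(encode(sample)) differs from the sample."""
    failures = []
    for sample in samples:
        if tokenizer.decode(tokenizer.encode(sample)) != sample:
            failures.append(sample)
    return failures


samples = ["Hello, world!", "naïve café", ""]
print(roundtrip_failures(ByteTokenizer(), samples))

# With Shredword installed, the same helper should accept the object from
# the Quick Start, e.g. roundtrip_failures(load_encoding("pre_16k"), samples).
```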

Documentation

For detailed usage instructions, API reference, and examples, please see our User Documentation.

Supported Encodings

Shredword supports various pre-trained tokenization models. The library automatically downloads vocabulary files from the official repository when needed.

Contributing

We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.

Development Setup

Clone the repository
Install development dependencies: pip install -r requirements.txt
Run tests: python -m pytest

Guidelines

Follow PEP 8 style guidelines
Add tests for new features
Update documentation as needed
Ensure all tests pass before submitting PRs

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Support

Issues: Report bugs or request features on GitHub Issues
Discussions: Join community discussions on GitHub Discussions

Acknowledgments

Built with performance and simplicity in mind for the NLP community.

Note: A C/C++ compiler is needed to build the native extensions that give the best performance. Pure-Python fallback implementations are used when the C/C++ extensions are not available.
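
The fallback behaviour described in the note is commonly implemented with a guarded import. The sketch below shows that general pattern only; the module name _hypothetical_c_tokenizer and the function split_bytes are made up for illustration and do not correspond to Shredword's actual internals.

```python
import importlib


def _split_bytes_py(text):
    """Pure-Python fallback: return the UTF-8 byte values of the text."""
    return list(text.encode("utf-8"))


def load_backend(module_name="_hypothetical_c_tokenizer"):
    """Prefer a compiled extension if it can be imported, else fall back to Python.

    `module_name` is a hypothetical name used only for this sketch.
    """
    try:
        module = importlib.import_module(module_name)
        return module.split_bytes, "c"
    except (ImportError, AttributeError):
        return _split_bytes_py, "python"


split_bytes, backend = load_backend()
print(backend)
print(split_bytes("hi"))
```

Selecting the backend once at import time keeps call sites identical regardless of which implementation is active.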

Release history

0.1.2: Dec 19, 2025
0.1.1: Oct 3, 2025
0.1.0: Jul 1, 2025
0.0.5 (this version): Jun 23, 2025

Download files

Source Distribution

shredword-0.0.5.tar.gz (86.1 kB)

Uploaded Jun 23, 2025 Source

Built Distribution

shredword-0.0.5-py3-none-any.whl (89.5 kB)

Uploaded Jun 23, 2025 Python 3

File details

Details for the file shredword-0.0.5.tar.gz.

File metadata

Download URL: shredword-0.0.5.tar.gz
Upload date: Jun 23, 2025
Size: 86.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for shredword-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`55c5b58492bef9f9b3cf2567d54dd8904274d584b0b6dfc03468997396949dc9`
MD5	`5ecaaa0169047e875140fce71e15a4e3`
BLAKE2b-256	`18d47fdb08b3922181972de5f64ad577f04ab908ad4fd0f013800389b697a404`

File details

Details for the file shredword-0.0.5-py3-none-any.whl.

File metadata

Download URL: shredword-0.0.5-py3-none-any.whl
Upload date: Jun 23, 2025
Size: 89.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for shredword-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`357e70c3ad8db2e60d041759e6709a133f85646b75bbf2678e05060ff18538db`
MD5	`d782c4af740bdff3aae91b00970fcf3a`
BLAKE2b-256	`81bbdcc17aa87ccfbe6d87f1ebe6e63cc28a251ca77f8df4fa5f747da87b00ad`
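
The SHA256 digests listed above can be checked against a downloaded file with Python's standard hashlib module. This is a minimal sketch; the helper names sha256_of_file and matches_digest are illustrative, and the file path in the closing comment assumes the archive was saved to the current directory.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())


# Example call, using the source distribution digest from the table above:
# matches_digest(
#     "shredword-0.0.5.tar.gz",
#     "55c5b58492bef9f9b3cf2567d54dd8904274d584b0b6dfc03468997396949dc9",
# )
```

pip can perform the same check at install time when hashes are pinned in a requirements file and the --require-hashes option is used.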
