Fast BPE tokenizer used for experimental models

Shredword

A fast and efficient tokenizer library for natural language processing tasks, built with Python and an optimized C backend.

Features

  • High Performance: Fast tokenization powered by optimized C libraries
  • Multiple Encodings: Support for various tokenization models and vocabularies
  • Flexible API: Easy-to-use Python interface with comprehensive functionality
  • Special Tokens: Built-in support for special tokens and custom vocabularies
  • Fallback Mechanisms: Robust error handling with fallback tokenization
  • BPE Support: Byte Pair Encoding implementation for subword tokenization
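For readers new to Byte Pair Encoding, the following is a minimal sketch of the general technique, not Shredword's actual implementation. It starts from raw UTF-8 bytes and repeatedly merges the most frequent adjacent pair into a new token id. The function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are chosen for illustration only.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair in ids, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to num_merges merges over the UTF-8 bytes of text."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new token id
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step  # ids 0..255 are reserved for raw bytes
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return merges, ids


merges, ids = train_bpe("low lower lowest", num_merges=3)
print(merges)  # learned merge table
print(ids)     # the text re-expressed with merged tokens
```

Each merge shortens the sequence, which is why a trained BPE vocabulary encodes common substrings in fewer tokens than raw bytes would.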

Installation

pip install shredword

Quick Start

from shred import load_encoding

# Load a tokenizer
tokenizer = load_encoding("pre_16k")

# Encode text to tokens
tokens = tokenizer.encode("Hello, world!")
print(tokens)  # e.g. [10478, 10408, 10416, 10416, ...]

# Decode tokens back to text
text = tokenizer.decode(tokens)
print(text)  # "Hello, world!"

# Get vocabulary information
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Special tokens: {tokenizer.special_tokens}")
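A quick way to sanity-check any tokenizer is to confirm that decoding an encoded string returns the original text. The sketch below assumes only the `encode` and `decode` methods shown above. `ByteTokenizer` is a hypothetical stand-in so the example runs without Shredword installed; in practice you would pass the object returned by `load_encoding`.

```python
def round_trips(tokenizer, text):
    """Return True if decoding the encoded text reproduces it exactly."""
    return tokenizer.decode(tokenizer.encode(text)) == text


class ByteTokenizer:
    """Hypothetical stand-in tokenizer: one token id per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        return bytes(tokens).decode("utf-8")


samples = ["Hello, world!", "", "naïve café", "日本語"]
print(all(round_trips(ByteTokenizer(), s) for s in samples))
```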

Documentation

For detailed usage instructions, API reference, and examples, please see our User Documentation.

Supported Encodings

Shredword supports various pre-trained tokenization models. The library automatically downloads vocabulary files from the official repository when needed.
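The download-on-demand behavior described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the library's code: `BASE_URL`, `CACHE_DIR`, the `.model` file extension, and the `cached_vocab_path` helper are all hypothetical placeholders.

```python
from pathlib import Path
from urllib.request import urlopen

# Hypothetical values for illustration; not Shredword's real URL or cache path.
BASE_URL = "https://example.com/vocabs"
CACHE_DIR = Path.home() / ".cache" / "tokenizer_vocabs"


def fetch_bytes(url):
    """Download url and return the response body."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def cached_vocab_path(name, cache_dir=CACHE_DIR, fetch=fetch_bytes):
    """Return a local path for name, downloading the file only on first use."""
    path = Path(cache_dir) / f"{name}.model"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        data = fetch(f"{BASE_URL}/{name}.model")
        # Write to a temporary file first so an interrupted download
        # never leaves a truncated file at the final path.
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(data)
        tmp.replace(path)
    return path
```

Passing the `fetch` function as a parameter keeps the caching logic testable without network access.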

Contributing

We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.

Development Setup

  1. Clone the repository
  2. Install development dependencies: pip install -r requirements.txt
  3. Run tests: python -m pytest

Guidelines

  • Follow PEP 8 style guidelines
  • Add tests for new features
  • Update documentation as needed
  • Ensure all tests pass before submitting PRs

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Acknowledgments

Built with performance and simplicity in mind for the NLP community.


Note: This library requires a C/C++ compiler to build the native extensions for optimal performance. Fallback Python implementations are available when the C/C++ extensions cannot be built or loaded.
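A common way to implement this kind of fallback is an optional import: try the compiled module first and switch to pure Python if it is missing. The sketch below illustrates the pattern only. The module name `_fast_tokenizer` and the function names `split_bytes` and `split_bytes_py` are hypothetical and do not describe Shredword's internals.

```python
import importlib


def load_backend(native_name="_fast_tokenizer"):
    """Return (backend_name, split_function), preferring a compiled module.

    native_name is a hypothetical extension module name used for illustration.
    """
    try:
        native = importlib.import_module(native_name)
        return "native", native.split_bytes
    except (ImportError, AttributeError):
        # Extension missing or incomplete: use the slower pure-Python path.
        return "python", split_bytes_py


def split_bytes_py(text):
    """Pure-Python fallback: one token id per UTF-8 byte."""
    return list(text.encode("utf-8"))


backend, split_bytes = load_backend()
print(backend, split_bytes("hi"))
```

Callers use `split_bytes` the same way whichever backend was selected, so the rest of the code does not need to know whether the extension was built.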
