Skip to main content

fast bpe tokenizer used for experimental models

Project description

Shredword

A fast and efficient tokenizer library for natural language processing tasks, built with Python and optimized C backend.

Features

  • High Performance: Fast tokenization powered by optimized C libraries
  • Multiple Encodings: Support for various tokenization models and vocabularies
  • Flexible API: Easy-to-use Python interface with comprehensive functionality
  • Special Tokens: Built-in support for special tokens and custom vocabularies
  • Fallback Mechanisms: Robust error handling with fallback tokenization
  • BPE Support: Byte Pair Encoding implementation for subword tokenization

Installation

pip install shredword

Quick Start

from shred import load_encoding

# Load a tokenizer
tokenizer = load_encoding("pre_16k")

# Encode text to tokens
tokens = tokenizer.encode("Hello, world!")
print(tokens)  # [10478, 10408, 10416, 10416, ...

# Decode tokens back to text
text = tokenizer.decode(tokens)
print(text)  # "Hello, world!"

# Get vocabulary information
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Special tokens: {tokenizer.special_tokens}")

Documentation

For detailed usage instructions, API reference, and examples, please see our User Documentation.

Supported Encodings

Shredword supports various pre-trained tokenization models. The library automatically downloads vocabulary files from the official repository when needed.

Contributing

We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.

Development Setup

  1. Clone the repository
  2. Install development dependencies: pip install -r requirements.txt
  3. Run tests: python -m pytest

Guidelines

  • Follow PEP 8 style guidelines
  • Add tests for new features
  • Update documentation as needed
  • Ensure all tests pass before submitting PRs

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Support

Acknowledgments

Built with performance and simplicity in mind for the NLP community.


Note: This library requires a C/CPP compiler for optimal performance. Fallback Python implementations are available when C/CPP extensions are not available.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

shredword-0.0.5.tar.gz (86.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

shredword-0.0.5-py3-none-any.whl (89.5 kB view details)

Uploaded Python 3

File details

Details for the file shredword-0.0.5.tar.gz.

File metadata

  • Download URL: shredword-0.0.5.tar.gz
  • Upload date:
  • Size: 86.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for shredword-0.0.5.tar.gz
Algorithm Hash digest
SHA256 55c5b58492bef9f9b3cf2567d54dd8904274d584b0b6dfc03468997396949dc9
MD5 5ecaaa0169047e875140fce71e15a4e3
BLAKE2b-256 18d47fdb08b3922181972de5f64ad577f04ab908ad4fd0f013800389b697a404

See more details on using hashes here.

File details

Details for the file shredword-0.0.5-py3-none-any.whl.

File metadata

  • Download URL: shredword-0.0.5-py3-none-any.whl
  • Upload date:
  • Size: 89.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for shredword-0.0.5-py3-none-any.whl
Algorithm Hash digest
SHA256 357e70c3ad8db2e60d041759e6709a133f85646b75bbf2678e05060ff18538db
MD5 d782c4af740bdff3aae91b00970fcf3a
BLAKE2b-256 81bbdcc17aa87ccfbe6d87f1ebe6e63cc28a251ca77f8df4fa5f747da87b00ad

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page