Fast token count estimation library

skimtoken (Beta)

A lightweight, fast token count estimation library written in Rust with Python bindings. Built for applications where approximate token counts work fine and memory/startup time efficiency matters.

Why skimtoken?

tiktoken is great for precise tokenization, but it comes with significant overhead for simple token counting, especially in memory usage and initialization time:

./scripts/run_benchmark_multiple.sh
╭────────────────── Mean Results After 100 Runs ─────────────────╮
│ Mean RMSE: 12.5526 tokens                                      │
├─────────────────┬──────────────┬──────────────┬────────────────┤
│ Metric          │   tiktoken   │  skimtoken   │     Ratio      │
├─────────────────┼──────────────┼──────────────┼────────────────┤
│ Init Time       │   0.135954 s │   0.001022 s │         0.007x │
│ Init Memory     │    84.5169 MB│     0.4292 MB│         0.005x │
│ Exec Time       │   0.002947 s │   0.113127 s │        38.387x │
│ Exec Memory     │     0.6602 MB│     0.0485 MB│         0.073x │
├─────────────────┼──────────────┼──────────────┼────────────────┤
│ TOTAL Time      │   0.138901 s │   0.114149 s │         0.821x │
│ TOTAL Memory    │    85.1770 MB│     0.4777 MB│         0.005x │
╰─────────────────┴──────────────┴──────────────┴────────────────╯

Memory Advantages

skimtoken uses >99% less memory than tiktoken:

  • tiktoken: ~84.5MB for initialization (loading vocabulary and encoder files), ~85.2MB total peak usage
  • skimtoken: ~0.43MB for initialization, ~0.48MB total peak usage
  • About 178x less total memory, which suits memory-constrained environments

Memory-Efficient Design:

  • No large vocabulary files to load into memory
  • Minimal runtime memory footprint
  • Predictable memory usage patterns

Performance Trade-offs: skimtoken targets memory-constrained scenarios and cold-start environments where initialization time directly impacts user experience. tiktoken is more accurate and about 38x faster per run once it is loaded, but skimtoken starts up about 133x faster and uses about 178x less total memory. In the benchmark above, that makes skimtoken about 1.22x faster end to end when the library has to be loaded fresh each time.
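The ratios quoted here follow directly from the benchmark table. The short script below recomputes them; the variable names are illustrative, and the numbers are copied from the mean results shown earlier.

```python
# Mean results copied from the benchmark table (100 runs).
tiktoken_init_s, skimtoken_init_s = 0.135954, 0.001022
tiktoken_exec_s, skimtoken_exec_s = 0.002947, 0.113127
tiktoken_total_s, skimtoken_total_s = 0.138901, 0.114149
tiktoken_total_mb, skimtoken_total_mb = 85.1770, 0.4777

# How many times faster skimtoken initializes.
startup_speedup = tiktoken_init_s / skimtoken_init_s
# How many times slower skimtoken is once both libraries are loaded.
exec_slowdown = skimtoken_exec_s / tiktoken_exec_s
# End-to-end speedup when the library is loaded fresh for a single run.
total_speedup = tiktoken_total_s / skimtoken_total_s
# How many times less total memory skimtoken uses.
memory_reduction = tiktoken_total_mb / skimtoken_total_mb

print(f"startup: {startup_speedup:.0f}x faster")
print(f"execution: {exec_slowdown:.0f}x slower")
print(f"total time: {total_speedup:.2f}x faster")
print(f"total memory: {memory_reduction:.0f}x less")
```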

This makes skimtoken valuable in:

  • Serverless functions with strict memory limits (128MB-512MB)
  • Edge computing environments with limited RAM
  • Mobile applications where memory matters
  • Containerized microservices with tight memory constraints
  • Shared hosting environments where memory usage affects cost

Installation

pip install skimtoken

Usage

from skimtoken import estimate_tokens

# Basic usage
text = "Hello, world! How are you today?"
token_count = estimate_tokens(text)
print(f"Estimated tokens: {token_count}")

# Works with any text
code = """
def hello_world():
    print("Hello, world!")
    return True
"""
tokens = estimate_tokens(code)
print(f"Code tokens: {tokens}")
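Because the result is an estimate, it can help to leave some headroom when checking text against a context limit. The sketch below is one possible approach and is not part of skimtoken's API: `padded_estimate` and `fits_context` are hypothetical helpers, and padding by two times the reported RMSE (12.55 tokens) is an arbitrary choice, since RMSE is an average error and not a hard bound.

```python
import math


def padded_estimate(estimated_tokens: int, rmse: float = 12.55, k: float = 2.0) -> int:
    """Add k * rmse tokens of headroom to an approximate count (hypothetical helper)."""
    return estimated_tokens + math.ceil(k * rmse)


def fits_context(estimated_tokens: int, limit: int, rmse: float = 12.55, k: float = 2.0) -> bool:
    """Return True if the padded estimate stays within the token limit."""
    return padded_estimate(estimated_tokens, rmse, k) <= limit


# An estimate of 100 tokens is padded by ceil(2 * 12.55) = 26 tokens.
print(padded_estimate(100))
print(fits_context(100, limit=128))
```

In practice the first argument would be the value returned by `estimate_tokens(text)`.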

Language Support

skimtoken uses language-specific parameters tailored for different language families to improve estimation accuracy. Each language family has its own optimized coefficients based on tokenization patterns.

Supported languages: English, French, Spanish, German, Russian, Hindi, Arabic, Chinese, Japanese, Korean, etc.
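To illustrate the general idea of per-family coefficients, here is a deliberately simplified toy estimator. It is not skimtoken's actual algorithm: the function names, the two script families, and the characters-per-token ratios are all made-up placeholders chosen only to show how a coefficient lookup could work.

```python
# Hypothetical characters-per-token ratios. NOT skimtoken's real parameters.
CHARS_PER_TOKEN = {"latin": 4.0, "cjk": 1.5}


def is_cjk(ch: str) -> bool:
    """Rough check for CJK ideographs, kana, and Hangul syllables."""
    return (
        "\u4e00" <= ch <= "\u9fff"      # CJK Unified Ideographs
        or "\u3040" <= ch <= "\u30ff"   # Hiragana and Katakana
        or "\uac00" <= ch <= "\ud7af"   # Hangul Syllables
    )


def script_family(text: str) -> str:
    """Pick a family by majority vote over the characters in the text."""
    cjk_chars = sum(1 for ch in text if is_cjk(ch))
    return "cjk" if cjk_chars * 2 > len(text) else "latin"


def toy_estimate(text: str) -> int:
    """Estimate tokens as character count divided by a per-family ratio."""
    if not text:
        return 0
    ratio = CHARS_PER_TOKEN[script_family(text)]
    return max(1, round(len(text) / ratio))


print(script_family("Hello, world!"), toy_estimate("Hello, world!"))
print(script_family("こんにちは"), toy_estimate("こんにちは"))
```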

Current Accuracy: RMSE of 12.55 tokens across 146 samples (12,377 characters), covering multiple language families and text types
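For reference, RMSE (root mean squared error) is the square root of the average squared difference between estimated and reference token counts. A minimal sketch of that calculation follows; the `rmse` helper and the sample numbers are illustrative and are not taken from the project's test suite.

```python
import math
from typing import Sequence


def rmse(estimated: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean squared error between two equal-length sequences of counts."""
    if len(estimated) != len(actual) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(e - a) ** 2 for e, a in zip(estimated, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative counts only: estimates are off by 3 tokens in each case.
print(rmse([10, 23], [13, 20]))
```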

When to Use skimtoken vs tiktoken

Use skimtoken when:

  • Working in serverless/edge environments (Cloudflare Workers, AWS Lambda, Vercel Functions) where cold start time and memory usage matter
  • You need quick token estimates for API planning and cost estimation
  • Initialization overhead is a concern (e.g., short-lived processes that can't amortize tiktoken's startup cost)
  • Approximate counts work for your use case
  • Memory constraints are tight

Use tiktoken when:

  • You need exact token counts for specific models and tokenization-dependent features
  • Processing large batches of text where you can load the encoder once and reuse it
  • Building applications that require precise tokenization (not just counting)
  • You have persistent memory and can afford tiktoken's initialization cost
  • Accuracy is more important than speed/memory efficiency

Key Trade-off: While tiktoken is faster for individual tokenization operations and more accurate, skimtoken excels in environments where you can't afford to keep encoders loaded in memory or where cold start performance matters more than raw throughput.
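To make the amortization point concrete, the sketch below estimates how many runs per process it takes before tiktoken's startup cost pays off. It assumes a simple linear cost model (one-time initialization plus a fixed cost per run) and treats one "run" as one execution of the benchmark workload from the table above; the function names are illustrative.

```python
# (init seconds, exec seconds per run), copied from the benchmark table.
TIKTOKEN = (0.135954, 0.002947)
SKIMTOKEN = (0.001022, 0.113127)


def total_time(init_s: float, exec_s: float, runs: int) -> float:
    """Total wall time for one process that initializes once and runs `runs` times."""
    return init_s + runs * exec_s


def break_even_runs(slow_start: tuple, fast_start: tuple) -> float:
    """Number of runs at which both libraries take the same total time."""
    slow_init, slow_exec = slow_start
    fast_init, fast_exec = fast_start
    return (slow_init - fast_init) / (fast_exec - slow_exec)


n = break_even_runs(TIKTOKEN, SKIMTOKEN)
print(f"break-even after about {n:.2f} runs per process")
for runs in (1, 2, 10):
    print(runs, round(total_time(*TIKTOKEN, runs), 4), round(total_time(*SKIMTOKEN, runs), 4))
```

Under these assumptions skimtoken wins only when a process handles roughly one benchmark-sized workload before exiting, which matches the cold-start use case described above.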

Roadmap

Automated Parameter Optimization: The plan is to implement hyperparameter tuning on large-scale datasets, such as samples from CC100, to minimize RMSE across language families.

The goal is to achieve sub-10 RMSE for major language families while preserving skimtoken's core advantages of minimal initialization overhead and memory usage.

Testing & Development

# Install dependencies
uv sync

# Build for development
uv run maturin dev --features python

# Run tests
cargo test
uv run pytest

# Run specific test with verbose output
uv run pytest tests/test_skimtoken_simple.py -s

# Run performance benchmarks
uv run scripts/benchmark.py

Test Results

Run accuracy testing:

uv run pytest tests/test_skimtoken_simple.py -s
RMSE by Category:
╭───────────────────────┬───────┬─────────┬────────╮
│ Category              │  RMSE │ Samples │ Status │
├───────────────────────┼───────┼─────────┼────────┤
│ ambiguous_punctuation │  2.88 │       7 │ ✓ PASS │
│ code                  │ 10.15 │      14 │ ✓ PASS │
│ edge                  │  0.00 │       2 │ ✓ PASS │
│ json                  │  8.54 │       3 │ ✓ PASS │
│ jsonl                 │ 15.51 │       2 │ ✓ PASS │
│ mixed                 │  4.12 │       3 │ ✓ PASS │
│ noisy_text            │  4.04 │       7 │ ✓ PASS │
│ repetitive            │  7.25 │       4 │ ✓ PASS │
│ rtl                   │  3.71 │       4 │ ✓ PASS │
│ special               │  4.69 │       3 │ ✓ PASS │
│ special_encoding      │ 10.65 │       8 │ ✓ PASS │
│ structured_format     │  3.82 │       8 │ ✓ PASS │
│ unknown               │ 15.43 │      81 │ ✓ PASS │
╰───────────────────────┴───────┴─────────┴────────╯

Summary Statistics:
Overall RMSE: 12.55 tokens
Total samples processed: 146
Total characters: 12,377
Execution time: 0.121 seconds
Processing speed: 1204 samples/second
Character throughput: 102,110 chars/second
Average per character: 9.793μs
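The overall RMSE can be cross-checked from the per-category rows, since a pooled RMSE is the square root of the sample-weighted mean of the squared category RMSEs. A small sketch using the numbers from the table above; the category values are rounded to two decimals, so the result only matches approximately.

```python
import math

# (category, RMSE, samples), copied from the "RMSE by Category" table.
CATEGORIES = [
    ("ambiguous_punctuation", 2.88, 7),
    ("code", 10.15, 14),
    ("edge", 0.00, 2),
    ("json", 8.54, 3),
    ("jsonl", 15.51, 2),
    ("mixed", 4.12, 3),
    ("noisy_text", 4.04, 7),
    ("repetitive", 7.25, 4),
    ("rtl", 3.71, 4),
    ("special", 4.69, 3),
    ("special_encoding", 10.65, 8),
    ("structured_format", 3.82, 8),
    ("unknown", 15.43, 81),
]

total_samples = sum(n for _, _, n in CATEGORIES)
# Each category contributes n * RMSE^2 to the total squared error.
total_squared_error = sum(n * r ** 2 for _, r, n in CATEGORIES)
overall_rmse = math.sqrt(total_squared_error / total_samples)

print(total_samples)            # 146 samples, as in the summary
print(f"{overall_rmse:.2f}")    # about 12.55
```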

Contributing

Contributions are welcome! Feel free to submit issues or pull requests.

License

MIT License - see LICENSE for details.
