Fast token count estimation library

skimtoken (Early Beta)

⚠️ WARNING: This is an early beta version. The current implementation is not production-ready.

A lightweight, fast token count estimation library written in Rust with Python bindings.

⚠️ Current Limitations

This library is currently in early beta and has significant accuracy and performance issues:

Multilingual method: Takes 48.60x longer than tiktoken in total time due to an inefficient implementation
Overall accuracy: 15.11% mean error rate for the multilingual method (21.63% for the simple method), which is too high for most use cases

Why skimtoken?

The Problem: tiktoken is great for precise tokenization, but requires ~60MB of memory just to count tokens - problematic for memory-constrained environments.

The Solution: skimtoken estimates token counts using statistical patterns instead of loading entire vocabularies, achieving:

✅ 64x less memory (0.92MB vs 60MB)
✅ ~76x faster startup (6ms vs 471ms) in the multilingual benchmark below
❌ 48.60x slower total time (246.17s vs 5.07s) for the multilingual method
❌ Trade-off: 15.11% mean error rate (multilingual method) vs exact counts

Installation

pip install skimtoken

Requirements: Python 3.9+

Quick Start

Simple method (character length x coefficient):

from skimtoken import estimate_tokens

# Basic usage
text = "Hello, world! How are you today?"
token_count = estimate_tokens(text)
print(f"Estimated tokens: {token_count}")
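The "character length x coefficient" idea can be sketched in pure Python. This is an illustration only: `naive_estimate` is not part of skimtoken, and `CHARS_TO_TOKENS` is a hypothetical placeholder (roughly four characters per token), not the coefficient the library ships.

```python
# Hypothetical coefficient (~4 characters per token); NOT skimtoken's learned value.
CHARS_TO_TOKENS = 0.25


def naive_estimate(text: str, coefficient: float = CHARS_TO_TOKENS) -> int:
    """Estimate a token count as character length times a fixed coefficient."""
    return round(len(text) * coefficient)


# The sample text has 32 characters, so this prints 8.
print(naive_estimate("Hello, world! How are you today?"))
```

A single coefficient works best when the text resembles the data it was tuned on, which is why the error rate varies by language.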

Multilingual simple method:

from skimtoken.multilingual_single import estimate_tokens

multilingual_text = """
For non-space separated languages, the number of tokens is difficult to predict.
スペースで区切られていない言語の場合トークン数を予測するのは難しいです。
स्पेसद्वारावियोजितनहींभाषाओंकेलिएटोकनकीसंख्याकाअनुमानलगानाकठिनहै।
بالنسبةللغاتالتيلاتفصلبمسافاتفإنالتنبؤبعددالرموزصعب
"""
token_count = estimate_tokens(multilingual_text)
print(f"Estimated tokens (multilingual): {token_count}")

When to Use skimtoken

✅ Perfect for:

| Use Case | Why It Works | Example |
| --- | --- | --- |
| Rate Limiting | Overestimating is safe | Prevent API quota exceeded |
| Cost Estimation | Users prefer conservative estimates | "$0.13" (actual: $0.10) |
| Progress Bars | Approximate progress is fine | Processing documents |
| Serverless/Edge | Memory constraints (128MB limits) | Cloudflare Workers |
| Quick Filtering | Remove obviously too-long content | Pre-screening |
| Model Switching | Switch to a smarter model when the context is long | Auto-escalation |
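As a rough sketch of the rate-limiting case, the estimate can be padded so that errors lean toward over-counting. Everything here is an assumption for illustration: `TokenBudget`, `SAFETY_MARGIN`, and `stand_in_estimate` are hypothetical names, and the stand-in estimator exists only so the snippet runs without skimtoken installed. In practice you could pass skimtoken's `estimate_tokens` instead.

```python
import math
from typing import Callable

# Assumed padding factor, loosely motivated by the ~15% mean error reported in this README.
# Individual texts can be off by more, so tune this on your own data.
SAFETY_MARGIN = 1.15


class TokenBudget:
    """Tracks a token quota using padded (deliberately high) estimates."""

    def __init__(self, limit: int, estimate: Callable[[str], int]) -> None:
        self.limit = limit
        self.estimate = estimate
        self.used = 0

    def try_spend(self, text: str) -> bool:
        """Reserve tokens for `text`; return False if the quota would be exceeded."""
        cost = math.ceil(self.estimate(text) * SAFETY_MARGIN)
        if self.used + cost > self.limit:
            return False
        self.used += cost
        return True


def stand_in_estimate(text: str) -> int:
    """Placeholder estimator so the sketch runs without skimtoken installed."""
    return round(len(text) * 0.25)


budget = TokenBudget(limit=20, estimate=stand_in_estimate)
for _ in range(3):
    # Each request costs ceil(8 * 1.15) = 10 tokens, so the third one is rejected.
    print(budget.try_spend("Hello, world! How are you today?"))
```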

❌ Not suitable for:

| Use Case | Why It Fails | Use Instead |
| --- | --- | --- |
| Context Limits | Underestimating causes failures | tiktoken |
| Exact Billing | 15% error = unhappy customers | tiktoken |
| Token Splitting | Chunks might exceed limits | tiktoken |
| Embeddings | Need exact token boundaries | tiktoken |
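One possible way to combine the two tables is to use the estimate only for clear-cut cases and fall back to an exact tokenizer near the limit. The sketch below is an assumption, not a skimtoken feature: `fits_context` and `max_error` are hypothetical, and both counters are passed in as callables (for example skimtoken's `estimate_tokens` and a tiktoken-based counter). The stand-ins at the bottom are toy functions so the snippet runs on its own.

```python
from typing import Callable


def fits_context(
    text: str,
    limit: int,
    estimate: Callable[[str], int],
    exact: Callable[[str], int],
    max_error: float = 0.30,  # assumed worst-case relative error; tune on your data
) -> bool:
    """Decide whether `text` fits in `limit` tokens, counting exactly only when unsure."""
    approx = estimate(text)
    if approx * (1 + max_error) <= limit:
        return True  # well below the limit even if the estimate is too low
    if approx * (1 - max_error) > limit:
        return False  # well above the limit even if the estimate is too high
    return exact(text) <= limit  # borderline: pay for the exact count


def toy_estimate(text: str) -> int:
    """Stand-in estimator (characters / 4)."""
    return len(text) // 4


def toy_exact(text: str) -> int:
    """Stand-in for an exact tokenizer (counts whitespace-separated words)."""
    return len(text.split())


print(fits_context("short text", limit=100, estimate=toy_estimate, exact=toy_exact))
```

If a single text can be off by more than `max_error`, the shortcut branches can still be wrong, so hard context limits remain a job for the exact tokenizer.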

Performance Comparison

Large-Scale Benchmark (100k samples)

Simple method (character length x coefficient):

Results:
Total Samples: 100,726
Total Characters: 13,062,391
Mean RMSE: 38.4863 tokens
Mean Error Rate: 21.63%

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┓
┃ Metric       ┃   tiktoken ┃  skimtoken ┃  Ratio ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━┩
│ Init Time    │ 0.481672 s │ 0.182308 s │ 0.378x │
├──────────────┼────────────┼────────────┼────────┤
│ Init Memory  │ 42.2386 MB │  0.0291 MB │ 0.001x │
├──────────────┼────────────┼────────────┼────────┤
│ Exec Time    │ 4.710224 s │ 0.805272 s │ 0.171x │
├──────────────┼────────────┼────────────┼────────┤
│ Exec Memory  │ 17.3251 MB │  0.8849 MB │ 0.051x │
├──────────────┼────────────┼────────────┼────────┤
│ Total Time   │ 5.191896 s │ 0.987580 s │ 0.190x │
├──────────────┼────────────┼────────────┼────────┤
│ Total Memory │ 59.5637 MB │  0.9214 MB │ 0.015x │
└──────────────┴────────────┴────────────┴────────┘
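For readers who want to compute similar metrics on their own data, here is a minimal sketch. The exact definitions used by the benchmark script are not shown in this README, so this assumes RMSE over per-sample token counts and an error rate defined as the mean of |predicted - actual| / actual. The sample numbers are made up.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean squared error between predicted and actual token counts."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def mean_error_rate(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute relative error, in percent (assumed definition)."""
    rates = [abs(p - a) / a for p, a in zip(predicted, actual)]
    return 100 * sum(rates) / len(rates)


# Made-up counts, for illustration only.
predicted = [10, 20, 33]
actual = [8, 25, 30]
print(f"RMSE: {rmse(predicted, actual):.4f} tokens")
print(f"Error rate: {mean_error_rate(predicted, actual):.2f}%")
```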

Multilingual simple method:

Results:
Total Samples: 100,726
Total Characters: 13,062,391
Mean RMSE: 21.3034 tokens
Mean Error Rate: 15.11%

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Metric       ┃   tiktoken ┃    skimtoken ┃   Ratio ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ Init Time    │ 0.471222 s │   0.006207 s │  0.013x │
├──────────────┼────────────┼──────────────┼─────────┤
│ Init Memory  │ 42.2385 MB │    0.0283 MB │  0.001x │
├──────────────┼────────────┼──────────────┼─────────┤
│ Exec Time    │ 4.594160 s │ 246.164618 s │ 53.582x │
├──────────────┼────────────┼──────────────┼─────────┤
│ Exec Memory  │ 17.3251 MB │    0.8950 MB │  0.052x │
├──────────────┼────────────┼──────────────┼─────────┤
│ Total Time   │ 5.065382 s │ 246.170825 s │ 48.599x │
├──────────────┼────────────┼──────────────┼─────────┤
│ Total Memory │ 59.5636 MB │    0.9233 MB │  0.016x │
└──────────────┴────────────┴──────────────┴─────────┘
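The Ratio column is the skimtoken value divided by the tiktoken value, so anything below 1 favors skimtoken. The snippet below recomputes a few ratios from the multilingual table as a sanity check. The numbers are copied from the table above, and the `ratio` helper is just for illustration.

```python
# (tiktoken, skimtoken) values copied from the multilingual benchmark table.
multilingual = {
    "Init Time": (0.471222, 0.006207),       # seconds
    "Exec Time": (4.594160, 246.164618),     # seconds
    "Total Time": (5.065382, 246.170825),    # seconds
    "Total Memory": (59.5636, 0.9233),       # MB
}


def ratio(tiktoken_value: float, skimtoken_value: float) -> float:
    """Ratio column: skimtoken divided by tiktoken (below 1 favors skimtoken)."""
    return skimtoken_value / tiktoken_value


for metric, (tik, skim) in multilingual.items():
    print(f"{metric:<12} {ratio(tik, skim):8.3f}x")

# The "64x less memory" headline is the inverse of the Total Memory ratio.
tik_mem, skim_mem = multilingual["Total Memory"]
print(f"Memory reduction: {tik_mem / skim_mem:.1f}x")
```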

Available Methods

| Method | Import | Memory | Error | Best For |
| --- | --- | --- | --- | --- |
| Simple | `from skimtoken.simple import estimate_tokens` | 0.8MB | ~21% | English text, minimum memory |
| Basic | `from skimtoken.basic import estimate_tokens` | 0.8MB | ~27% | General use |
| Multilingual | `from skimtoken.multilingual import estimate_tokens` | 0.9MB | ~15% | Non-English, mixed languages |

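# Example: choose a method based on your needs
memory_critical = False  # set True when memory is the main constraint
mixed_languages = True   # set True for non-English or mixed-language text

if memory_critical:
    from skimtoken.simple import estimate_tokens
elif mixed_languages:
    from skimtoken.multilingual import estimate_tokens
else:
    from skimtoken import estimate_tokens  # Default: simple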

CLI Usage

# From standard input
echo "Hello, world!" | skimtoken
# Output: 5

# From file
skimtoken -f document.txt
# Output: 236

# Multiple files
cat *.md | skimtoken
# Output: 4846

How It Works

Unlike tiktoken's vocabulary-based approach, skimtoken uses statistical patterns:

tiktoken:

Text → Tokenizer → ["Hello", ",", " world"] → Vocabulary Lookup → [1234, 11, 4567] → Count: 3
                                                      ↑
                                              Requires 60MB dictionary

skimtoken:

Text → Feature Extraction → {chars: 13, words: 2, lang: "en"} → Statistical Model → ~3 tokens
                                                                         ↑
                                                                  Only 0.92MB of parameters
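The pipeline above can be sketched as feature extraction followed by a small linear model. This is a simplified illustration, not the library's implementation: the `Features` class, `extract_features`, `predict_tokens`, and the weights are all hypothetical, and language detection is left out. The learned parameters that skimtoken actually uses live in `params/`.

```python
from dataclasses import dataclass


@dataclass
class Features:
    chars: int
    words: int


def extract_features(text: str) -> Features:
    """Cheap surface features: character count and whitespace-separated word count."""
    return Features(chars=len(text), words=len(text.split()))


# Hypothetical weights for illustration; NOT skimtoken's learned parameters.
WEIGHTS = {"chars": 0.1, "words": 0.9, "bias": 0.0}


def predict_tokens(features: Features, weights: dict = WEIGHTS) -> int:
    """Linear model over the features, clamped to a non-negative integer."""
    raw = (
        weights["chars"] * features.chars
        + weights["words"] * features.words
        + weights["bias"]
    )
    return max(0, round(raw))


features = extract_features("Hello, world!")
print(features)                  # Features(chars=13, words=2)
print(predict_tokens(features))  # 3
```

No vocabulary is loaded at any point, which is where the memory savings come from.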

Advanced Usage

Optimize for Your Domain

Improve accuracy on domain-specific content:

# 1. Prepare labeled data
# Format: {"text": "your content", "actual_tokens": 123}
uv run scripts/prepare_dataset.py --input your_texts.txt

# 2. Optimize parameters
uv run scripts/optimize_all.py --dataset your_data.jsonl

# 3. Rebuild with custom parameters
uv run maturin build --release
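To give a feel for what parameter optimization means, here is a minimal sketch that fits a single characters-to-tokens coefficient by least squares from records in the `{"text": ..., "actual_tokens": ...}` format shown above. It is an assumption for illustration: `fit_coefficient` is hypothetical and is not what `scripts/optimize_all.py` does internally.

```python
import json
from typing import Iterable


def fit_coefficient(jsonl_lines: Iterable[str]) -> float:
    """Least-squares fit of tokens ~= coefficient * characters (no intercept)."""
    sum_xy = 0.0
    sum_xx = 0.0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        chars = len(record["text"])
        tokens = record["actual_tokens"]
        sum_xy += chars * tokens
        sum_xx += chars * chars
    if sum_xx == 0:
        raise ValueError("need at least one non-empty text sample")
    # Closed-form minimizer of sum((tokens - c * chars) ** 2).
    return sum_xy / sum_xx


# Made-up samples; with real data, iterate over an open JSONL file instead.
samples = [
    '{"text": "abcdefgh", "actual_tokens": 2}',
    '{"text": "abcdefghijklmnop", "actual_tokens": 4}',
]
print(fit_coefficient(samples))  # 0.25
```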

Architecture

skimtoken/
├── src/
│   ├── lib.rs                        # Core Rust library with PyO3 bindings
│   └── methods/
│       ├── method_simple.rs          # Character-based estimation
│       ├── method_basic.rs           # Multi-feature regression  
│       └── method_multilingual.rs    # Language-aware estimation
├── skimtoken/                        # Python package
│   ├── __init__.py                   # Main API
│   └── {method}.py                   # Method-specific imports
├── params/                           # Learned parameters (TOML)
└── scripts/
    ├── benchmark.py                  # Performance testing
    └── optimize/                     # Parameter training

Development

# Setup
git clone https://github.com/masaishi/skimtoken
cd skimtoken
uv sync

# Development build
uv run maturin dev --features python

# Run tests
cargo test
uv run pytest

# Benchmark
uv run scripts/benchmark.py

FAQ

Q: Can I improve accuracy?
A: Yes! You can adjust the parameters using your own data to improve accuracy. See Advanced Usage for details.

Q: Is the API stable?
A: Not yet. The library is in beta, so breaking changes are possible.

Future Plans

We are actively working to improve skimtoken's accuracy and performance:

Better estimation algorithms: Moving beyond simple character multiplication to more sophisticated statistical models
Performance optimization: Fixing the roughly 49x slowdown (total time, about 54x in execution time) in the multilingual method
Improved language support: Better handling of non-English languages
Higher accuracy: Targeting <10% error rate while maintaining low memory footprint

License

MIT License - see LICENSE for details.

Project details

Release history

0.2.2

Jul 8, 2025

0.2.1

Jul 6, 2025

This version

0.2.0

Jul 6, 2025

0.1.2

Jul 4, 2025

Download files

Source Distribution

skimtoken-0.2.0.tar.gz (215.7 kB)

Uploaded Jul 6, 2025 Source

Built Distribution

skimtoken-0.2.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (479.8 kB)

Uploaded Jul 6, 2025 CPython 3.9+, manylinux: glibc 2.17+ x86-64

File details

Details for the file skimtoken-0.2.0.tar.gz.

File metadata

Download URL: skimtoken-0.2.0.tar.gz
Upload date: Jul 6, 2025
Size: 215.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.9.0

File hashes

Hashes for skimtoken-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`30dcc402de0c0bc6e650e6f2f1e515a35bf1b72b20bf9de720b5ae3a573aef0b`
MD5	`d2e4255ccedd00ca8e97ecc35275f685`
BLAKE2b-256	`2acf59b91c9c548b27a25ad1cca272e161e77026d09f013cff5ba232f8621653`

File details

Details for the file skimtoken-0.2.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: skimtoken-0.2.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Jul 6, 2025
Size: 479.8 kB
Tags: CPython 3.9+, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.9.0

File hashes

Hashes for skimtoken-0.2.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`506d28bd320a68dc5d5983cf99d9de8b35c031ae4bcf11bef243e5e8735a70d5`
MD5	`53bbb210c3825fb1f80ddee0c4dd3c54`
BLAKE2b-256	`1190b7ff12098289bcb1507011d4e6b3c5a09ceebc714bf99b3eb0c0c9ef0e45`
