Production-grade token counter for PDF, TXT, DOCX, MD, and PPTX files using real GPT tokenization via tiktoken
Token Calculator
Features • Installation • Quick Start • API • CLI
Features
✨ Real GPT Tokenization - Uses OpenAI's tiktoken library for accurate token counting (not approximations)
⚡ Multi-Format Support - Handles PDF, TXT, MD, DOCX, PPTX files seamlessly
🔄 Streaming Architecture - Constant-memory processing for large files (handles 500MB+ efficiently)
🚀 Adaptive Concurrency - Automatic CPU detection and parallel batch processing
🔍 Scanned PDF Detection - Automatically identifies whether PDFs are text-based or scanned images; scanned pages are reported as skipped, not run through OCR
⏱️ Timeout Enforcement - Prevents hanging on malformed or problematic files
💾 Memory Protected - Soft (512MB) and hard (2048MB) memory limits with enforcement
🎯 Enterprise Ready - Designed for RAG, LLM preprocessing, token budgeting, and bulk analytics
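The streaming idea in the feature list can be illustrated with a small standard-library sketch. It is not the package's implementation: `iter_chunks`, `count_streaming`, and `word_counter` are hypothetical names, and the whitespace word counter is only a stand-in for a real tokenizer. The point is that text is read in bounded pieces that end on whitespace, so the whole file never has to sit in memory.

```python
from typing import Callable, Iterator, TextIO


def iter_chunks(stream: TextIO, chunk_chars: int = 64 * 1024) -> Iterator[str]:
    """Yield pieces of text that end on a whitespace boundary."""
    carry = ""
    while True:
        block = stream.read(chunk_chars)
        if not block:
            break
        block = carry + block
        # Cut at the last whitespace so no word is split across two pieces.
        cut = max(block.rfind(" "), block.rfind("\n"), block.rfind("\t"))
        if cut == -1:
            carry = block  # no boundary seen yet; keep accumulating
            continue
        yield block[: cut + 1]
        carry = block[cut + 1:]
    if carry:
        yield carry


def count_streaming(
    stream: TextIO,
    count_tokens: Callable[[str], int],
    chunk_chars: int = 64 * 1024,
) -> int:
    """Sum per-piece counts; memory use is bounded by the piece size."""
    return sum(count_tokens(piece) for piece in iter_chunks(stream, chunk_chars))


def word_counter(text: str) -> int:
    # Stand-in tokenizer (assumption): counts whitespace-separated words.
    # With tiktoken installed, something like
    #   enc = tiktoken.get_encoding("cl100k_base")
    #   count_tokens = lambda s: len(enc.encode(s))
    # could be passed instead; the encoding name here is an assumption.
    return len(text.split())
```

With a BPE tokenizer, per-piece counts may differ slightly from a whole-text count, because tokens can span the whitespace where a piece was cut. A very long run without whitespace would also grow the carry buffer, so a production version would need an upper bound on it.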
Installation
From PyPI
pip install doctok
From source
git clone https://github.com/Pranesh-2005/Token-Calculator.git
cd Token-Calculator
pip install -e .
Quick Start
Command Line
# Count tokens in a single file
doctok document.pdf
# Process an entire directory
doctok ./documents/
# Output results as JSON
doctok document.pdf --output results.json --format json
# Compute SHA256 hashes
doctok document.pdf --hash
# Specify timeout (seconds)
doctok large-file.pdf --timeout 600
# Use multiple workers (default: auto-detected)
doctok ./documents/ --workers 8
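For readers curious what a `--hash` style option typically involves, here is a minimal sketch of hashing a file in fixed-size blocks with the standard library. `sha256_of_file` and `block_size` are hypothetical names chosen for illustration; how doctok computes its hashes internally is not documented here.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: str, block_size: int = 1024 * 1024) -> str:
    """Return the hex SHA256 digest of a file, reading it block by block."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Reading in blocks keeps memory use flat even for files near the documented 500 MB limit.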
Python API
from token_calculator import count_file, count_files_batch
# Single file
result = count_file("document.pdf")
print(f"Tokens: {result.gpt_tokens}")
print(f"Words: {result.word_count}")
print(f"Characters: {result.char_count}")
print(f"Pages: {result.pages}")
print(f"Status: {result.status}") # 'ok', 'scanned', 'timeout', 'error'
# Batch processing
results = count_files_batch(
    ["doc1.pdf", "doc2.docx", "doc3.txt"],
    max_workers=4,
    compute_hash=True
)
print(f"Total tokens: {results.total_tokens}")
print(f"Total pages: {results.total_pages}")
print(f"Elapsed time: {results.elapsed_sec:.2f}s")
API Reference
Core Functions
count_file(path: str, timeout_sec: float = 300, compute_hash: bool = False) -> FileResult
Process a single file and return detailed metrics.
Parameters:
- path (str): Path to the file to process
- timeout_sec (float): Timeout in seconds (default: 300)
- compute_hash (bool): Compute SHA256 hash (default: False)
Returns:
FileResult object with:
- gpt_tokens (int): Token count using GPT tokenizer
- char_count (int): Character count
- word_count (int): Word count
- pages (int): Page count
- extractable_pages (int): Pages with extractable text
- skipped_pages (int): Pages without text (scanned PDFs)
- status (str): 'ok', 'scanned', 'timeout', 'error'
- error_msg (str): Error details if applicable
- elapsed_sec (float): Processing time
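To make the shape of the return value concrete, the sketch below mirrors the documented fields in a stand-in dataclass and adds a small reporting helper. `FileResultSketch` and `summarize` are hypothetical names; the real `FileResult` class may carry additional attributes or different defaults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileResultSketch:
    """Stand-in with the fields listed above (not the package's class)."""
    gpt_tokens: int = 0
    char_count: int = 0
    word_count: int = 0
    pages: int = 0
    extractable_pages: int = 0
    skipped_pages: int = 0
    status: str = "ok"  # 'ok', 'scanned', 'timeout', 'error'
    error_msg: Optional[str] = None
    elapsed_sec: float = 0.0


def summarize(result) -> str:
    """Turn any object with these fields into a one-line report."""
    if result.status == "ok":
        return f"{result.gpt_tokens} tokens across {result.pages} pages"
    if result.status == "scanned":
        return (
            f"scanned: {result.skipped_pages} of {result.pages} "
            "pages had no extractable text"
        )
    return f"{result.status}: {result.error_msg or 'no details'}"
```

Because `summarize` only reads attributes, it should also work on a real result object, assuming that object exposes the same field names.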
Example:
from token_calculator import count_file
result = count_file("paper.pdf")
if result.status == "ok":
    print(f"{result.gpt_tokens} tokens")
elif result.status == "scanned":
    print("PDF is scanned (no OCR)")
else:
    print(f"Error: {result.error_msg}")
count_files_batch(paths: list[str], max_workers: Optional[int] = None, compute_hash: bool = False) -> BatchResult
Process multiple files in parallel.
Parameters:
- paths (list[str]): List of file paths
- max_workers (int): Max parallel workers (default: auto-detected, max 8)
- compute_hash (bool): Compute SHA256 hashes (default: False)
Returns:
BatchResult object with:
- files (list[FileResult]): Results for each file
- total_tokens (int): Sum of all tokens
- total_words (int): Sum of all words
- total_chars (int): Sum of all characters
- total_pages (int): Sum of all pages
- file_count (int): Number of files processed
- error_count (int): Number of failed files
- elapsed_sec (float): Total elapsed time
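The totals described above are plain sums over the per-file results. The sketch below shows one way to recompute them from any list of result-like objects, which can be handy for aggregating a filtered subset. `aggregate` and `FAILED_STATUSES` are hypothetical names, and treating only 'error' and 'timeout' as failures is an assumption; the package may count failures differently.

```python
from types import SimpleNamespace
from typing import Iterable

# Assumption: which statuses count as failures is not stated in this README.
FAILED_STATUSES = {"error", "timeout"}


def aggregate(files: Iterable) -> dict:
    """Recompute batch totals from objects exposing the per-file fields."""
    files = list(files)
    return {
        "total_tokens": sum(f.gpt_tokens for f in files),
        "total_words": sum(f.word_count for f in files),
        "total_chars": sum(f.char_count for f in files),
        "total_pages": sum(f.pages for f in files),
        "file_count": len(files),
        "error_count": sum(1 for f in files if f.status in FAILED_STATUSES),
    }


# Stand-in results built by hand for illustration.
sample = [
    SimpleNamespace(gpt_tokens=100, word_count=80, char_count=400, pages=1, status="ok"),
    SimpleNamespace(gpt_tokens=0, word_count=0, char_count=0, pages=2, status="timeout"),
]
totals = aggregate(sample)
```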
Example:
from token_calculator import count_files_batch
import json
results = count_files_batch([
    "doc1.pdf",
    "doc2.docx",
    "doc3.txt"
])
# Export as JSON
output = {
    "total_tokens": results.total_tokens,
    "files": [
        {
            "name": f.filename,
            "tokens": f.gpt_tokens,
            "status": f.status
        }
        for f in results.files
    ]
}
print(json.dumps(output, indent=2))
Exception Handling
from token_calculator import (
    count_file,
    TokenCounterError,
    FileTooLargeError,
    UnsupportedFileError,
    FileTimeoutError,
)
try:
    result = count_file("document.pdf")
except FileTooLargeError:
    print("File exceeds 500MB limit")
except UnsupportedFileError as e:
    print(f"Format not supported: {e}")
except FileTimeoutError:
    print("Processing timed out")
except TokenCounterError as e:
    print(f"Processing error: {e}")
Supported Formats
| Format | Extension | Features |
|---|---|---|
| PDF | .pdf | Text extraction, page count, scanned PDF detection |
| Plain Text | .txt | Encoding auto-detection, streaming processing |
| Markdown | .md | Standard markdown text extraction |
| Word | .docx | Paragraph and table extraction |
| PowerPoint | .pptx | Slide and shape text extraction |
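When walking a mixed directory, it can help to filter paths by extension before handing them to the batch API. The sketch below encodes the table above as a lookup; `SUPPORTED_FORMATS`, `detect_format`, and `UnsupportedFormat` are hypothetical names and are not part of the package (which exposes its own `UnsupportedFileError`).

```python
from pathlib import Path

# Extension-to-format mapping taken from the table above.
SUPPORTED_FORMATS = {
    ".pdf": "PDF",
    ".txt": "Plain Text",
    ".md": "Markdown",
    ".docx": "Word",
    ".pptx": "PowerPoint",
}


class UnsupportedFormat(ValueError):
    """Raised by this sketch for extensions outside the table."""


def detect_format(path: str) -> str:
    """Return the format name for a path, matching extensions case-insensitively."""
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED_FORMATS:
        raise UnsupportedFormat(f"{ext or '(no extension)'} is not supported")
    return SUPPORTED_FORMATS[ext]


def is_supported(path: str) -> bool:
    return Path(path).suffix.lower() in SUPPORTED_FORMATS
```

A pre-filter such as `[p for p in paths if is_supported(p)]` keeps unsupported files out of a batch; whether the package itself matches extensions case-insensitively is an assumption.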
Configuration
Default Limits
Max file size: 500 MB
Per-file timeout: 300 seconds
PDF page timeout: 10 seconds
Soft memory limit: 512 MB
Hard memory limit: 2048 MB
Max workers: CPU count (capped at 8)
To use different limits, install the package from source and edit the corresponding constants in core.py.
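As a rough illustration of how limits like these are commonly expressed, the sketch below computes a capped worker count and checks a file size against the 500 MB ceiling. The constant and function names are hypothetical and do not necessarily match those in core.py, and reading "500 MB" as 500 * 1024 * 1024 bytes is an assumption.

```python
import os
from typing import Optional

MAX_FILE_BYTES = 500 * 1024 * 1024  # assumption: MB treated as MiB
MAX_WORKERS_CAP = 8


def default_workers(cpu_count: Optional[int] = None, cap: int = MAX_WORKERS_CAP) -> int:
    """CPU count capped at `cap`, never below 1."""
    if cpu_count is None:
        # os.cpu_count() can return None on some platforms.
        cpu_count = os.cpu_count() or 1
    return max(1, min(cpu_count, cap))


def within_size_limit(size_bytes: int, limit: int = MAX_FILE_BYTES) -> bool:
    """True when a file of this size would be accepted under the limit."""
    return size_bytes <= limit
```

Calling `within_size_limit(os.path.getsize(path))` before processing avoids starting work on a file that would be rejected anyway.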
Performance
- Small files (< 10MB): < 1 second
- Medium files (10-100MB): 2-10 seconds
- Large files (100-500MB): 30-120 seconds
- Batch of 100 files: ~2 minutes (with 8 workers)
Memory usage remains constant regardless of file size due to streaming architecture.
Use Cases
- 📊 RAG Pipeline Optimization - Calculate token costs before ingestion
- 🧠 LLM Preprocessing - Validate document sizes for model context windows
- 💰 Token Budgeting - Estimate API costs for document processing
- 📈 Enterprise Analytics - Batch analyze large document corpora
- 🔍 Content Indexing - Organize documents by token complexity
Examples
Example 1: Calculate API Cost
from token_calculator import count_file
result = count_file("paper.pdf")
# Pricing: $0.01 per 1K tokens (example)
cost = (result.gpt_tokens / 1000) * 0.01
print(f"Estimated API cost: ${cost:.2f}")
Example 2: Batch Processing with Error Handling
from token_calculator import count_files_batch
from pathlib import Path
pdf_files = list(Path("./documents").glob("*.pdf"))
results = count_files_batch([str(p) for p in pdf_files], max_workers=8)
print(f"Successfully processed: {results.file_count - results.error_count}")
print(f"Failed: {results.error_count}")
print(f"Total tokens: {results.total_tokens:,}")
for file_result in results.files:
    if file_result.status != "ok":
        print(f"⚠️ {file_result.filename}: {file_result.error_msg}")
Example 3: RAG Ingestion Check
from token_calculator import count_file
# Check if document fits in context window
MAX_CONTEXT = 8000 # tokens
result = count_file("document.pdf")
if result.gpt_tokens > MAX_CONTEXT:
    # Ceiling division: an exact multiple of MAX_CONTEXT needs no extra chunk
    chunks = -(-result.gpt_tokens // MAX_CONTEXT)
    print(f"⚠️ Document is {result.gpt_tokens} tokens (exceeds {MAX_CONTEXT})")
    print(f"Split into {chunks} chunks")
else:
    print(f"✓ Document fits in context ({result.gpt_tokens} tokens)")
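Beyond a chunk count, a RAG pipeline usually needs the actual token ranges. The following sketch plans half-open `(start, end)` ranges for a given total, with an optional overlap between neighbouring chunks. `plan_chunks` and its `overlap` parameter are hypothetical and not part of the package; the function only does arithmetic on a token count and does not split text.

```python
from typing import List, Tuple


def plan_chunks(total_tokens: int, max_context: int, overlap: int = 0) -> List[Tuple[int, int]]:
    """Return half-open (start, end) token ranges covering total_tokens."""
    if max_context <= 0:
        raise ValueError("max_context must be positive")
    if not 0 <= overlap < max_context:
        raise ValueError("overlap must be in [0, max_context)")
    step = max_context - overlap  # how far each new chunk advances
    ranges: List[Tuple[int, int]] = []
    start = 0
    while start < total_tokens:
        end = min(start + max_context, total_tokens)
        ranges.append((start, end))
        if end == total_tokens:
            break  # last chunk reached the end of the document
        start += step
    return ranges


# Example: a 20,000-token document with an 8,000-token window.
ranges = plan_chunks(20000, 8000)
```

Without overlap, the number of ranges equals the ceiling of `total_tokens / max_context`, so a document of exactly 16,000 tokens yields two chunks rather than three.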
CLI Options
usage: token_calculator [-h] [-w WORKERS] [-t TIMEOUT] [-o OUTPUT] [--hash] [--format {text,json}] path
positional arguments:
  path                  File or directory to process
optional arguments:
  -h, --help            Show help message
  -w, --workers N       Max parallel workers (default: auto)
  -t, --timeout SEC     Per-file timeout in seconds (default: 300)
  -o, --output FILE     Write JSON results to file
  --hash                Compute SHA256 hash for each file
  --format {text,json}  Output format (default: text)
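For anyone wrapping the tool in their own script, the option set above maps naturally onto `argparse`. The sketch below is a reconstruction from the help text, not the package's actual parser; `build_parser` is a hypothetical name, and using `doctok` as the program name is an assumption based on the Quick Start commands.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a parser with the same options as the help text above."""
    parser = argparse.ArgumentParser(
        prog="doctok",  # assumption: matches the command used in Quick Start
        description="Count tokens in documents",
    )
    parser.add_argument("path", help="File or directory to process")
    parser.add_argument("-w", "--workers", type=int, default=None, metavar="N",
                        help="Max parallel workers (default: auto)")
    parser.add_argument("-t", "--timeout", type=float, default=300.0, metavar="SEC",
                        help="Per-file timeout in seconds (default: 300)")
    parser.add_argument("-o", "--output", metavar="FILE",
                        help="Write JSON results to file")
    parser.add_argument("--hash", action="store_true",
                        help="Compute SHA256 hash for each file")
    parser.add_argument("--format", choices=["text", "json"], default="text",
                        help="Output format (default: text)")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["./documents/", "-w", "8", "--hash", "--format", "json"])
```

Leaving `--workers` at `None` lets the caller fall back to auto-detection, mirroring the documented default.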
Logging
import logging
from token_calculator import setup_logging, count_file
# Enable debug logging
setup_logging(logging.DEBUG)
result = count_file("document.pdf")
License
MIT License - see LICENSE file for details
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/improvement)
- Make your changes
- Run tests (pytest)
- Submit a pull request
Changelog
v1.0.0 (Initial Release)
- Multi-format document support (PDF, TXT, MD, DOCX, PPTX)
- Real GPT tokenization via tiktoken
- Streaming extraction with constant-memory processing
- Batch processing with adaptive concurrency
- OCR/scanned PDF detection
- CLI and Python API
- Comprehensive error handling
- Memory protection and timeout enforcement
Roadmap
- Async API support
- Web API endpoint
- Language-specific tokenizer support
- Streaming file uploads
- Token cost calculator for multiple LLM providers
Made with ❤️ by Pranesh
File details
Details for the file doctok-1.0.0.tar.gz.
File metadata
- Download URL: doctok-1.0.0.tar.gz
- Upload date:
- Size: 12.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 04f71639cb281a2d9dfb4fa7ffb9d0567dc6c46146f50a1c5dffd1d77d16b456 |
| MD5 | ee31b538b0dcc77935ed249ae9b22003 |
| BLAKE2b-256 | 7164a8a91ba8c34c807b40f6c70f36c444ee73f6ef1c6a51dee149f96751bcd6 |
File details
Details for the file doctok-1.0.0-py3-none-any.whl.
File metadata
- Download URL: doctok-1.0.0-py3-none-any.whl
- Upload date:
- Size: 11.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e702a231dcc9386135e10f2cd1a04a8b6ce05f28b774b259d268710a53b454d3 |
| MD5 | d779ad848e2567ca4bd02e1b1bd4f467 |
| BLAKE2b-256 | d7d0ddb29bc778f6747d60ba1dbc30edb7a68e8ebcad23efd9131cfae74a6d57 |