Production-grade token counter for PDF, TXT, DOCX, MD, and PPTX files using real GPT tokenization via tiktoken

Token Calculator

Python 3.8+ | License: MIT


Features

Real GPT Tokenization - Uses OpenAI's tiktoken library for accurate token counting (not approximations)

Multi-Format Support - Handles PDF, TXT, MD, DOCX, PPTX files seamlessly

🔄 Streaming Architecture - Constant-memory processing for large files (up to the 500MB file size limit)

🚀 Adaptive Concurrency - Automatic CPU detection and parallel batch processing

🔍 Scanned PDF Detection - Automatically identifies whether PDFs are text-based or scanned images (detection only; no OCR is performed)

⏱️ Timeout Enforcement - Prevents hanging on malformed or problematic files

💾 Memory Protected - Soft (512MB) and hard (2048MB) memory limits with enforcement

🎯 Enterprise Ready - Designed for RAG, LLM preprocessing, token budgeting, and bulk analytics
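
The streaming claim above can be illustrated for plain-text files with a minimal standard-library sketch. This is not doctok's implementation: `stream_counts` and `chunk_chars` are hypothetical names, and token counting is left out because it depends on tiktoken. The point is only that one fixed-size chunk is held in memory at a time.

```python
def stream_counts(path, chunk_chars=64 * 1024):
    """Return (char_count, word_count) while reading fixed-size chunks.

    Hypothetical helper for illustration; not part of the doctok API.
    """
    chars = 0
    words = 0
    in_word = False  # True if the previous chunk ended in the middle of a word
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        while True:
            chunk = fh.read(chunk_chars)
            if not chunk:
                break
            chars += len(chunk)
            pieces = chunk.split()
            if not pieces:
                # Chunk was whitespace only, so any pending word has ended.
                in_word = False
                continue
            words += len(pieces)
            # A word split across two chunks was counted once per chunk:
            # undo the double count.
            if in_word and not chunk[0].isspace():
                words -= 1
            in_word = not chunk[-1].isspace()
    return chars, words
```

Memory use is bounded by `chunk_chars` regardless of file size, which is the property the feature list describes.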


Installation

From PyPI

pip install doctok

From source

git clone https://github.com/Pranesh-2005/Token-Calculator.git
cd Token-Calculator
pip install -e .

Quick Start

Command Line

# Count tokens in a single file
doctok document.pdf

# Process an entire directory
doctok ./documents/

# Output results as JSON
doctok document.pdf --output results.json --format json

# Compute SHA256 hashes
doctok document.pdf --hash

# Specify timeout (seconds)
doctok large-file.pdf --timeout 600

# Use multiple workers (default: auto-detected)
doctok ./documents/ --workers 8

Python API

from token_calculator import count_file, count_files_batch

# Single file
result = count_file("document.pdf")
print(f"Tokens: {result.gpt_tokens}")
print(f"Words: {result.word_count}")
print(f"Characters: {result.char_count}")
print(f"Pages: {result.pages}")
print(f"Status: {result.status}")  # 'ok', 'scanned', 'timeout', 'error'

# Batch processing
results = count_files_batch(
    ["doc1.pdf", "doc2.docx", "doc3.txt"],
    max_workers=4,
    compute_hash=True
)
print(f"Total tokens: {results.total_tokens}")
print(f"Total pages: {results.total_pages}")
print(f"Elapsed time: {results.elapsed_sec:.2f}s")
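
The `compute_hash=True` option above reports SHA256 digests. How doctok computes them is not shown in this README; a common way to hash a large file without loading it fully, using only the standard library, might look like the sketch below. `sha256_of_file` and `block_size` are hypothetical names.

```python
import hashlib


def sha256_of_file(path, block_size=1024 * 1024):
    """Hash a file in fixed-size blocks so memory use stays bounded.

    Hypothetical helper for illustration; not part of the doctok API.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b"".
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

A digest computed this way can be compared against a stored value to skip files that have not changed since the last count.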

API Reference

Core Functions

count_file(path: str, timeout_sec: float = 300, compute_hash: bool = False) -> FileResult

Process a single file and return detailed metrics.

Parameters:

  • path (str): Path to the file to process
  • timeout_sec (float): Timeout in seconds (default: 300)
  • compute_hash (bool): Compute SHA256 hash (default: False)

Returns:

  • FileResult object with:
    • gpt_tokens (int): Token count using GPT tokenizer
    • char_count (int): Character count
    • word_count (int): Word count
    • pages (int): Page count
    • extractable_pages (int): Pages with extractable text
    • skipped_pages (int): Pages without text (scanned PDFs)
    • status (str): 'ok', 'scanned', 'timeout', 'error'
    • error_msg (str): Error details if applicable
    • elapsed_sec (float): Processing time

Example:

from token_calculator import count_file

result = count_file("paper.pdf")
if result.status == "ok":
    print(f"{result.gpt_tokens} tokens")
elif result.status == "scanned":
    print("PDF is scanned (no OCR)")
else:
    print(f"Error: {result.error_msg}")
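
To make the `timeout_sec` behaviour concrete, here is a minimal sketch of one way to enforce a time limit on a blocking call with `concurrent.futures`. It is an assumption about the general technique, not doctok's actual mechanism; `run_with_timeout` is a hypothetical helper, and the returned status strings simply mirror the 'ok', 'timeout', and 'error' values documented above.

```python
import concurrent.futures


def run_with_timeout(func, timeout_sec, *args):
    """Run func(*args) in a worker thread and give up after timeout_sec.

    Returns a (status, value) pair. Hypothetical helper for illustration.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    try:
        return "ok", future.result(timeout=timeout_sec)
    except concurrent.futures.TimeoutError:
        # The caller stops waiting, but a thread cannot be killed:
        # the worker keeps running until func returns.
        return "timeout", None
    except Exception as exc:
        return "error", str(exc)
    finally:
        # Do not block on the worker when leaving.
        executor.shutdown(wait=False)
```

Because threads cannot be interrupted, a process-based worker would be needed to reclaim resources from a truly stuck parser; the thread version only guarantees that the caller gets control back.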

count_files_batch(paths: list[str], max_workers: Optional[int] = None, compute_hash: bool = False) -> BatchResult

Process multiple files in parallel.

Parameters:

  • paths (list[str]): List of file paths
  • max_workers (Optional[int]): Max parallel workers (default: None, which auto-detects the CPU count, capped at 8)
  • compute_hash (bool): Compute SHA256 hashes (default: False)

Returns:

  • BatchResult object with:
    • files (list[FileResult]): Results for each file
    • total_tokens (int): Sum of all tokens
    • total_words (int): Sum of all words
    • total_chars (int): Sum of all characters
    • total_pages (int): Sum of all pages
    • file_count (int): Number of files processed
    • error_count (int): Number of failed files
    • elapsed_sec (float): Total elapsed time

Example:

from token_calculator import count_files_batch
import json

results = count_files_batch([
    "doc1.pdf",
    "doc2.docx",
    "doc3.txt"
])

# Export as JSON
output = {
    "total_tokens": results.total_tokens,
    "files": [
        {
            "name": f.filename,
            "tokens": f.gpt_tokens,
            "status": f.status
        }
        for f in results.files
    ]
}
print(json.dumps(output, indent=2))

Exception Handling

from token_calculator import (
    count_file,
    TokenCounterError,
    FileTooLargeError,
    UnsupportedFileError,
    FileTimeoutError,
)

try:
    result = count_file("document.pdf")
except FileTooLargeError:
    print("File exceeds 500MB limit")
except UnsupportedFileError as e:
    print(f"Format not supported: {e}")
except FileTimeoutError:
    print("Processing timed out")
except TokenCounterError as e:
    print(f"Processing error: {e}")
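
The order of the `except` clauses above matters if the specific errors derive from `TokenCounterError`, because Python matches clauses top to bottom. The sketch below demonstrates this with local stand-in classes. The hierarchy is an assumption inferred from the clause ordering, not confirmed by this README, and `status_for` is a hypothetical helper.

```python
# Local stand-ins that mimic the assumed hierarchy; these are NOT the
# classes exported by token_calculator.
class TokenCounterError(Exception):
    """Assumed base class for all counting errors."""


class FileTooLargeError(TokenCounterError):
    pass


class UnsupportedFileError(TokenCounterError):
    pass


class FileTimeoutError(TokenCounterError):
    pass


def status_for(action):
    """Call action() and map any raised error to a short status string."""
    try:
        action()
    except FileTooLargeError:
        return "too_large"
    except UnsupportedFileError:
        return "unsupported"
    except FileTimeoutError:
        return "timeout"
    except TokenCounterError:
        # Base class last: placed first, it would shadow every clause above.
        return "error"
    return "ok"
```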

Supported Formats

Format      Extension  Features
PDF         .pdf       Text extraction, page count, scanned PDF detection
Plain Text  .txt       Encoding auto-detection, streaming processing
Markdown    .md        Standard markdown text extraction
Word        .docx      Paragraph and table extraction
PowerPoint  .pptx      Slide and shape text extraction
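
When building a batch by hand, it can help to filter a directory tree down to the extensions listed above before counting. The sketch below is a standard-library illustration; `SUPPORTED_EXTENSIONS` and `find_supported_files` are hypothetical names, and the case-insensitive match is an assumption about how extensions are treated.

```python
from pathlib import Path

# Extensions taken from the table above.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md", ".docx", ".pptx"}


def find_supported_files(root):
    """Return sorted paths under root whose extension is supported.

    Hypothetical helper for illustration; not part of the doctok API.
    """
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

The resulting list of strings has the same shape as the `paths` argument that `count_files_batch` expects.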

Configuration

Default Limits

Max file size: 500 MB
Per-file timeout: 300 seconds
PDF page timeout: 10 seconds
Soft memory limit: 512 MB
Hard memory limit: 2048 MB
Max workers: CPU count (capped at 8)

To use different limits, install from source and edit the constants in core.py.
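
Two of the defaults above, the file size limit and the worker cap, can be checked before a run. The sketch below is illustrative only: the constant names are hypothetical (they are not the names used in core.py), and treating 500 MB as 500 * 1024 * 1024 bytes is an assumption.

```python
import os

# Hypothetical constants mirroring the documented defaults.
MAX_FILE_SIZE_BYTES = 500 * 1024 * 1024  # assumes 1 MB = 1024 * 1024 bytes
MAX_WORKERS_CAP = 8


def default_worker_count(cpu_count=None):
    """Pick a worker count: detected CPU count, at least 1, at most the cap."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(cpus, MAX_WORKERS_CAP))


def exceeds_size_limit(path, limit_bytes=MAX_FILE_SIZE_BYTES):
    """Return True if the file at path is larger than limit_bytes."""
    return os.path.getsize(path) > limit_bytes
```

A pre-flight check like `exceeds_size_limit` lets a caller skip oversized files up front instead of handling `FileTooLargeError` afterwards.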


Performance

  • Small files (< 10MB): < 1 second
  • Medium files (10-100MB): 2-10 seconds
  • Large files (100-500MB): 30-120 seconds
  • Batch of 100 files: ~2 minutes (with 8 workers)

Memory usage remains constant regardless of file size due to streaming architecture.


Use Cases

  • 📊 RAG Pipeline Optimization - Calculate token costs before ingestion
  • 🧠 LLM Preprocessing - Validate document sizes for model context windows
  • 💰 Token Budgeting - Estimate API costs for document processing
  • 📈 Enterprise Analytics - Batch analyze large document corpora
  • 🔍 Content Indexing - Organize documents by token complexity

Examples

Example 1: Calculate API Cost

from token_calculator import count_file

result = count_file("paper.pdf")

# Pricing: $0.01 per 1K tokens (example)
cost = (result.gpt_tokens / 1000) * 0.01
print(f"Estimated API cost: ${cost:.2f}")

Example 2: Batch Processing with Error Handling

from token_calculator import count_files_batch
from pathlib import Path

pdf_files = list(Path("./documents").glob("*.pdf"))

results = count_files_batch([str(p) for p in pdf_files], max_workers=8)

print(f"Successfully processed: {results.file_count - results.error_count}")
print(f"Failed: {results.error_count}")
print(f"Total tokens: {results.total_tokens:,}")

for file_result in results.files:
    if file_result.status != "ok":
        print(f"⚠️  {file_result.filename}: {file_result.error_msg}")

Example 3: RAG Ingestion Check

from token_calculator import count_file

# Check if document fits in context window
MAX_CONTEXT = 8000  # tokens

result = count_file("document.pdf")

if result.gpt_tokens > MAX_CONTEXT:
    print(f"⚠️  Document is {result.gpt_tokens} tokens (exceeds {MAX_CONTEXT})")
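    # Ceiling division: an exact multiple of MAX_CONTEXT needs no extra chunk
    chunks = (result.gpt_tokens + MAX_CONTEXT - 1) // MAX_CONTEXT
    print(f"Split into {chunks} chunks")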
else:
    print(f"✓ Document fits in context ({result.gpt_tokens} tokens)")
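
Counting chunks is a ceiling division: a document of exactly 16,000 tokens needs 2 chunks of 8,000, not 3. The sketch below goes one step further and lists the token ranges each chunk would cover. `plan_chunks` is a hypothetical helper, and it splits purely by token offset, ignoring sentence or page boundaries.

```python
def plan_chunks(total_tokens, max_context):
    """Return (start, end) token offsets for consecutive chunks.

    Each chunk holds at most max_context tokens; end is exclusive.
    Hypothetical helper for illustration; not part of the doctok API.
    """
    if max_context <= 0:
        raise ValueError("max_context must be positive")
    return [
        (start, min(start + max_context, total_tokens))
        for start in range(0, total_tokens, max_context)
    ]
```

For example, `plan_chunks(16001, 8000)` yields three ranges, the last covering a single token, while `plan_chunks(16000, 8000)` yields two.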

CLI Options

usage: doctok [-h] [-w WORKERS] [-t TIMEOUT] [-o OUTPUT] [--hash] [--format {text,json}] path

positional arguments:
  path                  File or directory to process

optional arguments:
  -h, --help           Show help message
  -w, --workers N      Max parallel workers (default: auto)
  -t, --timeout SEC    Per-file timeout in seconds (default: 300)
  -o, --output FILE    Write JSON results to file
  --hash               Compute SHA256 hash for each file
  --format {text,json} Output format (default: text)
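
For readers who want to wrap or mirror the CLI, the options above map naturally onto `argparse`. The sketch below is reconstructed from the help text only and is not the package's source; `build_parser` is a hypothetical name, the program name follows the `doctok` command used in Quick Start, and the argument types (`int` for workers, `float` for timeout) are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that accepts the documented options (illustrative)."""
    parser = argparse.ArgumentParser(
        prog="doctok",
        description="Count tokens in documents",
    )
    parser.add_argument("path", help="File or directory to process")
    parser.add_argument("-w", "--workers", type=int, default=None, metavar="N",
                        help="Max parallel workers (default: auto)")
    parser.add_argument("-t", "--timeout", type=float, default=300, metavar="SEC",
                        help="Per-file timeout in seconds (default: 300)")
    parser.add_argument("-o", "--output", metavar="FILE",
                        help="Write JSON results to file")
    parser.add_argument("--hash", action="store_true",
                        help="Compute SHA256 hash for each file")
    parser.add_argument("--format", choices=["text", "json"], default="text",
                        help="Output format (default: text)")
    return parser
```

Parsing `["./documents/", "--workers", "8"]` with this parser gives `workers == 8` and leaves every other option at its documented default.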

Logging

import logging
from token_calculator import setup_logging, count_file

# Enable debug logging
setup_logging(logging.DEBUG)

result = count_file("document.pdf")

License

MIT License - see LICENSE file for details


Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/improvement)
  3. Make your changes
  4. Run tests (pytest)
  5. Submit a pull request

Changelog

v1.0.0 (Initial Release)

  • Multi-format document support (PDF, TXT, MD, DOCX, PPTX)
  • Real GPT tokenization via tiktoken
  • Streaming extraction with constant-memory processing
  • Batch processing with adaptive concurrency
  • OCR/scanned PDF detection
  • CLI and Python API
  • Comprehensive error handling
  • Memory protection and timeout enforcement

Roadmap

  • Async API support
  • Web API endpoint
  • Language-specific tokenizer support
  • Streaming file uploads
  • Token cost calculator for multiple LLM providers

Made with ❤️ by Pranesh
