Token Calculator

Production-grade token counter for PDF, TXT, DOCX, MD, and PPTX files using real GPT tokenization via tiktoken

Features

✨ Real GPT Tokenization - Uses OpenAI's tiktoken library for accurate token counting (not approximations)

⚡ Multi-Format Support - Handles PDF, TXT, MD, DOCX, PPTX files seamlessly

🔄 Streaming Architecture - Constant-memory processing for large files (up to the 500 MB file-size limit)

🚀 Adaptive Concurrency - Automatic CPU detection and parallel batch processing

🔍 Scanned PDF Detection - Automatically identifies whether PDFs are text-based or scanned images (detection only; no OCR is performed)

⏱️ Timeout Enforcement - Prevents hanging on malformed or problematic files

💾 Memory Protected - Soft (512MB) and hard (2048MB) memory limits with enforcement

🎯 Enterprise Ready - Designed for RAG, LLM preprocessing, token budgeting, and bulk analytics
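
The constant-memory claim in the feature list can be illustrated with a small standard-library sketch. It is not taken from the project's source: `stream_counts` and its `chunk_size` parameter are hypothetical names, and the code only shows the general idea of counting characters and words one chunk at a time, including words that straddle a chunk boundary.

```python
import io
from typing import TextIO, Tuple


def stream_counts(stream: TextIO, chunk_size: int = 64 * 1024) -> Tuple[int, int]:
    """Return (char_count, word_count) while holding only one chunk in memory."""
    chars = 0
    words = 0
    in_word = False  # carries over between chunks so split words count once
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chars += len(chunk)
        for ch in chunk:
            if ch.isspace():
                in_word = False
            elif not in_word:
                # First non-space character after whitespace starts a new word.
                in_word = True
                words += 1
    return chars, words


# A tiny chunk size forces words to be split across reads.
sample = io.StringIO("streaming keeps memory flat\nacross chunks")
demo_counts = stream_counts(sample, chunk_size=5)
print(demo_counts)
```

Peak memory here depends on `chunk_size` rather than on the size of the input, which is the property the feature list describes.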

Installation

From PyPI

pip install doctok

From source

git clone https://github.com/Pranesh-2005/Token-Calculator.git
cd Token-Calculator
pip install -e .

Quick Start

Command Line

# Count tokens in a single file
doctok document.pdf

# Process an entire directory
doctok ./documents/

# Output results as JSON
doctok document.pdf --output results.json --format json

# Compute SHA256 hashes
doctok document.pdf --hash

# Specify timeout (seconds)
doctok large-file.pdf --timeout 600

# Use multiple workers (default: auto-detected)
doctok ./documents/ --workers 8
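
The `--hash` flag reports a SHA256 digest per file. How doctok computes it is not shown in this README; a common constant-memory approach, sketched here under that assumption, feeds the file to `hashlib` in fixed-size blocks. `sha256_of_file` is a hypothetical helper, not part of the package.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file block by block so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Demo on a throwaway file; the small chunk size exercises multiple reads.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name
try:
    demo_digest = sha256_of_file(demo_path, chunk_size=2)
finally:
    os.remove(demo_path)
print(demo_digest)
```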

Python API

from token_calculator import count_file, count_files_batch

# Single file
result = count_file("document.pdf")
print(f"Tokens: {result.gpt_tokens}")
print(f"Words: {result.word_count}")
print(f"Characters: {result.char_count}")
print(f"Pages: {result.pages}")
print(f"Status: {result.status}")  # 'ok', 'scanned', 'timeout', 'error'

# Batch processing
results = count_files_batch(
    ["doc1.pdf", "doc2.docx", "doc3.txt"],
    max_workers=4,
    compute_hash=True
)
print(f"Total tokens: {results.total_tokens}")
print(f"Total pages: {results.total_pages}")
print(f"Elapsed time: {results.elapsed_sec:.2f}s")

API Reference

Core Functions

`count_file(path: str, timeout_sec: float = 300, compute_hash: bool = False) -> FileResult`

Process a single file and return detailed metrics.

Parameters:

path (str): Path to the file to process
timeout_sec (float): Timeout in seconds (default: 300)
compute_hash (bool): Compute SHA256 hash (default: False)

Returns:

FileResult object with:
- gpt_tokens (int): Token count using GPT tokenizer
- char_count (int): Character count
- word_count (int): Word count
- pages (int): Page count
- extractable_pages (int): Pages with extractable text
- skipped_pages (int): Pages without text (scanned PDFs)
- status (str): 'ok', 'scanned', 'timeout', 'error'
- error_msg (str): Error details if applicable
- elapsed_sec (float): Processing time
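
For unit-testing code that consumes these results without processing real documents, a stand-in object with the same fields can help. The sketch below is an assumption-laden mirror of the field list above: `FakeFileResult` and `describe` are hypothetical names, and the default values (for example an empty `error_msg`) are guesses rather than the package's documented behavior.

```python
from dataclasses import dataclass


@dataclass
class FakeFileResult:
    """Hypothetical stand-in mirroring the documented FileResult fields."""
    gpt_tokens: int = 0
    char_count: int = 0
    word_count: int = 0
    pages: int = 0
    extractable_pages: int = 0
    skipped_pages: int = 0
    status: str = "ok"
    error_msg: str = ""  # assumed default; the real value may differ
    elapsed_sec: float = 0.0


def describe(result) -> str:
    """Turn any result-like object into a one-line summary per status."""
    if result.status == "ok":
        return f"{result.gpt_tokens} tokens across {result.pages} pages"
    if result.status == "scanned":
        return f"scanned PDF: {result.skipped_pages} of {result.pages} pages had no text"
    if result.status == "timeout":
        return f"timed out after {result.elapsed_sec:.1f}s"
    return f"error: {result.error_msg}"


print(describe(FakeFileResult(gpt_tokens=120, pages=3)))
print(describe(FakeFileResult(status="scanned", pages=10, skipped_pages=10)))
```

Because `describe` only reads attributes, it should work equally on a real result object, provided the attribute names match the list above.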

Example:

from token_calculator import count_file

result = count_file("paper.pdf")
if result.status == "ok":
    print(f"{result.gpt_tokens} tokens")
elif result.status == "scanned":
    print("PDF is scanned (no OCR)")
else:
    print(f"Error: {result.error_msg}")

`count_files_batch(paths: list[str], max_workers: Optional[int] = None, compute_hash: bool = False) -> BatchResult`

Process multiple files in parallel.

Parameters:

paths (list[str]): List of file paths
max_workers (Optional[int]): Max parallel workers (default: None, which auto-detects the CPU count, capped at 8)
compute_hash (bool): Compute SHA256 hashes (default: False)

Returns:

BatchResult object with:
- files (list[FileResult]): Results for each file
- total_tokens (int): Sum of all tokens
- total_words (int): Sum of all words
- total_chars (int): Sum of all characters
- total_pages (int): Sum of all pages
- file_count (int): Number of files processed
- error_count (int): Number of failed files
- elapsed_sec (float): Total elapsed time
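
The batch totals are described as sums over the per-file results. A minimal sketch of that aggregation follows; it is not the package's implementation. `summarize` and `FAILED_STATUSES` are hypothetical, and which statuses count toward `error_count` is an assumption.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable

# Assumption: only these statuses count as failures; "scanned" does not.
FAILED_STATUSES = {"error", "timeout"}


def summarize(files: Iterable[Any]) -> Dict[str, int]:
    """Aggregate per-file metrics into batch-style totals."""
    totals = {
        "total_tokens": 0, "total_words": 0, "total_chars": 0,
        "total_pages": 0, "file_count": 0, "error_count": 0,
    }
    for item in files:
        totals["file_count"] += 1
        totals["total_tokens"] += item.gpt_tokens
        totals["total_words"] += item.word_count
        totals["total_chars"] += item.char_count
        totals["total_pages"] += item.pages
        if item.status in FAILED_STATUSES:
            totals["error_count"] += 1
    return totals


# SimpleNamespace objects stand in for real per-file results.
demo_files = [
    SimpleNamespace(gpt_tokens=1200, word_count=900, char_count=5400, pages=3, status="ok"),
    SimpleNamespace(gpt_tokens=0, word_count=0, char_count=0, pages=10, status="timeout"),
]
demo_totals = summarize(demo_files)
print(demo_totals)
```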

Example:

from token_calculator import count_files_batch
import json

results = count_files_batch([
    "doc1.pdf",
    "doc2.docx",
    "doc3.txt"
])

# Export as JSON
output = {
    "total_tokens": results.total_tokens,
    "files": [
        {
            "name": f.filename,
            "tokens": f.gpt_tokens,
            "status": f.status
        }
        for f in results.files
    ]
}
print(json.dumps(output, indent=2))

Exception Handling

from token_calculator import (
    count_file,
    TokenCounterError,
    FileTooLargeError,
    UnsupportedFileError,
    FileTimeoutError,
)

try:
    result = count_file("document.pdf")
except FileTooLargeError:
    print("File exceeds 500MB limit")
except UnsupportedFileError as e:
    print(f"Format not supported: {e}")
except FileTimeoutError:
    print("Processing timed out")
except TokenCounterError as e:
    print(f"Processing error: {e}")

Supported Formats

| Format | Extension | Features |
| --- | --- | --- |
| PDF | `.pdf` | Text extraction, page count, scanned PDF detection |
| Plain Text | `.txt` | Encoding auto-detection, streaming processing |
| Markdown | `.md` | Standard markdown text extraction |
| Word | `.docx` | Paragraph and table extraction |
| PowerPoint | `.pptx` | Slide and shape text extraction |
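
When building a file list for batch processing, it can help to filter by the extensions in the table first. The sketch below uses only `pathlib`; `find_supported` and `SUPPORTED_EXTENSIONS` are hypothetical names, and the case-insensitive match is an assumption about how extensions should be treated.

```python
import tempfile
from pathlib import Path
from typing import List

# Extensions taken from the Supported Formats table.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md", ".docx", ".pptx"}


def find_supported(root: str) -> List[Path]:
    """Recursively collect files whose extension appears in the table."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


# Demo in a temporary directory with one unsupported file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "notes.MD", "photo.jpg", "nested/deck.pptx"):
        target = Path(tmp, name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("demo")
    found_names = [p.name for p in find_supported(tmp)]
print(found_names)
```

The resulting paths can be converted with `str(p)` before being passed to a function that expects string paths.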

Configuration

Default Limits

Max file size: 500 MB
Global timeout: 300 seconds
PDF page timeout: 10 seconds
Soft memory limit: 512 MB
Hard memory limit: 2048 MB
Max workers: CPU count (capped at 8)

To use different limits, work from a local source checkout and modify the constants in core.py.
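
Two of the limits above are easy to mirror in calling code, for example to pre-check inputs before handing them to the library. This is a sketch under stated assumptions: the constant and function names are hypothetical (they are not the names used in core.py), and "500 MB" is interpreted as 500 × 1024 × 1024 bytes.

```python
import os
from typing import Optional

MAX_FILE_SIZE_BYTES = 500 * 1024 * 1024  # documented 500 MB cap (binary MB assumed)
MAX_WORKERS_CAP = 8  # documented worker cap


def pick_worker_count(requested: Optional[int] = None) -> int:
    """Use the requested count if given, else the CPU count, clamped to 1..8."""
    base = requested if requested is not None else (os.cpu_count() or 1)
    return max(1, min(base, MAX_WORKERS_CAP))


def within_size_limit(size_bytes: int) -> bool:
    """True when a file of this size is at or under the documented cap."""
    return size_bytes <= MAX_FILE_SIZE_BYTES


print(pick_worker_count())      # depends on the machine, always between 1 and 8
print(pick_worker_count(16))    # clamped to the cap
print(within_size_limit(10 * 1024 * 1024))
```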

Performance

Small files (< 10MB): < 1 second
Medium files (10-100MB): 2-10 seconds
Large files (100-500MB): 30-120 seconds
Batch of 100 files: ~2 minutes (with 8 workers)

Memory usage remains constant regardless of file size due to streaming architecture.

Use Cases

📊 RAG Pipeline Optimization - Calculate token costs before ingestion
🧠 LLM Preprocessing - Validate document sizes for model context windows
💰 Token Budgeting - Estimate API costs for document processing
📈 Enterprise Analytics - Batch analyze large document corpora
🔍 Content Indexing - Organize documents by token complexity

Examples

Example 1: Calculate API Cost

from token_calculator import count_file

result = count_file("paper.pdf")

# Pricing: $0.01 per 1K tokens (example)
cost = (result.gpt_tokens / 1000) * 0.01
print(f"Estimated API cost: ${cost:.4f}")  # four decimals so small documents do not round to $0.00

Example 2: Batch Processing with Error Handling

from token_calculator import count_files_batch
from pathlib import Path

pdf_files = list(Path("./documents").glob("*.pdf"))

results = count_files_batch([str(p) for p in pdf_files], max_workers=8)

print(f"Successfully processed: {results.file_count - results.error_count}")
print(f"Failed: {results.error_count}")
print(f"Total tokens: {results.total_tokens:,}")

for file_result in results.files:
    if file_result.status != "ok":
        print(f"⚠️  {file_result.filename}: {file_result.error_msg}")

Example 3: RAG Ingestion Check

from token_calculator import count_file

# Check if document fits in context window
MAX_CONTEXT = 8000  # tokens

result = count_file("document.pdf")

if result.gpt_tokens > MAX_CONTEXT:
    print(f"⚠️  Document is {result.gpt_tokens} tokens (exceeds {MAX_CONTEXT})")
    # Ceiling division: an exact multiple of MAX_CONTEXT needs no extra chunk
    print(f"Split into {(result.gpt_tokens + MAX_CONTEXT - 1) // MAX_CONTEXT} chunks")
else:
    print(f"✓ Document fits in context ({result.gpt_tokens} tokens)")
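
Once a document is known to exceed the context window, the next step is usually to plan chunk boundaries. The following sketch works on token counts alone; `plan_chunks` is a hypothetical helper and it ignores practical concerns such as chunk overlap or splitting on sentence boundaries.

```python
from typing import List, Tuple


def plan_chunks(total_tokens: int, max_context: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) token ranges, each at most max_context long."""
    if max_context <= 0:
        raise ValueError("max_context must be positive")
    return [
        (start, min(start + max_context, total_tokens))
        for start in range(0, total_tokens, max_context)
    ]


demo_plan = plan_chunks(20000, 8000)
print(demo_plan)
print(len(plan_chunks(16000, 8000)))  # an exact multiple yields exactly two chunks
```

The number of ranges equals the ceiling of `total_tokens / max_context`, so a document of exactly 16,000 tokens with an 8,000-token window yields two chunks rather than three.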

CLI Options

usage: token_calculator [-h] [-w WORKERS] [-t TIMEOUT] [-o OUTPUT] [--hash] [--format {text,json}] path

positional arguments:
  path                  File or directory to process

optional arguments:
  -h, --help           Show help message
  -w, --workers N      Max parallel workers (default: auto)
  -t, --timeout SEC    Per-file timeout in seconds (default: 300)
  -o, --output FILE    Write JSON results to file
  --hash               Compute SHA256 hash for each file
  --format {text,json} Output format (default: text)
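
For readers wrapping the tool in their own scripts, the option set above can be reproduced with `argparse`. This is a sketch of how such a parser could be declared, not the project's actual CLI code; `build_parser` is a hypothetical name, and the argument types (`int` for workers, `float` for timeout) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a parser matching the documented usage line."""
    parser = argparse.ArgumentParser(prog="token_calculator")
    parser.add_argument("path", help="File or directory to process")
    parser.add_argument("-w", "--workers", type=int, default=None, metavar="N",
                        help="Max parallel workers (default: auto)")
    parser.add_argument("-t", "--timeout", type=float, default=300, metavar="SEC",
                        help="Per-file timeout in seconds (default: 300)")
    parser.add_argument("-o", "--output", metavar="FILE",
                        help="Write JSON results to file")
    parser.add_argument("--hash", action="store_true",
                        help="Compute SHA256 hash for each file")
    parser.add_argument("--format", choices=["text", "json"], default="text",
                        help="Output format (default: text)")
    return parser


# Parse a sample command line instead of sys.argv.
args = build_parser().parse_args(["./documents/", "-w", "4", "--format", "json", "--hash"])
print(args.path, args.workers, args.format, args.hash, args.timeout)
```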

Logging

import logging
from token_calculator import setup_logging, count_file

# Enable debug logging
setup_logging(logging.DEBUG)

result = count_file("document.pdf")

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (git checkout -b feature/improvement)
3. Make your changes
4. Run tests (pytest)
5. Submit a pull request

Support

📖 GitHub Issues
💬 GitHub Discussions

Changelog

v1.0.0 (Initial Release)

Multi-format document support (PDF, TXT, MD, DOCX, PPTX)
Real GPT tokenization via tiktoken
Streaming extraction with constant-memory processing
Batch processing with adaptive concurrency
OCR/scanned PDF detection
CLI and Python API
Comprehensive error handling
Memory protection and timeout enforcement

Roadmap

Async API support
Web API endpoint
Language-specific tokenizer support
Streaming file uploads
Token cost calculator for multiple LLM providers

Made with ❤️ by Pranesh

Release history

1.0.2 (this version): Jun 27, 2026
1.0.1: May 29, 2026
1.0.0: May 28, 2026

Download files

Source Distribution: doctok-1.0.2.tar.gz (12.7 kB), uploaded Jun 27, 2026
Built Distribution: doctok-1.0.2-py3-none-any.whl (12.3 kB), uploaded Jun 27, 2026, Python 3

File details

Details for the file doctok-1.0.2.tar.gz.

File metadata

Download URL: doctok-1.0.2.tar.gz
Upload date: Jun 27, 2026
Size: 12.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for doctok-1.0.2.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `bb49f8a5bd0fe4c871859bd04cd59f8be77012d1942016a97e9c5c451fdd4e5a` |
| MD5 | `2050febbf9dea65c2ea85c9bac094f9e` |
| BLAKE2b-256 | `bb5eca9174d3cf06e921e8fb72dfb721e82336948a845527c5c760fd9c2af67b` |

File details

Details for the file doctok-1.0.2-py3-none-any.whl.

File metadata

Download URL: doctok-1.0.2-py3-none-any.whl
Upload date: Jun 27, 2026
Size: 12.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for doctok-1.0.2-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `2b87e2a6c0bbc905892cad5fc639152620c07209e3744e7f2c53d68c594e6137` |
| MD5 | `5f4288acdd6bfbc64909a72e483450bc` |
| BLAKE2b-256 | `a2ed7628b2bdf9dfeb2e80d32ac12a14da8aad177321eaf97f305bd5b8a98ebb` |
