A package to count tokens for various file types with API integration.
Token Itemize
Token Itemize is a Python package that counts or estimates tokens across text, image, audio, and video files. It integrates with several Large Language Model (LLM) APIs so you can estimate the cost of your AI workloads, and it ships with both a Command-Line Interface (CLI) and a Graphical User Interface (GUI).
Key Features
-
Comprehensive Multi-Format Tokenization:
- Text Files: Uses tiktoken for advanced tokenization, or a simple whitespace split for basic text analysis.
- Images: Decomposes images into 16x16 pixel patches for a granular token estimate.
- Audio Files: Employs spectrogram window estimation for detailed audio analysis, supporting popular formats like .wav, .mp3, and .flac.
- Video Files: Processes video by extracting frames and applying image tokenization techniques for frame-by-frame analysis.
-
Flexible API Integration:
- LLM API Compatibility: Designed to work with both local and cloud-based LLM endpoints.
- Configurable API Settings: Allows easy customization of API parameters such as endpoint URLs, model names, and API keys.
- Broad Provider Support: Supports a range of API providers including:
- Ollama: For local LLMs.
- OpenAI: Including models like GPT-3.5 Turbo and GPT-4 Vision.
- DeepSeek: For efficient and cost-effective models.
- OpenRouter: To access a wide range of models through a single API key.
-
Command-Line Interface (CLI) for Power Users:
- Efficient File Processing: Process individual files and entire folders directly from the command line.
- Prompt Specification: Easily include prompts for API-based tokenization directly in your commands.
- File Exclusion: Utilize regular expressions to exclude specific files or patterns for precise processing.
- Batch Processing: Optimized for handling large directories with batch processing capabilities.
- Versatile Output Formats: Export results in both JSON and CSV formats for easy data handling and reporting.
- Cost Management: Configure the cost rate per 1k tokens to accurately estimate expenses.
- Verbose Logging: Option for detailed logging to track processing and debug issues.
-
Graphical User Interface (GUI) for Ease of Use:
- Drag and Drop Functionality: Simply drag and drop files or folders into the GUI for immediate processing.
- Intuitive Prompt Input: Clear fields to enter prompts and adjust API settings within the interface.
- Real-time Progress Tracking: Built-in progress bar to monitor the tokenization process.
- Simple Result Export: Easily export tokenization results in CSV or JSON format through the GUI.
- Gitignore Integration: Option to automatically apply .gitignore rules to exclude files, streamlining project analysis.
-
Advanced Efficiency Tools:
- Intelligent Caching: Employs a caching system to avoid redundant processing of unchanged files, saving time and resources.
- Parallel Processing: Leverages parallel processing for batch operations to significantly speed up token counting for large datasets.
- Cost Calculation and Estimation: Automatically calculates costs based on token counts, using a default rate of $0.03 per 1k tokens, customizable to your specific pricing.
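The two simplest estimates in the feature list, whitespace splitting for text and 16x16 patches for images, can be sketched with the standard library alone. The function names are hypothetical and are not taken from the package. Counting partial patches at the image edges as full patches is also an assumption, since the description does not say how edges are handled.

```python
import math


def count_whitespace_tokens(text: str) -> int:
    """Count tokens by splitting on runs of whitespace."""
    return len(text.split())


def estimate_image_tokens(width: int, height: int, patch: int = 16) -> int:
    """Estimate image tokens as the number of patch x patch tiles.

    Assumption: partial tiles at the right and bottom edges count as
    full tiles (ceiling division).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return math.ceil(width / patch) * math.ceil(height / patch)


print(count_whitespace_tokens("Token Itemize counts tokens"))  # 4
print(estimate_image_tokens(224, 224))  # 14 * 14 = 196
```

A whitespace count is only a rough proxy; model tokenizers such as tiktoken usually produce a different number for the same text.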
CLI Commands
Token Itemize CLI offers a rich set of options to tailor token counting to your specific needs.
Usage: token-itemize [options]
Available Options:
--gui Launch the graphical user interface.
--prompt PROMPT User prompt for API processing (text prompt).
--file FILE Individual file to process (can be specified multiple times).
--folder FOLDER Folder to process (can be specified multiple times for batch processing).
--exclude REGEX Regular expression pattern to exclude files from processing.
--verbose Enable verbose logging for detailed output.
--api Enable API mode for tokenization and cost estimation.
--provider PROVIDER Specify the API provider (ollama, openai, deepseek, openrouter).
--endpoint URL API endpoint URL for providers like Ollama and DeepSeek.
--model MODEL LLM model name (required in API mode, provider-specific).
--api-key KEY API key for authentication (required for OpenAI, DeepSeek, OpenRouter).
--cost-rate RATE Cost rate per 1k tokens (default: 0.03). Customize for different models or providers.
--batch Enable batch processing for large directories to improve performance.
--output-format FORMAT Output format for results (json or csv, default: json).
--gitignore Enable .gitignore filtering to exclude files.
--version Display package version and exit.
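To make the option semantics concrete, here is a minimal argparse sketch covering a subset of the options above. It is an illustration only and not the package's actual cli.py; the build_parser name and the choice of action="append" for repeatable options are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a subset of the documented options (illustrative)."""
    parser = argparse.ArgumentParser(prog="token-itemize")
    # Repeatable options collect every occurrence into a list.
    parser.add_argument("--file", action="append", default=[])
    parser.add_argument("--folder", action="append", default=[])
    parser.add_argument("--exclude", metavar="REGEX")
    parser.add_argument("--api", action="store_true")
    parser.add_argument(
        "--provider", choices=["ollama", "openai", "deepseek", "openrouter"]
    )
    parser.add_argument("--model")
    parser.add_argument("--cost-rate", type=float, default=0.03)
    parser.add_argument("--output-format", choices=["json", "csv"], default="json")
    return parser


args = build_parser().parse_args(
    ["--file", "a.txt", "--file", "b.txt", "--cost-rate", "0.002"]
)
print(args.file, args.cost_rate, args.output_format)
```

Passing --file twice yields a two-element list, while omitted options fall back to the documented defaults (0.03 for the cost rate, json for the output format).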
Examples:
-
Launch GUI:
token-itemize --gui
-
Process a single file with API:
token-itemize --api --provider openai --model gpt-3.5-turbo --api-key YOUR_OPENAI_API_KEY --prompt "Summarize this document" --file document.txt
-
Process a folder and exclude specific files:
token-itemize --folder project_docs --exclude ".*\.log" --output-format csv --verbose
-
Process a video file and calculate tokens:
token-itemize --file video.mp4
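How the --exclude pattern is applied to paths is not spelled out here. The sketch below assumes the pattern is tested against each file name with re.search; the filter_excluded helper is hypothetical. Under that assumption, an unanchored pattern such as .*\.log also drops names that merely contain .log, so anchoring with $ may be what you want.

```python
import re
from typing import Iterable, List


def filter_excluded(names: Iterable[str], pattern: str) -> List[str]:
    """Return the names that do NOT match the exclusion pattern."""
    compiled = re.compile(pattern)
    return [name for name in names if not compiled.search(name)]


files = ["notes.txt", "debug.log", "build.log.txt", "app.py"]
print(filter_excluded(files, r".*\.log"))  # ['notes.txt', 'app.py']
print(filter_excluded(files, r"\.log$"))   # ['notes.txt', 'build.log.txt', 'app.py']
```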
GUI Usage
For users who prefer a visual approach, Token Itemize provides a Graphical User Interface.
Launching the GUI:
token-itemize --gui
Key GUI Features:
- File and Folder Selection: Use the "Add Files" and "Add Folder" buttons to easily import files or directories for tokenization. Drag and drop functionality is also supported.
- Prompt Input: Enter your text prompts in the designated "Prompt" text area for API-based processing.
- API Settings Configuration: Enable API mode with the checkbox and configure:
- Provider Selection: Choose your API provider from a dropdown (Ollama, OpenAI, DeepSeek, OpenRouter).
- Endpoint URL: Specify the API endpoint if needed (e.g., for local Ollama or DeepSeek instances).
- Model Name: Enter the model identifier for your chosen provider.
- API Key: Input your API key for authentication with services like OpenAI, DeepSeek, and OpenRouter.
- Gitignore Filtering: Check the ".gitignore Filtering" box to automatically apply .gitignore rules to your selected files and folders, excluding any files listed in your .gitignore file.
- Start Token Counting: Click the "Count Tokens" button to begin the tokenization process. A progress bar at the bottom of the window will indicate the current status.
- Results Display: The results table will populate, showing the file name (or "Prompt" for API-only prompts), token counts, and processing details for each item. A "Total" row summarizes the total tokens and estimated cost (if applicable).
- Error Handling: Error messages are displayed in pop-up dialogs for clear communication of issues during processing.
- Export Results: Use the "Export Results" button to save the results table as either a CSV (.csv) or JSON (.json) file.
Installation
Prerequisites
- Python 3.8 or higher is required.
Installation via pip (Recommended)
The easiest way to install Token Itemize is using pip, Python's package installer. Open your terminal and run:
pip install token-itemize
This command will download and install Token Itemize and all necessary dependencies.
Installation from Source (for developers)
If you are a developer or want to contribute to Token Itemize, you can install it from source:
-
Clone the Repository:
git clone https://github.com/BenevolenceMessiah/token-itemize.git
-
Navigate to the Project Directory:
cd token-itemize
-
Install Dependencies and the Package:
pip install .
or if you plan to develop:
pip install -e .
API Client Usage in Python
Token Itemize is not just a CLI and GUI tool; it's also a Python library! You can directly integrate its API client into your Python scripts for programmatic token counting and API interactions.
from token_itemize.api.api_client import get_api_client

# Example: Using OpenAI API Client
api_client_openai = get_api_client(
    provider="openai",
    api_key="YOUR_OPENAI_API_KEY",
    model="gpt-3.5-turbo",
    verbose=True,
)
openai_result = api_client_openai.count_tokens(
    files=["document1.txt", "image.png"],
    prompt="Analyze these files.",
)
print(f"OpenAI Input Tokens: {openai_result['input_tokens']}")
print(f"OpenAI Response: {openai_result['full_response']}")

# Example: Using Ollama API Client (for local models)
api_client_ollama = get_api_client(
    provider="ollama",
    endpoint="http://localhost:11434",  # Default Ollama endpoint
    model="llama2:latest",
    verbose=True,
)
ollama_result = api_client_ollama.count_tokens(
    files=["document2.txt", "audio.wav"],
    prompt="Process these files with Ollama.",
)
print(f"Ollama Input Tokens: {ollama_result['input_tokens']}")
print(f"Ollama Response: {ollama_result['full_response']}")
You can use get_api_client to instantiate clients for "openai", "ollama", "deepseek", and "openrouter", passing in the credentials and settings each provider needs. Refer to token_itemize/api/api_client.py for the class structure and available methods of each API provider.
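Rather than hardcoding an API key in a script, the keyword arguments for get_api_client can be assembled from environment variables. The helper below is a hypothetical sketch: the variable names (for example OPENAI_API_KEY and OLLAMA_ENDPOINT) are assumed conventions, not something the package defines, and only keyword arguments shown in the examples above are used.

```python
import os
from typing import Any, Dict, Mapping, Optional


def client_settings_from_env(
    provider: str, model: str, env: Optional[Mapping[str, str]] = None
) -> Dict[str, Any]:
    """Build keyword arguments for get_api_client without hardcoding secrets."""
    env = os.environ if env is None else env
    settings: Dict[str, Any] = {"provider": provider, "model": model}
    if provider == "ollama":
        # A local Ollama server needs an endpoint rather than a key.
        settings["endpoint"] = env.get("OLLAMA_ENDPOINT", "http://localhost:11434")
    else:
        key_var = f"{provider.upper()}_API_KEY"  # assumed naming convention
        api_key = env.get(key_var)
        if not api_key:
            raise RuntimeError(f"Set {key_var} before creating the {provider} client")
        settings["api_key"] = api_key
    return settings


settings = client_settings_from_env("ollama", "llama2:latest", env={})
print(settings)

# With the package installed, the settings can be passed straight through:
# from token_itemize.api.api_client import get_api_client
# client = get_api_client(**settings)
```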
Configuration
Config File (config.yaml)
Token Itemize supports configuration via a config.yaml file placed in the project root directory. This file allows you to set default API settings, which can be especially useful if you frequently use API mode.
endpoint: "http://localhost:8000/api/tokenize" # Default API endpoint, can be overridden in CLI/GUI
model: "gpt-4-vision" # Default model, also overridable
api_key: "" # Default API key, ensure to set this or pass via CLI/GUI
cost_rate: 0.03 # Default cost rate per 1k tokens
Token Itemize will automatically load these settings when it starts, applying them as defaults for API operations, unless overridden by command-line arguments or GUI inputs.
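The precedence described above (values from config.yaml act as defaults, and command-line or GUI inputs override them) can be sketched as a simple dictionary merge. This is an illustration under stated assumptions, not the package's config.py: resolve_settings is a hypothetical name, and the file settings are shown as an already-parsed dictionary (for example, the result of a YAML loader such as PyYAML's yaml.safe_load).

```python
from typing import Any, Dict, Optional

# Fallback used when neither the config file nor the caller sets a value.
BUILT_IN_DEFAULTS: Dict[str, Any] = {"cost_rate": 0.03}


def resolve_settings(
    file_settings: Dict[str, Any], overrides: Dict[str, Optional[Any]]
) -> Dict[str, Any]:
    """Merge settings: built-in defaults < config.yaml < CLI/GUI overrides.

    An override of None means "not provided" and leaves the lower layer alone.
    """
    merged = dict(BUILT_IN_DEFAULTS)
    merged.update(file_settings)
    merged.update({key: value for key, value in overrides.items() if value is not None})
    return merged


from_file = {"model": "gpt-4-vision", "api_key": "", "cost_rate": 0.03}
from_cli = {"model": "gpt-3.5-turbo", "cost_rate": None}
print(resolve_settings(from_file, from_cli))
```

Here the model from the command line wins, while the cost rate falls back to the value in the file because no override was given.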
Cost Calculation
Token Itemize estimates cost by dividing the total token count by 1,000 and multiplying the result by a configurable cost rate. By default, the cost rate is $0.03 per 1,000 tokens. You can customize this rate to match the pricing of your specific LLM or API provider.
- Configuration via CLI: Use the --cost-rate RATE option when using the command-line interface.
- Configuration via GUI: The cost rate is set as a default within the GUI application but can be adjusted in the backend code if needed.
- Configuration via API Client: When using the Python API client, you can specify the cost_rate during client initialization.
- Configuration via config.yaml: Set a global default cost rate in the config.yaml file.
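Because the rate is quoted per 1,000 tokens, the arithmetic is worth spelling out. The function below is a minimal sketch with a hypothetical name (estimate_cost); it is not the package's own implementation.

```python
def estimate_cost(total_tokens: int, cost_rate: float = 0.03) -> float:
    """Estimated cost in dollars, with cost_rate expressed per 1,000 tokens."""
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    return total_tokens / 1000 * cost_rate


print(f"${estimate_cost(12_500):.4f}")         # $0.3750
print(f"${estimate_cost(12_500, 0.002):.4f}")  # $0.0250
```

At the default rate, 12,500 tokens cost about $0.375; at a rate of $0.002 per 1,000 tokens the same count costs about $0.025.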
Project Structure
token-itemize/
├── token_itemize/                 # Main package directory
│   ├── __init__.py                # Initializes the package
│   ├── api/                       # API related modules
│   │   ├── __init__.py            # Initializes the api submodule
│   │   ├── api_client.py          # API Client classes for different providers (OpenAI, Ollama, DeepSeek, OpenRouter)
│   │   └── conversation_saver.py  # Functionality to save conversation transcripts
│   ├── tokenizers/                # Tokenization logic for different file types
│   │   ├── __init__.py            # Initializes the tokenizers submodule
│   │   ├── audio_tokenizer.py     # Tokenizer for audio files
│   │   ├── image_tokenizer.py     # Tokenizer for image files
│   │   ├── text_tokenizer.py      # Tokenizer for text files
│   │   └── video_tokenizer.py     # Tokenizer for video files
│   ├── utils/                     # Utility functions
│   │   └── cache.py               # Caching mechanisms for token counts
│   ├── cli.py                     # Command-line interface logic
│   ├── config.py                  # Configuration loading and handling
│   ├── gui/                       # Graphical user interface components
│   │   ├── __init__.py            # Initializes the gui submodule
│   │   └── gui_app.py             # PyQt5 GUI application
│   └── main.py                    # Main entry point for the CLI and GUI
├── tests/                         # Test suite
│   ├── __init__.py                # Initializes the tests package
│   ├── test_api_client.py         # Tests for API client functionality
│   ├── test_edge_cases.py         # Tests for edge case scenarios
│   ├── test_text_tokenizer.py     # Tests for text tokenization
│   └── test_video_tokenizer.py    # Tests for video tokenization
├── docs/                          # Documentation files (markdown format)
│   ├── api.md                     # API documentation
│   ├── contributing.md            # Contribution guidelines
│   └── index.md                   # Main documentation index
├── .github/                       # GitHub workflow configurations
│   └── workflows/                 # Workflow definitions
│       └── python-app.yml         # CI/CD workflow for testing
├── .gitignore                     # Specifies intentionally untracked files that Git should ignore
├── LICENSE                        # MIT License file
├── README.md                      # Project README file (this file)
├── MANIFEST.in                    # Lists files to include in package distribution
├── config.yaml                    # Default configuration file
├── pyproject.toml                 # Build system configuration
├── requirements.txt               # Project dependencies
├── setup.py                       # Setup script for packaging
└── token-itemize.egg-info/        # Setuptools egg-info (created during installation)
    └── ...                        # Various metadata files
Development & Testing
Running Tests
To ensure the reliability and correctness of Token Itemize, a comprehensive suite of unit tests is included. To run the tests, navigate to the project's root directory and execute:
python -m unittest discover -s tests
This command will discover and run all tests located in the tests/ directory, verifying the functionality of different components of Token Itemize.
Contributing
We warmly welcome contributions to Token Itemize! Whether you're fixing bugs, adding new features, or improving documentation, your help is valuable. Please see docs/contributing.md for detailed guidelines on how to contribute to the project. Here's a quick start:
- Fork the Repository: Start by forking the Token Itemize repository on GitHub to your own account.
- Create a Feature Branch:
git checkout -b feature/your-feature-name
- Commit Your Changes:
git commit -am 'Add your feature or fix'
- Push to the Branch:
git push origin feature/your-feature-name
- Create a Pull Request: Submit a pull request to the main repository with a clear description of your changes.
Support & Community
For any issues, questions, or feature requests, please open an issue on the GitHub repository. We are committed to providing support and continuously improving Token Itemize.
License
Token Itemize is released under the MIT License, making it free for commercial and personal use. See the LICENSE file for the full license text.
Documentation
For more in-depth information and advanced usage, refer to the full documentation available in the docs directory.
Thank you for choosing Token Itemize! We hope this tool enhances your workflow and simplifies token management and cost estimation for your projects. Happy tokenizing!