A package to count tokens in input text using OpenAI's tiktoken library.

gptwc: wc for GPT tokens

The wc utility counts words or characters. The gptwc utility works the same way but counts tokens, the units of text that large language models read and generate. A token is usually longer than a single character and shorter than a word, so a token sequence is a more compact representation of text than its characters.

Use gptwc to check how many tokens a piece of text contains, so that you stay under the token limit (e.g. 4097) of your large language model API. gptwc uses OpenAI's tiktoken library for tokenization.
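
The same check can be scripted directly against tiktoken. The following is a minimal sketch, not gptwc's actual implementation: the helper names `count_tokens` and `fits_within_limit` and the `reserve` parameter are hypothetical, and it assumes tiktoken is installed (`pip install tiktoken`). Passing `disallowed_special=()` is a choice made for this sketch so that text containing special-token strings such as `<|endoftext|>` is counted as ordinary text instead of raising an error.

```python
def count_tokens(text: str, model: str = "text-davinci-003") -> int:
    """Return the number of tokens `text` encodes to for `model`."""
    # Imported lazily so the limit check below works without tiktoken installed.
    import tiktoken  # third-party: pip install tiktoken

    encoding = tiktoken.encoding_for_model(model)
    # disallowed_special=() makes special-token strings encode as plain text
    # (an assumption of this sketch; gptwc may behave differently).
    return len(encoding.encode(text, disallowed_special=()))


def fits_within_limit(token_count: int, limit: int = 4097, reserve: int = 0) -> bool:
    """True if `token_count` plus `reserve` extra tokens stays within `limit`."""
    return token_count + reserve <= limit
```

For example, `fits_within_limit(count_tokens(prompt), reserve=256)` checks that a prompt still leaves 256 tokens free, which matters when the API's limit covers the prompt and the reply together.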

Installation

$ pip install gptwc

$ echo "Simple is better than complex." | gptwc
7

Example Usage

$ cat LICENSE | gptwc
257
$ cat LICENSE | wc -c
1059
$ cat LICENSE | wc -w
165


$ curl -s 'https://gist.githubusercontent.com/phillipj/4944029/raw/75ba2243dd5ec2875f629bf5d79f6c1e4b5a8b46/alice_in_wonderland.txt' | wc -w
26470

$ curl -s 'https://gist.githubusercontent.com/phillipj/4944029/raw/75ba2243dd5ec2875f629bf5d79f6c1e4b5a8b46/alice_in_wonderland.txt' | gptwc
40085


$ cat LICENSE | gptwc --model text-davinci-003
257
$ cat LICENSE | gptwc --model gpt-3.5-turbo
201


$ cat README.md | pbcopy
$ gptwc -c
517

Options

usage: gptwc [-h] [--files0-from F] [--model MODEL] [-c] [--version] [FILE ...]

Count tokens in text files using OpenAI's tiktoken library.

positional arguments:
  FILE             Text files to count tokens in

options:
  -h, --help       show this help message and exit
  --files0-from F  Read input from the files specified by NUL-terminated names in file F
  --model MODEL    Model name to use for tokenization (default: text-davinci-003)
  -c, --clipboard  Read input from the system clipboard
  --version        show program's version number and exit
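
The --files0-from option has the same name as the GNU wc option, and the name list it expects is the kind that `find ... -print0` produces. Because a NUL byte cannot occur inside a file name, names containing spaces or newlines survive intact. The sketch below shows one way such a list could be parsed; `read_names0` is a hypothetical helper, and gptwc's own parsing may differ in its details.

```python
import os
import tempfile


def read_names0(path: str) -> list[str]:
    """Return the file names stored as NUL-terminated entries in `path`."""
    with open(path, "rb") as f:
        raw = f.read()
    # Every name ends with b"\0", so splitting leaves one empty trailing
    # piece; drop empty pieces instead of treating them as names.
    return [os.fsdecode(name) for name in raw.split(b"\0") if name]


# Demo: write two NUL-terminated names, then read them back.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"README.md\0my notes.txt\0")
names = read_names0(tmp.name)
os.unlink(tmp.name)
print(names)  # ['README.md', 'my notes.txt']
```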

Which Tokenizer Does Each Model Use?

The mapping below is taken from tiktoken/model.py:

"gpt-4": "cl100k_base",
"gpt-3.5-turbo": "cl100k_base",
"text-embedding-ada-002": "cl100k_base",

"text-davinci-003": "p50k_base",
"text-davinci-002": "p50k_base",
"code-davinci-002": "p50k_base",
"code-davinci-001": "p50k_base",
"code-cushman-002": "p50k_base",
"code-cushman-001": "p50k_base",
"davinci-codex": "p50k_base",
"cushman-codex": "p50k_base",

"text-davinci-001": "r50k_base",
"text-curie-001": "r50k_base",
"text-babbage-001": "r50k_base",
"text-ada-001": "r50k_base",
"davinci": "r50k_base",
"curie": "r50k_base",
"babbage": "r50k_base",
"ada": "r50k_base",
"text-similarity-davinci-001": "r50k_base",
"text-similarity-curie-001": "r50k_base",
"text-similarity-babbage-001": "r50k_base",
"text-similarity-ada-001": "r50k_base",
"text-search-davinci-doc-001": "r50k_base",
"text-search-curie-doc-001": "r50k_base",
"text-search-babbage-doc-001": "r50k_base",
"text-search-ada-doc-001": "r50k_base",
"code-search-babbage-code-001": "r50k_base",
"code-search-ada-code-001": "r50k_base",

"text-davinci-edit-001": "p50k_edit",
"code-davinci-edit-001": "p50k_edit",

"gpt2": "gpt2",
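
This mapping is why the LICENSE example reports 257 tokens with text-davinci-003 but 201 with gpt-3.5-turbo: the two models use different encodings. As a rough illustration, the sketch below copies a handful of entries from the table into a dictionary and compares two models; the dictionary is only a hand-copied subset, and `same_tokenizer` is a hypothetical helper. In tiktoken itself, `tiktoken.encoding_for_model(name)` performs the lookup.

```python
# A few entries copied from the table above (not the full mapping).
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-davinci-003": "p50k_base",
    "text-davinci-001": "r50k_base",
    "text-davinci-edit-001": "p50k_edit",
    "gpt2": "gpt2",
}


def same_tokenizer(model_a: str, model_b: str) -> bool:
    """True if both models map to the same encoding (KeyError if unknown)."""
    return MODEL_TO_ENCODING[model_a] == MODEL_TO_ENCODING[model_b]


print(same_tokenizer("gpt-4", "gpt-3.5-turbo"))             # True
print(same_tokenizer("text-davinci-003", "gpt-3.5-turbo"))  # False
```

Models that share an encoding produce the same token count for the same text, so a count taken with `--model gpt-4` also applies to gpt-3.5-turbo.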
