
A package to count tokens in input text using OpenAI's tiktoken library.


gptwc: wc for GPT tokens

The wc utility counts words or characters. The gptwc utility works the same way but counts tokens, the units of text that large language models read and write. A token is typically shorter than a word but longer than a single character, so a sequence of tokens represents text more compactly than a sequence of characters.

Use gptwc to check how many tokens a piece of text contains, so that you stay under the token limit (e.g., 4097) of your large language model API. gptwc uses tiktoken for tokenization.
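To illustrate the idea of counting tokens and comparing the count against a limit, here is a minimal Python sketch. It is not gptwc's actual source. It assumes tiktoken is installed, and the names count_tokens and fits_limit are hypothetical. The 4097 default is just the example limit mentioned above.

```python
from typing import Callable


def count_tokens(text: str, model: str = "text-davinci-003") -> int:
    """Count tokens in `text` using the encoding tiktoken associates with `model`."""
    # Imported lazily so the rest of the module loads without tiktoken installed.
    import tiktoken  # third-party: pip install tiktoken

    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def fits_limit(
    text: str,
    limit: int = 4097,  # example limit; check the real limit for your model
    counter: Callable[[str], int] = count_tokens,
) -> bool:
    """Return True if `text` has at most `limit` tokens according to `counter`."""
    return counter(text) <= limit


if __name__ == "__main__":
    sample = "Simple is better than complex."
    print(count_tokens(sample), fits_limit(sample))
```

The counter parameter is only there so that the limit check can be exercised with a stand-in counter when tiktoken is not available.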

Installation

$ pip install gptwc

$ echo "Simple is better than complex." | gptwc
7

Example Usage

$ cat LICENSE | gptwc
257
$ cat LICENSE | wc -c
1059
$ cat LICENSE | wc -w
165


$ curl -s 'https://gist.githubusercontent.com/phillipj/4944029/raw/75ba2243dd5ec2875f629bf5d79f6c1e4b5a8b46/alice_in_wonderland.txt' | wc -w
26470

$ curl -s 'https://gist.githubusercontent.com/phillipj/4944029/raw/75ba2243dd5ec2875f629bf5d79f6c1e4b5a8b46/alice_in_wonderland.txt' | gptwc
40085


$ cat LICENSE | gptwc --model text-davinci-003
257
$ cat LICENSE | gptwc --model gpt-3.5-turbo
201
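The counts in the examples above imply some rough ratios between bytes, words, and tokens. The following sketch only does arithmetic on the numbers already shown. The variable names are made up for illustration, and the ratios apply to these two texts only, not to text in general.

```python
# Counts copied from the example output above.
license_tokens_default = 257   # gptwc (default model, text-davinci-003)
license_tokens_turbo = 201     # gptwc --model gpt-3.5-turbo
license_bytes = 1059           # wc -c
license_words = 165            # wc -w
alice_words = 26470            # wc -w
alice_tokens = 40085           # gptwc (default model)

# How many bytes one token covers on average in LICENSE.
bytes_per_token = license_bytes / license_tokens_default

# How many tokens one word costs on average in each text.
tokens_per_word_license = license_tokens_default / license_words
tokens_per_word_alice = alice_tokens / alice_words

# Fraction of tokens saved on LICENSE by the gpt-3.5-turbo tokenizer.
turbo_saving = 1 - license_tokens_turbo / license_tokens_default

print(f"bytes per token (LICENSE): {bytes_per_token:.2f}")
print(f"tokens per word (LICENSE): {tokens_per_word_license:.2f}")
print(f"tokens per word (Alice):   {tokens_per_word_alice:.2f}")
print(f"token saving with gpt-3.5-turbo: {turbo_saving:.1%}")
```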


$ cat README.md | pbcopy
$ gptwc -c
517

Options

usage: gptwc [-h] [--files0-from F] [--model MODEL] [-c] [--version] [FILE ...]

Count tokens in text files using OpenAI's tiktoken library.

positional arguments:
  FILE             Text files to count tokens in

options:
  -h, --help       show this help message and exit
  --files0-from F  Read input from the files specified by NUL-terminated names in file F
  --model MODEL    Model name to use for tokenization (default: text-davinci-003)
  -c, --clipboard  Read input from the system clipboard
  --version        show program's version number and exit
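For readers who want to see how a command line like this could be declared, here is a sketch using Python's argparse. It is reconstructed from the help text above and is not gptwc's actual source. The function names build_parser and split_nul_names are hypothetical, and the version string is a placeholder.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose options mirror the help text shown above."""
    parser = argparse.ArgumentParser(
        prog="gptwc",
        description="Count tokens in text files using OpenAI's tiktoken library.",
    )
    parser.add_argument("files", metavar="FILE", nargs="*",
                        help="Text files to count tokens in")
    parser.add_argument("--files0-from", metavar="F",
                        help="Read input from the files specified by "
                             "NUL-terminated names in file F")
    parser.add_argument("--model", default="text-davinci-003",
                        help="Model name to use for tokenization "
                             "(default: %(default)s)")
    parser.add_argument("-c", "--clipboard", action="store_true",
                        help="Read input from the system clipboard")
    # Placeholder version string, not the real package version.
    parser.add_argument("--version", action="version",
                        version="%(prog)s 0.0.0")
    return parser


def split_nul_names(data: bytes) -> List[str]:
    """Split the contents of a --files0-from file into file names.

    Names are separated by NUL bytes; empty entries (for example after a
    trailing NUL) are dropped.
    """
    return [name.decode() for name in data.split(b"\0") if name]


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

NUL-terminated names are useful because file names may contain spaces or newlines but never a NUL byte, which is why tools such as `find -print0` emit them.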

Which Tokenizer Does Each Model Use?

From tiktoken/model.py

"gpt-4": "cl100k_base",
"gpt-3.5-turbo": "cl100k_base",
"text-embedding-ada-002": "cl100k_base",

"text-davinci-003": "p50k_base",
"text-davinci-002": "p50k_base",
"code-davinci-002": "p50k_base",
"code-davinci-001": "p50k_base",
"code-cushman-002": "p50k_base",
"code-cushman-001": "p50k_base",
"davinci-codex": "p50k_base",
"cushman-codex": "p50k_base",

"text-davinci-001": "r50k_base",
"text-curie-001": "r50k_base",
"text-babbage-001": "r50k_base",
"text-ada-001": "r50k_base",
"davinci": "r50k_base",
"curie": "r50k_base",
"babbage": "r50k_base",
"ada": "r50k_base",
"text-similarity-davinci-001": "r50k_base",
"text-similarity-curie-001": "r50k_base",
"text-similarity-babbage-001": "r50k_base",
"text-similarity-ada-001": "r50k_base",
"text-search-davinci-doc-001": "r50k_base",
"text-search-curie-doc-001": "r50k_base",
"text-search-babbage-doc-001": "r50k_base",
"text-search-ada-doc-001": "r50k_base",
"code-search-babbage-code-001": "r50k_base",
"code-search-ada-code-001": "r50k_base",

"text-davinci-edit-001": "p50k_edit",
"code-davinci-edit-001": "p50k_edit",

"gpt2": "gpt2",
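One practical consequence of this table is that two models mapped to the same encoding tokenize text identically, so they produce the same count, while models with different encodings may not (compare the 257 and 201 counts for LICENSE above). The sketch below uses a small excerpt of the table. The names MODEL_TO_ENCODING_EXCERPT and same_tokenizer are hypothetical, chosen for this illustration.

```python
# A few entries copied from the table above; not the full mapping.
MODEL_TO_ENCODING_EXCERPT = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-davinci-003": "p50k_base",
    "code-davinci-002": "p50k_base",
    "davinci": "r50k_base",
    "text-davinci-edit-001": "p50k_edit",
    "gpt2": "gpt2",
}


def same_tokenizer(model_a: str, model_b: str) -> bool:
    """Return True if both models map to the same encoding in the excerpt.

    Raises KeyError for a model that is not in the excerpt.
    """
    return MODEL_TO_ENCODING_EXCERPT[model_a] == MODEL_TO_ENCODING_EXCERPT[model_b]


if __name__ == "__main__":
    # Same encoding, so token counts will match.
    print(same_tokenizer("gpt-4", "gpt-3.5-turbo"))
    # Different encodings, so token counts may differ.
    print(same_tokenizer("text-davinci-003", "gpt-3.5-turbo"))
```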
