gptwc

A package to count tokens in input text using OpenAI's tiktoken library.

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

gptwc: wc for GPT tokens

The wc utility counts lines, words, or characters. The gptwc utility works the same way but counts tokens. Tokens are the units of text that large language models operate on; they are typically smaller than words but larger than single characters.

Use gptwc to check the number of tokens in a string so that you stay under the token limit (e.g., 4097) of your large language model API. Tokenization is done with tiktoken.
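The same check can be sketched in Python. This is a minimal illustration and not gptwc's own code: the helper names count_tokens and fits_limit are hypothetical, and the default path assumes tiktoken is installed and uses its encoding_for_model() and encode() calls. Passing your own encode callable avoids the tiktoken dependency entirely.

```python
from typing import Callable, Optional, Sequence


def count_tokens(
    text: str,
    encode: Optional[Callable[[str], Sequence]] = None,
    model: str = "text-davinci-003",
) -> int:
    """Return the number of tokens in text (hypothetical helper)."""
    if encode is None:
        # Assumption: tiktoken is installed. It is imported lazily so the
        # function also works with a caller-supplied tokenizer.
        import tiktoken

        encode = tiktoken.encoding_for_model(model).encode
    return len(encode(text))


def fits_limit(
    text: str,
    limit: int = 4097,
    encode: Optional[Callable[[str], Sequence]] = None,
) -> bool:
    """True if text tokenizes to at most limit tokens."""
    return count_tokens(text, encode=encode) <= limit
```

With tiktoken available, fits_limit(prompt) answers the same question as comparing the output of gptwc against the limit by hand.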

Installation

$ pip install gptwc

$ echo "Simple is better than complex." | gptwc
7

Example Usage

$ cat LICENSE | gptwc
257
$ cat LICENSE | wc -c
1059
$ cat LICENSE | wc -w
165


$ curl -s 'https://gist.githubusercontent.com/phillipj/4944029/raw/75ba2243dd5ec2875f629bf5d79f6c1e4b5a8b46/alice_in_wonderland.txt' | wc -w
26470

$ curl -s 'https://gist.githubusercontent.com/phillipj/4944029/raw/75ba2243dd5ec2875f629bf5d79f6c1e4b5a8b46/alice_in_wonderland.txt' | gptwc
40085


$ cat LICENSE | gptwc --model text-davinci-003
257
$ cat LICENSE | gptwc --model gpt-3.5-turbo
201


$ cat README.md | pbcopy
$ gptwc -c
517

Options

usage: gptwc [-h] [--files0-from F] [--model MODEL] [-c] [--version] [FILE ...]

Count tokens in text files using OpenAI's tiktoken library.

positional arguments:
  FILE             Text files to count tokens in

options:
  -h, --help       show this help message and exit
  --files0-from F  Read input from the files specified by NUL-terminated names in file F
  --model MODEL    Model name to use for tokenization (default: text-davinci-003)
  -c, --clipboard  Read input from the system clipboard
  --version        show program's version number and exit
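The --files0-from option reads a list of file names separated by NUL bytes, which keeps names containing spaces or newlines intact. As a rough sketch of how such a list can be parsed (the function name parse_files0 is hypothetical, and this is not taken from gptwc's source):

```python
import os
from typing import List


def parse_files0(data: bytes) -> List[str]:
    """Split NUL-terminated file names into a list of paths."""
    names = data.split(b"\0")
    # A trailing NUL produces one empty entry at the end; skip empty names.
    return [os.fsdecode(name) for name in names if name]
```

A list in this format can be produced with, for example, find . -name '*.md' -print0.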

Which Tokenizer Does Each Model Use?

From tiktoken/model.py

"gpt-4": "cl100k_base",
"gpt-3.5-turbo": "cl100k_base",
"text-embedding-ada-002": "cl100k_base",

"text-davinci-003": "p50k_base",
"text-davinci-002": "p50k_base",
"code-davinci-002": "p50k_base",
"code-davinci-001": "p50k_base",
"code-cushman-002": "p50k_base",
"code-cushman-001": "p50k_base",
"davinci-codex": "p50k_base",
"cushman-codex": "p50k_base",

"text-davinci-001": "r50k_base",
"text-curie-001": "r50k_base",
"text-babbage-001": "r50k_base",
"text-ada-001": "r50k_base",
"davinci": "r50k_base",
"curie": "r50k_base",
"babbage": "r50k_base",
"ada": "r50k_base",
"text-similarity-davinci-001": "r50k_base",
"text-similarity-curie-001": "r50k_base",
"text-similarity-babbage-001": "r50k_base",
"text-similarity-ada-001": "r50k_base",
"text-search-davinci-doc-001": "r50k_base",
"text-search-curie-doc-001": "r50k_base",
"text-search-babbage-doc-001": "r50k_base",
"text-search-ada-doc-001": "r50k_base",
"code-search-babbage-code-001": "r50k_base",
"code-search-ada-code-001": "r50k_base",

"text-davinci-edit-001": "p50k_edit",
"code-davinci-edit-001": "p50k_edit",

"gpt2": "gpt2",
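The table above explains why the same LICENSE file gives 257 tokens with text-davinci-003 (p50k_base) and 201 with gpt-3.5-turbo (cl100k_base). A minimal sketch of a lookup over a few of these entries follows; the function name encoding_name_for is hypothetical, the dictionary is only a subset of the table, and the real tiktoken/model.py may resolve names differently.

```python
from typing import Dict

# Subset of the model-to-tokenizer table listed above.
MODEL_TO_ENCODING: Dict[str, str] = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",
    "text-davinci-edit-001": "p50k_edit",
    "gpt2": "gpt2",
}


def encoding_name_for(model: str) -> str:
    """Return the tokenizer name for an exact model name."""
    try:
        return MODEL_TO_ENCODING[model]
    except KeyError:
        # Unknown models are reported instead of silently defaulting.
        raise ValueError(f"no tokenizer known for model {model!r}") from None
```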

Project details


Release history

1.2.6 (this version): Mar 1, 2024
1.2.5: Mar 1, 2024
1.2.4: Mar 17, 2023
1.2.2: Mar 17, 2023
1.2.1: Mar 17, 2023
1.2.0: Mar 17, 2023
1.1.0: Mar 16, 2023
1.0.0: Mar 16, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gptwc-1.2.6.tar.gz (3.9 kB)

Uploaded Mar 1, 2024 Source

Built Distributions

gptwc-1.2.6-py3-none-any.whl (4.4 kB)

Uploaded Mar 1, 2024 Python 3

gptwc-1.2.6-py2.py3-none-any.whl (4.4 kB)

Uploaded Mar 1, 2024 Python 2 Python 3

Hashes for gptwc-1.2.6.tar.gz

Algorithm	Hash digest
SHA256	`9ebcddf4419eb7ed66a804997dbee0f4482fef017e36904b171e539e83a5662c`
MD5	`52b5c66fad28523f15d70d78ea926e32`
BLAKE2b-256	`bd7f01c3441ff7d6c627f33bb41d3268c8671bf1a862b1a363e26e6a2ba287e2`

Hashes for gptwc-1.2.6-py3-none-any.whl

Algorithm	Hash digest
SHA256	`db5bf468da8e6223c3f03938f97f047b6d4bf4bc0c851c35274ea713cccb06bc`
MD5	`6aa182c5b0976d510c06b03cc844da5b`
BLAKE2b-256	`3dbe3484e558febe5c10e585e086205af56d87ef8af12c06bc0f8f105ed80164`

Hashes for gptwc-1.2.6-py2.py3-none-any.whl

Algorithm	Hash digest
SHA256	`d29520030470cfde4c5eb60edf7d2e9fe5b855a4dab4f2f3867c282c8d23626a`
MD5	`670079ca3fb185ef8e621d778a7d1412`
BLAKE2b-256	`6dba352e6a69cb6e51f8e393fba488b19be74f5c43b87f348a10d4614bd30b4a`