Skip to main content

Static Hash-Based Lookup for Google Ngram Frequencies

Project description

gngram-lookup

PyPI version Downloads Downloads/Month Tests Python 3.9+ License: MIT

How common is this word? O(1) answer. 500 years of books. 5 million words.

Word frequency, commonness scores, and part-of-speech tags derived from the Google Books Ngram corpus, the largest longitudinal word-frequency dataset ever compiled.

Quick Start

pip install gngram-lookup
python -m gngram_lookup.download_data       # frequency data, ~110 MB
python -m gngram_lookup.download_pos_data   # POS tag data, optional
import gngram_lookup as ng

# Does this word exist in 500 years of print?
ng.exists('computer')        # True
ng.exists('xyznotaword')     # False

# How common is it? (1=most common, 100=least common)
ng.word_score('the')         # 1
ng.word_score('computer')    # 18
ng.word_score('rucksack')    # 58
ng.word_score('xyznotaword') # None

# Full frequency data
ng.frequency('computer')
# {'peak_tf': 2000, 'peak_df': 2000, 'sum_tf': 892451, 'sum_df': 312876}

# Part-of-speech tags
ng.pos('fast')               # ['ADJ', 'ADV', 'VERB']
ng.has_pos('sing', ng.PosTag.VERB)  # True

Features

  • O(1) Lookups - Hash-bucketed parquet files, no full scan
  • 5 Million Words - Broadest vocabulary coverage of any static lookup package
  • 500 Years of Data - 1500s through 2000s, decade-by-decade
  • Word Score - 1–100 commonness scale, log-normalized against the corpus
  • Peak Decade - Know when a word was most used in print
  • POS Tags - Part-of-speech tags from Google's own tag set
  • Batch Lookup - Efficient multi-word queries grouped by hash prefix
  • Possessive & Hyphen Fallback - ship'sship, north-westnorth
  • Unicode Normalization - Smart quotes, accents, and Unicode apostrophes handled

Word Score Zones

Words are scored 1–100 based on their total corpus frequency, log-normalized against the (the most frequent word). The scale compresses the Zipfian spike at the top and gives meaningful resolution across the rest of the vocabulary.

gngram-lookup Word Score Zones

Score Zone Description Examples
1–5 Zipf Function words dominating millions of books the, of, and, to, a
6–20 Core Everyday words known by all speakers computer, walk, beautiful
21–40 Literary Vocabulary of books, not everyday speech algorithm, philosophy, synthesis
41–60 Specialized Domain-specific but attested rucksack, carbonate, heliotrope
61–80 Rare Infrequent but legitimate arcane technical and literary terms
81–100 Long Tail Extremely low frequency niche, archaic, or highly specialized
import gngram_lookup as ng

# Score a word
ng.word_score('the')          # 1   (Zipfian summit)
ng.word_score('computer')     # 18  (Core vocabulary)
ng.word_score('algorithm')    # 40  (Educated vocabulary)
ng.word_score('rucksack')     # 58  (Specialized vocabulary)

# Filter to common words only
def is_common(word):
    score = ng.word_score(word)
    return score is not None and score <= 30

The Problem This Solves

In NLP, you frequently need to know not just whether a token is a word, but how common it is.

Not "what does it mean?" Not "what's its etymology?" Just: is this a word real people actually write, and how often?

No other static lookup package gives you this from a corpus spanning 500 years of print at 5 million word coverage with O(1) access.

Why Google Books Ngrams?

The Google Books Ngram corpus isn't a dictionary (too narrow). It isn't a web crawl (too noisy). It isn't limited to a single decade or language register.

It's a statistical analysis of over 8 million books spanning the 1500s to the 2000s, one of the largest datasets ever assembled for studying how language changes over time. The frequency data captures real written usage at a scale no other freely available corpus matches.

If a word appears in this corpus with meaningful frequency, it's a word that real writers actually used.

When to Use This

  • Vocabulary filtering - Score words to keep common ones, discard rare ones
  • NLP preprocessing - Filter candidates before expensive model calls
  • Content analysis - Measure the reading level or register of a text
  • Spell-check pre-filtering - Reject low-scoring tokens before fuzzy matching
  • Temporal analysis - Find when a word peaked in usage
  • POS disambiguation - Resolve ambiguous tokens using corpus-attested tags

What This Doesn't Do

  • No definitions, synonyms, or semantic relationships (use spaCy or WordNet for that)
  • No spell-checking or suggestions (just existence and frequency)
  • No real-time or web data (this is a static corpus snapshot)

CLI

exists computer       # True, exit 0
exists xyznotaword    # False, exit 1

score computer        # 18
score xyznotaword     # None, exit 1

freq computer
# peak_tf_decade: 2000
# peak_df_decade: 2000
# sum_tf: 892451
# sum_df: 312876

pos fast              # ADJ ADV VERB
pos-freq corn         # ADJ: 1,433,642 / NOUN: 11,722,803 / VERB: 85,411
has-pos sing VERB     # True, exit 0
has-pos fast NOUN     # False, exit 1

Documentation

Development

git clone https://github.com/craigtrim/gngram-lookup.git
cd gngram-lookup
make install   # Install dependencies
make test      # Run tests
make all       # Full pipeline

See Also

  • bnc-lookup - O(1) lookup for British National Corpus (669k words, zero setup)
  • wordnet-lookup - O(1) lookup for WordNet

Attribution

Data derived from the Google Books Ngram dataset.

License

This package is dual-licensed:

  • Software: MIT License
  • Ngram Data: Creative Commons Attribution 3.0 Unported (CC BY 3.0)

See LICENSE for complete terms.

Links

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gngram_lookup-1.2.6.tar.gz (15.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

gngram_lookup-1.2.6-py3-none-any.whl (16.1 kB view details)

Uploaded Python 3

File details

Details for the file gngram_lookup-1.2.6.tar.gz.

File metadata

  • Download URL: gngram_lookup-1.2.6.tar.gz
  • Upload date:
  • Size: 15.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.11.9 Darwin/25.3.0

File hashes

Hashes for gngram_lookup-1.2.6.tar.gz
Algorithm Hash digest
SHA256 0c44da4a78c69c5b8540d0020cfbe759847674ac08f51e8897bcd95e2177ca60
MD5 6b423b2842fe6e53481c0758c5c24517
BLAKE2b-256 772aa615bddcbe102bb8a814bc69d32efd1dec0dbbdedefbffa590d677902c22

See more details on using hashes here.

File details

Details for the file gngram_lookup-1.2.6-py3-none-any.whl.

File metadata

  • Download URL: gngram_lookup-1.2.6-py3-none-any.whl
  • Upload date:
  • Size: 16.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.11.9 Darwin/25.3.0

File hashes

Hashes for gngram_lookup-1.2.6-py3-none-any.whl
Algorithm Hash digest
SHA256 9256a77f04a45297efb19f9ee56dc4fcb6ad65ee27f296ba67ef6748a5ff0ba0
MD5 5d363431e516ca1855d5ed6e59ea6e4f
BLAKE2b-256 dd524055fedf4c318a1c4942ca9d30fb19457e4c8e606265ba0829d9ef8c3c00

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page