Skip to main content

Static Hash-Based Lookup for Google Ngram Frequencies

Project description

gngram-lookup

PyPI version Downloads Downloads/Month Tests Python 3.9+ License: MIT

How common is this word? O(1) answer. 500 years of books. 5 million words.

Word frequency, commonness scores, and part-of-speech tags derived from the Google Books Ngram corpus, the largest longitudinal word-frequency dataset ever compiled.

Quick Start

pip install gngram-lookup
python -m gngram_lookup.download_data       # frequency data, ~110 MB
python -m gngram_lookup.download_pos_data   # POS tag data, optional
import gngram_lookup as ng

# Does this word exist in 500 years of print?
ng.exists('computer')        # True
ng.exists('xyznotaword')     # False

# How common is it? (1=most common, 100=least common)
ng.word_score('the')         # 1
ng.word_score('computer')    # 18
ng.word_score('rucksack')    # 58
ng.word_score('xyznotaword') # None

# Full frequency data
ng.frequency('computer')
# {'peak_tf': 2000, 'peak_df': 2000, 'sum_tf': 892451, 'sum_df': 312876}

# Part-of-speech tags
ng.pos('fast')                        # ['.', 'ADJ', 'ADP', 'ADV', 'CONJ', 'DET', ...]
ng.pos('fast', min_tf=1_000_000)      # ['ADJ', 'ADV', 'NOUN']  — dominant tags only
ng.has_pos('sing', ng.PosTag.VERB)    # True

Features

  • O(1) Lookups - Hash-bucketed parquet files, no full scan
  • 5 Million Words - Broadest vocabulary coverage of any static lookup package
  • 500 Years of Data - 1500s through 2000s, decade-by-decade
  • Word Score - 1–100 commonness scale, log-normalized against the corpus
  • Peak Decade - Know when a word was most used in print
  • POS Tags - Part-of-speech tags from Google's own tag set
  • Batch Lookup - Efficient multi-word queries grouped by hash prefix
  • Possessive & Hyphen Fallback - ship'sship, north-westnorth
  • Unicode Normalization - Smart quotes, accents, and Unicode apostrophes handled

Word Score Zones

Words are scored 1–100 based on their total corpus frequency, log-normalized against the (the most frequent word). The scale compresses the Zipfian spike at the top and gives meaningful resolution across the rest of the vocabulary.

gngram-lookup Word Score Zones

Score Zone Description Examples
1–5 Zipf Function words dominating millions of books the, of, and, to, a
6–20 Core Everyday words known by all speakers computer, walk, beautiful
21–40 Literary Vocabulary of books, not everyday speech algorithm, philosophy, synthesis
41–60 Specialized Domain-specific but attested rucksack, carbonate, heliotrope
61–80 Rare Infrequent but legitimate arcane technical and literary terms
81–100 Long Tail Extremely low frequency niche, archaic, or highly specialized
import gngram_lookup as ng

# "the data philosopher defenestrated eigenvalues into petrichor"
ng.word_score('the')            # 1   → Zipf
ng.word_score('data')           # 19  → Core
ng.word_score('philosopher')    # 32  → Literary
ng.word_score('eigenvalue')     # 43  → Specialized
ng.word_score('defenestrated')  # 67  → Rare
ng.word_score('petrichor')      # 82  → Long Tail

# Filter to common words only
def is_common(word):
    score = ng.word_score(word)
    return score is not None and score <= 30

The Problem This Solves

In NLP, you frequently need to know not just whether a token is a word, but how common it is.

Not "what does it mean?" Not "what's its etymology?" Just: is this a word real people actually write, and how often?

No other static lookup package gives you this from a corpus spanning 500 years of print at 5 million word coverage with O(1) access.

Why Google Books Ngrams?

The Google Books Ngram corpus isn't a dictionary (too narrow). It isn't a web crawl (too noisy). It isn't limited to a single decade or language register.

It's a statistical analysis of over 8 million books spanning the 1500s to the 2000s, one of the largest datasets ever assembled for studying how language changes over time. The frequency data captures real written usage at a scale no other freely available corpus matches.

If a word appears in this corpus with meaningful frequency, it's a word that real writers actually used.

When to Use This

  • Vocabulary filtering - Score words to keep common ones, discard rare ones
  • NLP preprocessing - Filter candidates before expensive model calls
  • Content analysis - Measure the reading level or register of a text
  • Spell-check pre-filtering - Reject low-scoring tokens before fuzzy matching
  • Temporal analysis - Find when a word peaked in usage
  • POS disambiguation - Resolve ambiguous tokens using corpus-attested tags

What This Doesn't Do

  • No definitions, synonyms, or semantic relationships (use spaCy or WordNet for that)
  • No spell-checking or suggestions (just existence and frequency)
  • No real-time or web data (this is a static corpus snapshot)

CLI

exists computer       # True, exit 0
exists xyznotaword    # False, exit 1

score computer        # 18
score xyznotaword     # None, exit 1

freq computer
# peak_tf_decade: 2000
# peak_df_decade: 2000
# sum_tf: 892451
# sum_df: 312876

pos fast              # ADJ ADV VERB
pos-freq corn         # ADJ: 1,433,642 / NOUN: 11,722,803 / VERB: 85,411
has-pos sing VERB     # True, exit 0
has-pos fast NOUN     # False, exit 1

Documentation

Development

git clone https://github.com/craigtrim/gngram-lookup.git
cd gngram-lookup
make install   # Install dependencies
make test      # Run tests
make all       # Full pipeline

See Also

  • bnc-lookup - O(1) lookup for British National Corpus (669k words, zero setup)
  • wordnet-lookup - O(1) lookup for WordNet

Attribution

Data derived from the Google Books Ngram dataset.

License

This package is dual-licensed:

  • Software: MIT License
  • Ngram Data: Creative Commons Attribution 3.0 Unported (CC BY 3.0)

See LICENSE for complete terms.

Links

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gngram_lookup-1.2.9.tar.gz (16.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

gngram_lookup-1.2.9-py3-none-any.whl (16.2 kB view details)

Uploaded Python 3

File details

Details for the file gngram_lookup-1.2.9.tar.gz.

File metadata

  • Download URL: gngram_lookup-1.2.9.tar.gz
  • Upload date:
  • Size: 16.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.11.9 Darwin/25.3.0

File hashes

Hashes for gngram_lookup-1.2.9.tar.gz
Algorithm Hash digest
SHA256 dd36f3e91e4e8a904c0a7a043f1e25669ea9a67c7efd3ae227b6e1f75f31dae9
MD5 4fe454486f2653ec883bc4f298b1b343
BLAKE2b-256 165d5e00010832c9655e9d11682ec64e215422be123bd47cefd00b3de1b5cf15

See more details on using hashes here.

File details

Details for the file gngram_lookup-1.2.9-py3-none-any.whl.

File metadata

  • Download URL: gngram_lookup-1.2.9-py3-none-any.whl
  • Upload date:
  • Size: 16.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.11.9 Darwin/25.3.0

File hashes

Hashes for gngram_lookup-1.2.9-py3-none-any.whl
Algorithm Hash digest
SHA256 726303845f3f10162817d4a656ef7592dbd4c9a5d1291e6738680f49d57b9768
MD5 218224e1bba4f86203bb6c4b51dc86c2
BLAKE2b-256 98d816df8a90c891d6e9c18878a250ebdd8e6e887f3353dddc29c24a1a12e669

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page