Static Hash-Based Lookup for BNC Terms

BNC Lookup

Featured Article: Every Word Has a Price Tag — How word frequency data from the British National Corpus can transform your NLP pipelines.

Is this token a word? O(1) answer. No setup. No dependencies.

A simple question deserves a simple answer. This library gives you instant yes/no validation against 669,000 word forms from the British National Corpus, plus frequency ranking.

Quick Start

pip install bnc-lookup

import bnc_lookup as bnc

# Check if a word exists
bnc.exists('the')          # True
bnc.exists('however')      # True
bnc.exists('xyzabc123')    # False

# Get frequency bucket (1=most common, 100=least common)
bnc.bucket('the')          # 1
bnc.bucket('python')       # 4
bnc.bucket('qwerty')       # 12
bnc.bucket('xyzabc123')    # None (not found)

# Relative frequency (per-word precision)
bnc.relative_frequency('the')              # 0.0618
bnc.relative_frequency('shimmered')        # 9.79e-07

# Expected occurrences in a text of given length
bnc.expected_count('the', 50000)           # 3090.7
bnc.expected_count('the', 50000, rounded=True)  # 3091

# Handles plurals and case automatically
bnc.exists('computers')    # True
bnc.exists('THE')          # True

Features

  • Zero Dependencies - Pure Python, no external packages
  • Zero I/O - No filesystem access, no database queries
  • Zero Setup - No corpus downloads or configuration
  • Microsecond Lookups - O(1) dictionary access
  • Smart Plurals - Automatically checks singular forms
  • Frequency Ranking - 100 buckets from most to least common
  • Relative Frequency - Per-word precision for quantitative analysis
  • Expected Counts - Predict word occurrences in any text length
  • CLI Tools - bnc-exists, bnc-bucket, bnc-freq, bnc-expected
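
As a rough illustration of how a constant-time lookup with case folding and a plural fallback can work, here is a minimal sketch. It is not the package's actual implementation: the tiny `VOCAB` set, the `sketch_exists` function, and the suffix-stripping rules are all assumptions made for this example.

```python
# Hypothetical stand-in vocabulary; the real package covers ~669,000 forms.
VOCAB = frozenset({"the", "computer", "however", "box", "city"})


def sketch_exists(token: str) -> bool:
    """Return True if token, or a naive singular of it, is in VOCAB."""
    word = token.lower()
    if word in VOCAB:  # average-case O(1) hash lookup
        return True
    # Naive regular-plural fallbacks (assumed rules, not the library's).
    candidates = []
    if word.endswith("ies") and len(word) > 3:
        candidates.append(word[:-3] + "y")  # cities -> city
    if word.endswith("es") and len(word) > 2:
        candidates.append(word[:-2])        # boxes -> box
    if word.endswith("s") and len(word) > 1:
        candidates.append(word[:-1])        # computers -> computer
    return any(c in VOCAB for c in candidates)
```

A rule-based fallback like this cannot recover irregular forms such as "mice", which is consistent with the limitation listed under "What This Doesn't Do".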

The Problem This Solves

In NLP, you frequently need to answer the question: "Is this token a real word?"

Not "what does it mean?" Not "give me synonyms." Just: is this a word?

bnc.exists('computer')     # True
bnc.exists('asdfgh')       # False

That's it. O(1) response. No ambiguity.

Frequency Buckets

Words are ranked into 100 buckets based on their frequency in the BNC:

Bucket   Description                    Examples
1        Most frequent (~6,700 words)   the, of, and, is, computer
2-10     Very common                    algorithm, python, beautiful
11-50    Common                         qwerty, specialized terms
51-99    Less common                    Rare but valid words
100      Least frequent                 Obscure terms

import bnc_lookup as bnc

# Filter by frequency
def is_common_word(word):
    bucket = bnc.bucket(word)
    return bucket is not None and bucket <= 10
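
The bucket-1 size quoted in the table (~6,700 words) is what you would get from cutting a frequency-ranked list of about 669,000 forms into 100 equal slices. The sketch below shows that kind of assignment. It assumes equal-sized buckets, and `bucket_for_rank` is a hypothetical helper; the package's actual bucketing rule may differ.

```python
import math


def bucket_for_rank(rank: int, total_words: int, n_buckets: int = 100) -> int:
    """Map a 1-based frequency rank (1 = most frequent) to a bucket 1..n_buckets."""
    if not 1 <= rank <= total_words:
        raise ValueError("rank must be between 1 and total_words")
    # Size of each slice, rounded up so every rank lands in a valid bucket.
    bucket_size = math.ceil(total_words / n_buckets)
    return (rank - 1) // bucket_size + 1
```

With `total_words=669000`, each slice holds 6,690 ranks, so ranks 1 through 6,690 fall in bucket 1 and the last rank falls in bucket 100.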

Why BNC?

The British National Corpus isn't an academic wordlist (too narrow). It's not a web scrape (too noisy). It's not slang (too ephemeral).

It's a 100-million-word corpus of real British English collected from written and spoken sources between 1991 and 1994. Books, newspapers, academic papers, conversations. The BNC frequency list captures ~669,000 unique word forms actually used by native speakers.

If a token passes the BNC test, you can be confident it's a word that real people actually use.

Real Words vs Dictionary Words

How much of real-world English is in the dictionary? We compared BNC against WordNet:

[Figure: BNC Vocabulary Zones by WordNet Coverage]

93% of common words (buckets 1-10) are in WordNet. But dictionaries miss proper nouns, technical terms, compounds, and domain jargon that appear constantly in real text.

That's the gap BNC fills. Full analysis

When to Use This

  • Tokenization filtering: Keep real words, discard garbage
  • Input validation: Reject nonsense in user input
  • NLP preprocessing: Filter candidates before expensive operations
  • Spell-check pre-filtering: Quickly reject obvious non-words before fuzzy matching
  • Data cleaning: Identify malformed or corrupted text
  • Frequency-based filtering: Prefer common words over obscure ones
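
Most of these uses reduce to filtering a token stream with a yes/no predicate. The sketch below shows that pattern. `keep_real_words` and `demo_is_word` are hypothetical names, and the three-word vocabulary is a stand-in so the example runs without the package installed; with bnc-lookup available you would pass `bnc.exists` as the predicate.

```python
from typing import Callable, Iterable, List


def keep_real_words(tokens: Iterable[str], is_word: Callable[[str], bool]) -> List[str]:
    """Return only the tokens that the supplied predicate accepts."""
    return [tok for tok in tokens if is_word(tok)]


# Stand-in predicate (assumption for this example only).
_demo_vocab = {"the", "cat", "sat"}


def demo_is_word(token: str) -> bool:
    return token.lower() in _demo_vocab


cleaned = keep_real_words(["The", "cat", "xq7z", "sat"], demo_is_word)
```

Taking the predicate as a parameter keeps the filter easy to test and lets you swap in a stricter check, such as the `is_common_word` function shown under "Frequency Buckets".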

What This Doesn't Do

  • No definitions, synonyms, or semantic relationships (use WordNet for that)
  • No spell-checking or suggestions (just existence check)
  • No irregular plural handling ("mice" → "mouse")
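
If irregular plurals matter for your data, one workaround is to check a small hand-maintained map before giving up. This is a sketch under stated assumptions, not a feature of the package: `IRREGULAR_PLURALS` and `exists_with_irregulars` are hypothetical names, and the `exists` argument stands for whatever existence check you use (for example `bnc.exists`).

```python
from typing import Callable

# Hypothetical, hand-maintained map; extend it for your own data.
IRREGULAR_PLURALS = {"mice": "mouse", "geese": "goose", "children": "child"}


def exists_with_irregulars(token: str, exists: Callable[[str], bool]) -> bool:
    """Check the token itself, then its mapped irregular singular, if any."""
    word = token.lower()
    if exists(word):
        return True
    singular = IRREGULAR_PLURALS.get(word)
    return singular is not None and exists(singular)
```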

CLI

bnc-exists the            # True (exit code 0)
bnc-bucket python         # 4
bnc-freq the              # 6.181373e-02
bnc-expected the 50000    # 3090.6865
bnc-expected the 50000 --rounded  # 3091
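
The figures above are consistent with a simple product: expected count = relative frequency × text length (6.181373e-02 × 50,000 ≈ 3090.6865). A minimal sketch of that arithmetic follows. `sketch_expected_count` is a hypothetical name, and the formula is inferred from the outputs shown here rather than taken from the package source.

```python
def sketch_expected_count(relative_frequency: float, text_length: int, rounded: bool = False):
    """Expected occurrences = relative frequency x number of tokens in the text."""
    value = relative_frequency * text_length
    return round(value) if rounded else value


# Relative frequency of 'the' as printed by bnc-freq above.
expected = sketch_expected_count(6.181373e-02, 50000)
expected_rounded = sketch_expected_count(6.181373e-02, 50000, rounded=True)
```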

Documentation

For detailed usage, performance benchmarks, and advanced features, see the API Documentation.

Development

git clone https://github.com/craigtrim/bnc-lookup.git
cd bnc-lookup
make install  # Install dependencies
make test     # Run tests
make all      # Full build pipeline

See API Documentation for detailed development information.

License

This package is covered by two licenses, one for the code and one for the data:

  • Software: MIT License
  • BNC Data: BNC User Licence

See LICENSE for complete terms.

Attribution

This package contains data derived from the British National Corpus frequency lists:

BNC frequency lists compiled by Adam Kilgarriff. Source: https://www.kilgarriff.co.uk/BNClists/all.num.gz

The British National Corpus, version 3 (BNC XML Edition). 2007. Distributed by Bodleian Libraries, University of Oxford, on behalf of the BNC Consortium.

Note: This is a static snapshot of BNC frequency data. The data is not automatically updated.
