
Static Hash-Based Lookup for BNC Terms

BNC Lookup

Is this token a word? O(1) answer. No setup. No dependencies.

A simple question deserves a simple answer. This library gives you instant yes/no validation against 669,000 word forms from the British National Corpus.

Quick Start

pip install bnc-lookup

from bnc_lookup import is_bnc_term

# That's it. Start validating.
is_bnc_term('the')          # True
is_bnc_term('however')      # True
is_bnc_term('nonetheless')  # True
is_bnc_term('xyzabc123')    # False

# Handles plurals automatically
is_bnc_term('computers')    # True

# Case insensitive
is_bnc_term('THE')          # True

Features

  • Zero Dependencies - Pure Python, no external packages
  • Zero I/O - No filesystem access, no database queries
  • Zero Setup - No corpus downloads or configuration
  • Microsecond Lookups - O(1) dictionary access
  • Smart Plurals - Automatically checks singular forms
  • Simple API - One function does it all

The Problem This Solves

In NLP, you frequently need to answer the question: "Is this token a real word?"

Not "what does it mean?" Not "give me synonyms." Just: is this a word?

is_bnc_term('computer')     # True
is_bnc_term('asdfgh')       # False

That's it. O(1) response. No ambiguity.
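
To see what a lookup costs on your own machine, you can time it with the standard library's `timeit`. This is a rough sketch, assuming `bnc-lookup` is installed; the stub in the `except` branch is a hypothetical stand-in so the snippet still runs without the package, and the numbers will vary by hardware and Python version.

```python
import timeit

try:
    from bnc_lookup import is_bnc_term
except ImportError:
    # Stand-in so the sketch runs without the package installed.
    def is_bnc_term(token: str) -> bool:
        return token.lower() in {"computer"}

calls = 100_000
seconds = timeit.timeit(lambda: is_bnc_term("computer"), number=calls)
per_call_us = seconds / calls * 1e6  # convert seconds per call to microseconds
print(f"{per_call_us:.3f} microseconds per lookup")
```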

Why BNC?

The British National Corpus isn't an academic wordlist (too narrow). It's not a web scrape (too noisy). It's not slang (too ephemeral).

It's a 100-million-word corpus of real British English collected from written and spoken sources between 1991 and 1994: books, newspapers, academic papers, conversations. The BNC frequency list captures ~669,000 unique word forms actually used by native speakers.

If a token passes the BNC test, you can be confident it's a word that real people actually use.

When to Use This

  • Tokenization filtering: Keep real words, discard garbage
  • Input validation: Reject nonsense in user input
  • NLP preprocessing: Filter candidates before expensive operations
  • Spell-check pre-filtering: Quickly reject obvious non-words before fuzzy matching
  • Data cleaning: Identify malformed or corrupted text
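
Most of the use cases above reduce to partitioning tokens with a yes/no check. The following is a minimal sketch, assuming `bnc-lookup` is installed; the fallback stub in the `except` branch is a hypothetical stand-in so the example still runs without the package, and `split_tokens` is an illustrative helper, not part of the library.

```python
try:
    from bnc_lookup import is_bnc_term
except ImportError:
    # Stand-in for illustration only: a tiny word set instead of the real data.
    _DEMO_WORDS = {"the", "quick", "brown", "fox"}

    def is_bnc_term(token: str) -> bool:
        return token.lower() in _DEMO_WORDS


def split_tokens(tokens, is_word=is_bnc_term):
    """Partition tokens into (accepted, rejected) using a yes/no predicate."""
    accepted, rejected = [], []
    for token in tokens:
        (accepted if is_word(token) else rejected).append(token)
    return accepted, rejected


words, leftovers = split_tokens(["The", "quick", "brwn", "fox", "xq9z"])
print(words)      # tokens that passed the check
print(leftovers)  # candidates for fuzzy matching or removal
```

Taking the predicate as a parameter keeps the helper easy to test and lets you swap in a different word list later without touching the filtering logic.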

What This Doesn't Do

  • No definitions, synonyms, or semantic relationships (use a lexical database such as WordNet for that)
  • No frequency counts or rankings (just yes/no)
  • No spell-checking or suggestions (just existence check)

Documentation

For detailed usage, performance benchmarks, and advanced features, see the API Documentation.

Development

git clone https://github.com/craigtrim/bnc-lookup.git
cd bnc-lookup
make install  # Install dependencies
make test     # Run tests
make all      # Full build pipeline

See API Documentation for detailed development information.

License

This package is dual-licensed:

  • Software: MIT License
  • BNC Data: BNC User Licence

See LICENSE for complete terms.

Attribution

This package contains data derived from the British National Corpus frequency lists:

BNC frequency lists compiled by Adam Kilgarriff. Source: https://www.kilgarriff.co.uk/BNClists/all.num.gz

The British National Corpus, version 3 (BNC XML Edition). 2007. Distributed by Bodleian Libraries, University of Oxford, on behalf of the BNC Consortium.

Note: This is a static snapshot of BNC frequency data. The data is not automatically updated.

