Static Hash-Based Lookup for BNC Terms
BNC Lookup
Is this token a word? O(1) answer. No setup. No dependencies.
A simple question deserves a simple answer. This library gives you instant yes/no validation against roughly 669,000 word forms from the British National Corpus.
Quick Start
pip install bnc-lookup
from bnc_lookup import is_bnc_term
# That's it. Start validating.
is_bnc_term('the') # True
is_bnc_term('however') # True
is_bnc_term('nonetheless') # True
is_bnc_term('xyzabc123') # False
# Handles plurals automatically
is_bnc_term('computers') # True
# Case insensitive
is_bnc_term('THE') # True
Features
- Zero Dependencies - Pure Python, no external packages
- Zero I/O - No filesystem access, no database queries
- Zero Setup - No corpus downloads or configuration
- Microsecond Lookups - O(1) set membership
- Smart Plurals - Automatically checks singular forms
- Simple API - One function does it all
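The case-insensitive and plural behaviour listed above can be pictured with a small sketch. This is not the library's code: the `KNOWN` set, the `singular_candidates` helper, and the suffix rules are all assumptions made for illustration, and the package's actual singularisation rules may differ.

```python
# Illustrative only: a tiny stand-in vocabulary, not the BNC data.
KNOWN = frozenset({"computer", "box", "city", "the"})


def singular_candidates(token: str) -> list[str]:
    """Return naive singular guesses for a lowercase token (hypothetical rules)."""
    candidates = []
    if token.endswith("ies") and len(token) > 3:
        candidates.append(token[:-3] + "y")  # cities -> city
    if token.endswith("es") and len(token) > 2:
        candidates.append(token[:-2])        # boxes -> box
    if token.endswith("s") and len(token) > 1:
        candidates.append(token[:-1])        # computers -> computer
    return candidates


def is_known(token: str) -> bool:
    """Case-insensitive membership test with a singular-form fallback."""
    word = token.strip().lower()
    if word in KNOWN:
        return True
    return any(candidate in KNOWN for candidate in singular_candidates(word))
```

The exact form is tried first, so the fallback only costs extra lookups for tokens that miss.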
The Problem This Solves
In NLP, you frequently need to answer the question: "Is this token a real word?"
Not "what does it mean?" Not "give me synonyms." Just: is this a word?
| is_bnc_term('computer') | is_bnc_term('asdfgh') |
|---|---|
| True | False |
That's it. O(1) response. No ambiguity.
Why BNC?
The British National Corpus isn't an academic wordlist (too narrow). It's not a web scrape (too noisy). It's not slang (too ephemeral).
It's a 100-million-word corpus of real British English collected from written and spoken sources between 1991 and 1994: books, newspapers, academic papers, and conversations. The BNC frequency list captures ~669,000 unique word forms actually used by native speakers.
If a token passes the BNC test, you can be confident it's a word that real people actually use.
When to Use This
- Tokenization filtering: Keep real words, discard garbage
- Input validation: Reject nonsense in user input
- NLP preprocessing: Filter candidates before expensive operations
- Spell-check pre-filtering: Quickly reject obvious non-words before fuzzy matching
- Data cleaning: Identify malformed or corrupted text
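As a rough illustration of the tokenization-filtering case, the sketch below splits a token list into accepted and rejected words. Only `is_bnc_term` comes from this package; `split_tokens`, its `is_word` parameter, and the fallback word set are hypothetical names added so the example also runs where the package is not installed.

```python
try:
    from bnc_lookup import is_bnc_term  # the function shown in Quick Start
except ImportError:
    # Fallback so the sketch runs without the package installed.
    # This stand-in set is an assumption, not BNC data.
    _DEMO_WORDS = frozenset({"the", "cat", "sat", "on", "mat"})

    def is_bnc_term(token: str) -> bool:
        return token.lower() in _DEMO_WORDS


def split_tokens(tokens, is_word=is_bnc_term):
    """Partition tokens into (kept, rejected) using a yes/no word check."""
    kept, rejected = [], []
    for token in tokens:
        (kept if is_word(token) else rejected).append(token)
    return kept, rejected


if __name__ == "__main__":
    kept, rejected = split_tokens(["the", "cat", "qzxvw", "sat"])
    print("kept:", kept)
    print("rejected:", rejected)
```

Passing the predicate as a parameter keeps the filter easy to test and lets you swap in a different word check later.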
What This Doesn't Do
- No definitions, synonyms, or semantic relationships (use spaCy for that)
- No frequency counts or rankings (just yes/no)
- No spell-checking or suggestions (just existence check)
Documentation
For detailed usage, performance benchmarks, and advanced features, see the API Documentation.
How It Works
BNC terms are stored as MD5 hash suffixes in 256 frozenset buckets (by first two hex characters of the hash). Lookups hash the input, route to the correct bucket, and perform O(1) set membership. Modules are lazy-loaded on first access per bucket.
For the gory details, see Implementation Notes.
Development
git clone https://github.com/craigtrim/bnc-lookup.git
cd bnc-lookup
make install # Install dependencies
make test # Run tests
make all # Full build pipeline
See API Documentation for detailed development information.
License
This package is dual-licensed:
- Software: MIT License
- BNC Data: BNC User Licence
See LICENSE for complete terms.
Attribution
This package contains data derived from the British National Corpus frequency lists:
BNC frequency lists compiled by Adam Kilgarriff. Source: https://www.kilgarriff.co.uk/BNClists/all.num.gz
The British National Corpus, version 3 (BNC XML Edition). 2007. Distributed by Bodleian Libraries, University of Oxford, on behalf of the BNC Consortium.
Note: This is a static snapshot of BNC frequency data. The data is not automatically updated.
Links
- Repository: github.com/craigtrim/bnc-lookup
- PyPI: pypi.org/project/bnc-lookup
- BNC: natcorp.ox.ac.uk
- Author: Craig Trim (craigtrim@gmail.com)