Created for ONS. Proof-of-concept mmap'd Rust word2vec implementation linked with category matching

bonn-py

NLP Category-Matching tools

A Rust microservice to match queries on the ONS Website to groupings in the ONS taxonomy

Getting started

Set up taxonomy.json

This should be adapted from the taxonomy.json.example and placed in the root directory.

Download or create embeddings

These are most simply sourced as pretrained fifu models, but can be dynamically generated using the embedded FinalFusion libraries.

To build wheels for distribution, use:

make

Configuration

| Environment variable | Default | Description |
| --- | --- | --- |
| CATEGORY_API_HOST | 0.0.0.0 | Host |
| CATEGORY_API_PORT | 28800 | Port that the API listens on |
| CATEGORY_API_DUMMY_RUN | false | Returns an empty list, for testing purposes |
| CATEGORY_API_DEBUG_LEVEL_FOR_DYNACONF | "DEBUG" | Verbosity of dynaconf internal logging |
| CATEGORY_API_ENVVAR_PREFIX_FOR_DYNACONF | "CATEGORY_API" | Prefix of the environment variables that dynaconf reads into its configuration |
| CATEGORY_API_FIFU_FILE | "test_data/wiki.en.fifu" | Location of the finalfusion file |
| CATEGORY_API_THRESHOLD | 0.4 | Threshold below which a category is considered low-scoring |
| CATEGORY_API_CACHE_S3_BUCKET | | S3 bucket for cache files, in the format "s3://" |

Core variables:

| Environment variable | Default | Description |
| --- | --- | --- |
| BONN_CACHE_TARGET | "cache.json" | Cache target |
| BONN_ELASTICSEARCH_HOST | "http://localhost:9200" | Elasticsearch host |
| BONN_REBUILD_CACHE | true | Whether the cache should be rebuilt |
| BONN_TAXONOMY_LOCATION | "test_data/taxonomy.json" | Location of taxonomy |
| BONN_ELASTICSEARCH_INDEX | "ons1639492069322" | Name of the Elasticsearch index |
| BONN_WEIGHTING__C | 1 | Word vectors based on the words in the category name |
| BONN_WEIGHTING__SC | 2 | Word vectors based on the words in the sub-category names |
| BONN_WEIGHTING__SSC | 2 | Word vectors based on the words in the sub-sub-category names |
| BONN_WEIGHTING__WC | 6 | Based on a bag of words found in the metadata of the datasets found in the categories |
| BONN_WEIGHTING__WSSC | 8 | Based on a bag of words found in the metadata of the datasets found in the sub-sub-categories |
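
As a rough illustration of how these settings resolve, the sketch below reads a few of the variables from the environment and falls back to the defaults listed above. It uses only the standard library rather than dynaconf, and the `load_settings` helper is hypothetical, not part of bonn.

```python
import os

# Defaults copied from the configuration table; only a subset is shown.
DEFAULTS = {
    "CATEGORY_API_HOST": "0.0.0.0",
    "CATEGORY_API_PORT": "28800",
    "CATEGORY_API_DUMMY_RUN": "false",
    "CATEGORY_API_THRESHOLD": "0.4",
    "BONN_CACHE_TARGET": "cache.json",
    "BONN_REBUILD_CACHE": "true",
}


def load_settings(environ=None):
    """Hypothetical helper: overlay environment values on the defaults."""
    environ = os.environ if environ is None else environ
    raw = {key: environ.get(key, default) for key, default in DEFAULTS.items()}
    return {
        "host": raw["CATEGORY_API_HOST"],
        "port": int(raw["CATEGORY_API_PORT"]),
        "dummy_run": raw["CATEGORY_API_DUMMY_RUN"].lower() == "true",
        "threshold": float(raw["CATEGORY_API_THRESHOLD"]),
        "cache_target": raw["BONN_CACHE_TARGET"],
        "rebuild_cache": raw["BONN_REBUILD_CACHE"].lower() == "true",
    }


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dictionary instead of `os.environ` makes it easy to check an override, for example `load_settings({"CATEGORY_API_PORT": "9000"})`.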

Manual building

Quick Local Setup

  1. Set up the .env file: $ cp .env.local .env

  2. Run make wheels

  3. Make sure you've placed taxonomy.json in the root folder (this should be obtained from ONS).

  4. [TODO: genericize] You need an Elasticsearch container forwarded to port 9200 (you can customize the port in .env) with a dump matching the appropriate schema. The README at https://gitlab.com/flaxandteal/onyx/dp-search-api explains how to set up Elasticsearch.

Install finalfusion utils

cd core
RUSTFLAGS="-C link-args=-lcblas -llapack" cargo install finalfusion-utils --features=opq

Optional: Convert the model to quantized fifu format

Note: if you try to use the full wiki bin you'll need about 128GB of RAM...

finalfusion quantize -f fasttext -q opq <fasttext.bin> fasttext.fifu.opq

Install deps and build

poetry shell
cd core
poetry install
cd ../api
poetry install
exit

Run

poetry run python -c "from bonn import FfModel; FfModel('test_data/wiki.en.fifu').eval('Hello')"

Create cache

You can create a cache with the following command:

poetry run python -m bonn.extract

This assumes that the correct environment variables for the NLP model, taxonomy and Elasticsearch are set.

Algorithm

The following requirements were identified:

  • Fast response to live requests
  • Low running resource requirements, as far as possible
  • Ability to limit risk of unintended bias in results, and making results explainable
  • Minimal needed preprocessing of data (at least for first version)
  • Non-invasive - ensuring that the system can enhance existing work by ONS teams, with minimal changes required to incorporate
  • Runs effectively and reproducibly in ONS workflows

We found that the most effective approach was to use the standard Wikipedia unstructured word2vec model as the ML basis.

This has the additional advantage that we have been able to prototype category matching for other languages within the algorithm. Further work is required, including manual review by native speakers, and initial results suggest that a larger language corpus would be required for training.

Using finalfusion libraries in Rust enables mmapping for memory efficiency.
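
To show why memory mapping helps, here is a minimal Python sketch that stores a toy embedding matrix on disk and reads a single row through a memory map, so the operating system only loads the pages that are touched. The flat float32 layout and the helper names are assumptions made for illustration; this is not the finalfusion file format or its API.

```python
import mmap
import os
import struct
import tempfile

DIMS = 4  # toy dimensionality; real embeddings typically have hundreds
ROW_BYTES = DIMS * 4  # each value is a little-endian float32


def write_matrix(path, rows):
    """Write rows of floats as a flat float32 matrix (toy format, not fifu)."""
    with open(path, "wb") as handle:
        for row in rows:
            handle.write(struct.pack(f"<{DIMS}f", *row))


def read_row(path, index):
    """Read one row through a memory map instead of loading the whole file."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return struct.unpack_from(f"<{DIMS}f", mapped, index * ROW_BYTES)


def demo():
    rows = [[0.0, 0.5, 1.0, 1.5], [2.0, 2.5, 3.0, 3.5], [4.0, 4.5, 5.0, 5.5]]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.bin")
        write_matrix(path, rows)
        return read_row(path, 1)


if __name__ == "__main__":
    print(demo())
```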

Category Vectors

Each category is represented by a single vector, formed from a bag of words as a weighted average of the term vectors. Each term is weighted according to the attribute that contributed it:

| Grouping | Score basis |
| --- | --- |
| Category (top-level) | Literal words within title |
| Subcategory (second-level) | Literal words within title |
| Subsubcategory (third-level) | Literal words within title |
| Related words across whole category | Common thematic words across all datasets within the category |
| Related words across subsubcategory | Common thematic words across all datasets within the subsubcategory |
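
The weighted average can be sketched as follows, using the default BONN_WEIGHTING__* values from the configuration section as weights. The `category_vector` function and the two-dimensional toy vectors are hypothetical; the real service works on finalfusion embeddings.

```python
# Weights taken from the BONN_WEIGHTING__* defaults in the configuration section.
WEIGHTS = {"C": 1, "SC": 2, "SSC": 2, "WC": 6, "WSSC": 8}


def category_vector(bag):
    """Weighted average of word vectors.

    `bag` is a list of (attribute, vector) pairs, where the attribute is a key
    of WEIGHTS naming the grouping that contributed the word.
    """
    dims = len(bag[0][1])
    total = [0.0] * dims
    weight_sum = 0.0
    for attribute, vector in bag:
        weight = WEIGHTS[attribute]
        weight_sum += weight
        for i, value in enumerate(vector):
            total[i] += weight * value
    return [value / weight_sum for value in total]


# Toy two-dimensional vectors standing in for real embeddings.
bag = [
    ("C", [1.0, 0.0]),     # word from the category title
    ("SC", [0.0, 1.0]),    # word from the subcategory title
    ("WSSC", [1.0, 1.0]),  # related word from the subsubcategory's datasets
]
vector = category_vector(bag)
print(vector)
```

With these inputs the weights sum to 11, so the result is [9/11, 10/11]: the related word dominates because its weight is 8.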

To build a weighted bag of words, the system finds thematically-distinctive words occurring in dataset titles and descriptions present in the categories, according to the taxonomy. The "thematic distinctiveness" of words in a dataset description is defined by exceeding a similarity threshold to terms in the category title.
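
A minimal sketch of this selection step, assuming a word counts as thematically distinctive when its best cosine similarity to any title term exceeds the threshold. The function names, the toy embeddings and the threshold value of 0.9 are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def distinctive_words(description_words, title_words, embeddings, threshold):
    """Keep description words whose best similarity to a title word exceeds the threshold."""
    title_vectors = [embeddings[t] for t in title_words if t in embeddings]
    kept = []
    for word in description_words:
        if word not in embeddings:
            continue  # out-of-vocabulary words cannot be scored
        best = max((cosine(embeddings[word], t) for t in title_vectors), default=0.0)
        if best > threshold:
            kept.append(word)
    return kept


# Toy embeddings; real vectors would come from the word2vec model.
EMBEDDINGS = {
    "education": [1.0, 0.0],
    "school": [0.9, 0.1],
    "pupil": [0.8, 0.3],
    "price": [0.0, 1.0],
}
words = distinctive_words(["school", "pupil", "price", "unknown"], ["education"], EMBEDDINGS, 0.9)
print(words)
```

Here "school" and "pupil" are kept for the "education" title, while "price" and the out-of-vocabulary word are dropped.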

These can then be compared to search queries word-by-word, obtaining a score for each taxonomy entry, for a given phrase.
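
As an illustration, the sketch below scores a phrase against each category by averaging the cosine similarity of every query word with the category vector. Averaging is an assumption made here for simplicity, and `score_query` and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_query(query, embeddings, category_vectors):
    """Return one score per category: the mean per-word similarity to its vector."""
    words = [w for w in query.lower().split() if w in embeddings]
    scores = {}
    for name, vector in category_vectors.items():
        if not words:
            scores[name] = 0.0  # nothing in the query could be embedded
            continue
        scores[name] = sum(cosine(embeddings[w], vector) for w in words) / len(words)
    return scores


# Toy vectors standing in for real embeddings and category vectors.
EMBEDDINGS = {"school": [1.0, 0.0], "meals": [0.6, 0.8], "inflation": [0.0, 1.0]}
CATEGORIES = {"education": [1.0, 0.0], "economy": [0.0, 1.0]}
scores = score_query("School meals", EMBEDDINGS, CATEGORIES)
print(scores)
```

For "School meals" this gives roughly 0.8 for "education" and 0.4 for "economy".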

Scoring Adjustment

In addition to the direct cosine similarity of these vectors, we:

  • remove any stopwords from the search scoring, with certain additional words that should not affect the category matching (“data”, “statistics”, “measure(s)”)
  • apply an overall significance boost for a category, using the magnitude of the average word vector for its bag as a proxy for how “significant” it is that it matches a query phrase (so categories that match overly frequently, such as “population”, are slightly deprioritized)
  • enhance or reduce contribution from each of the words in the query based on their commonality across categories.

To do the last, a global count of (lemmatized) words appearing in dataset descriptions/titles across all categories is made, and common terms are deprioritized within the bag according to an exponential decay function - this allows us to rely more heavily on words that strongly signpost a category (such as “education” or “school”) without being confounded by words many categories contain (such as “price” or “economic”).
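
The stopword removal and the decay weighting can be sketched as below. This is a simplification under stated assumptions: lemmatization is skipped, commonality is measured as the number of categories containing a word, and the stopword list, the `decay` parameter and both function names are hypothetical.

```python
import math
from collections import Counter

# A tiny stand-in stopword list, plus the additional words named in the text.
STOPWORDS = {"the", "of", "and", "in", "for", "a"}
EXTRA_STOPWORDS = {"data", "statistics", "measure", "measures"}


def filter_query(words):
    """Drop stopwords and the extra non-informative words from a query."""
    ignored = STOPWORDS | EXTRA_STOPWORDS
    return [w for w in words if w.lower() not in ignored]


def decay_weights(category_words, decay=1.0):
    """Weight each word by exp(-decay * (n - 1)), n = number of categories containing it."""
    counts = Counter()
    for words in category_words.values():
        counts.update(set(words))  # count each word once per category
    return {word: math.exp(-decay * (n - 1)) for word, n in counts.items()}


CATEGORY_WORDS = {
    "education": ["school", "price"],
    "economy": ["price", "inflation"],
    "housing": ["price", "rent"],
}
weights = decay_weights(CATEGORY_WORDS)
print(filter_query(["school", "data", "in", "Wales"]))
print(weights)
```

In this toy data "school" appears in one category and keeps a weight of 1.0, while "price" appears in all three and decays to exp(-2), about 0.135.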

Once per-category scores for a search phrase are obtained, we filter them based on:

  • appearance thresholds, to ensure we only return matches over a minimal viable score;
  • a signal-to-noise ratio (SNR) filter that returns a small number of notably high-scoring categories or a larger group of less distinguishable top scorers, according to a supplied SNR.
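
One possible reading of these two filters is sketched below: scores under the threshold are dropped, then only categories whose score is within a factor of `snr` of the best score are kept. The exact SNR definition is not given above, so this ratio-to-best rule, the `filter_scores` name and the `snr=1.5` value are assumptions; the 0.4 threshold mirrors the CATEGORY_API_THRESHOLD default.

```python
def filter_scores(scores, threshold=0.4, snr=1.5):
    """Apply a minimum-score threshold, then keep scores within a factor `snr` of the best."""
    viable = {name: s for name, s in scores.items() if s >= threshold}
    if not viable:
        return []
    best = max(viable.values())
    kept = [(name, s) for name, s in viable.items() if s * snr >= best]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# One clear winner: only the stand-out category survives.
peaked = {"education": 0.9, "economy": 0.5, "housing": 0.45, "health": 0.2}
# Closely grouped scores: all viable categories are returned.
flat = {"education": 0.6, "economy": 0.55, "housing": 0.5, "health": 0.2}

print(filter_scores(peaked))
print(filter_scores(flat))
```

The first call returns only "education", while the second returns "education", "economy" and "housing", matching the behaviour described above for peaked and flat score distributions.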

License

Prepared by Flax & Teal Limited for ONS Alpha project. Copyright © 2022, Office for National Statistics (https://www.ons.gov.uk)

Released under MIT license, see LICENSE for details.
