Identify locations and tag them with UN-LOCODEs and ISO-3166-2 subdivisions.

berlin-rs

A Python/Rust microservice to identify locations and tag them with UN-LOCODEs and ISO-3166-2 subdivisions.

Getting started

To test the Rust API locally:

  make run

This will make an API available on port 3001. It serves simple requests of the form:

curl 'http://localhost:3001/berlin/search?q=house+prices+in+londo&state=gb' | jq

Replace localhost with your local endpoint; jq is used only to format the JSON output.

This will return results of the form:

{
  "time": "32.46ms",
  "query": {
    "raw": "house prices in londo",
    "normalized": "house prices in londo",
    "stop_words": [
      "in"
    ],
    "codes": [],
    "exact_matches": [
      "house"
    ],
    "not_exact_matches": [
      "house prices",
      "prices in",
      "prices",
      "in londo",
      "londo"
    ],
    "state_filter": "gb",
    "limit": 1,
    "levenshtein_distance": 2
  },
  "results": [
    {
      "loc": {
        "encoding": "UN-LOCODE",
        "id": "gb:lon",
        "key": "UN-LOCODE-gb:lon",
        "names": [
          "london"
        ],
        "codes": [
          "lon"
        ],
        "state": [
          "gb",
          "united kingdom of great britain and northern ireland"
        ],
        "subdiv": [
          "lnd",
          "london, city of"
        ]
      },
      "score": 1346,
      "offset": {
        "start": 16,
        "end": 21
      }
    }
  ]
}
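The same request can be made from Python. The following is a minimal sketch using only the standard library, not part of berlin itself: the helper names (search_url, search, matched_spans) are hypothetical, and it assumes that the "offset" values are character positions into the raw query with an exclusive end, which is consistent with the sample response above (16 to 21 covers "londo").

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def search_url(query, state="gb", base="http://localhost:3001"):
    """Build the search URL; urlencode turns spaces into '+'."""
    return f"{base}/berlin/search?{urlencode({'q': query, 'state': state})}"


def search(query, state="gb", base="http://localhost:3001"):
    """Call a running berlin API and return the decoded JSON response."""
    with urlopen(search_url(query, state, base)) as resp:
        return json.load(resp)


def matched_spans(response):
    """Return (location id, matched text, score) for each result.

    Assumption: offsets index into the raw query, end exclusive.
    """
    raw = response["query"]["raw"]
    spans = []
    for result in response["results"]:
        offset = result["offset"]
        matched = raw[offset["start"]:offset["end"]]
        spans.append((result["loc"]["id"], matched, result["score"]))
    return spans
```

With the API running locally, matched_spans(search("house prices in londo")) would list which part of the query each location was matched against.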

A Python wheel can also be built, using

  make wheels
  pip install build/wheels/berlin-0.1.0-xyz.whl

where xyz is your architecture.

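Afterwards, berlin should be usable from a Python shell or script. For example:

import berlin

# Load the location database from the data directory
db = berlin.load('../data')
# Query restricted to state 'gb'; the final argument (1) is presumably the result limit
loc = db.query('manchester population', 'gb', 1)[0]
print("location:", loc.words)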

Description

Berlin is a location search engine which works on an in-memory collection of all UN-LOCODEs, subdivisions and states (countries). Here are the main architectural highlights:

On startup Berlin does a basic linguistic analysis of the locations: split names into words, remove diacritics, transliterate non-ASCII symbols to ASCII. For example, this allows us to find “Las Vegas” when searching for “vegas”.

It employs string interning in order to both optimise memory usage and allow direct lookups for exact matches. If we can resolve (parts of) the search term to an existing interned string, it means that we have a location with this name in the database.
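The startup analysis and the interning table can be illustrated with a short Python sketch. This is not Berlin's implementation: normalize, Interner and build_index are hypothetical names, and diacritics are stripped with the standard unicodedata module only, so full transliteration of non-Latin scripts is out of scope here.

```python
import unicodedata


def normalize(name):
    """Lowercase a name and strip diacritics (e.g. 'Zürich' -> 'zurich')."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


class Interner:
    """Map each distinct string to a small integer id, stored once."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def intern(self, text):
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def lookup(self, text):
        """Return the id of an already interned string, or None."""
        return self._ids.get(text)


def build_index(names):
    """Index location ids by the interned id of every word in their name."""
    interner = Interner()
    by_word = {}
    for loc_id, name in names.items():
        for word in normalize(name).split():
            by_word.setdefault(interner.intern(word), set()).add(loc_id)
    return interner, by_word
```

In this sketch, a search word that lookup resolves to an id is an exact match by construction, and a word that returns None is known to be absent without scanning any names.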

When the user submits the search term, Berlin first does a preliminary analysis of it: 1) split it into words and pairs of words; 2) try to identify these as existing locations (those that can be resolved to existing interned strings) and tag them as “exact matches”. This creates many search terms from the original phrase.

Next comes a pre-filtering step, which does three things: 1) resolve exact matches by direct lookup in the names and codes tables; 2) do a prefix search via a finite-state transducer; 3) do a fuzzy search via a Levenshtein-distance-enabled finite-state transducer.

The pre-filtered results are passed through a string-similarity evaluation algorithm and sorted by score. Results below a threshold are truncated.

A graph is then built from the locations found in the previous step in order to link them together hierarchically where possible. This further boosts some locations. For example, if the user searches for “new york UK”, the location in Lincolnshire is boosted and shows up higher than New York City in the USA. It is also possible to restrict the search to a specific country (this is enabled by default for the UK).
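The term generation and pre-filtering steps can be sketched in Python as follows. This is an illustration under simplifying assumptions, not Berlin's code: it replaces the finite-state transducers with a plain linear scan and a textbook dynamic-programming Levenshtein distance, and the function names are hypothetical.

```python
def candidate_terms(query):
    """Split a query into single words followed by adjacent word pairs."""
    words = query.lower().split()
    pairs = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + pairs


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def prefilter(term, names, max_distance=2):
    """Return (name, kind) candidates: exact first, else prefix, then fuzzy."""
    if term in names:
        return [(term, "exact")]
    hits = [(name, "prefix") for name in names if name.startswith(term)]
    seen = {name for name, _ in hits}
    hits += [(name, "fuzzy") for name in names
             if name not in seen and levenshtein(term, name) <= max_distance]
    return hits
```

The default max_distance of 2 mirrors the "levenshtein_distance": 2 field in the sample response. A linear scan like this grows with the number of names, which is the cost a transducer-based index is meant to avoid.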

Berlin is able to find locations with a high degree of semantic accuracy. Query time is roughly 10-15 ms per non-matching word (or typo) plus 1 ms per exact match. A complex query of 8 words usually takes less than 100 ms, and all of the realistic queries in our test suite take less than 50 ms, with a median under 30 ms. Short queries containing an exact match (case-insensitive) take less than 10 ms.

The architecture would make it easy to implement as-you-type search suggestions that respond in under 10 milliseconds, if desired.
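As a rough illustration of what as-you-type suggestions involve, the sketch below does a prefix lookup over a sorted list of names with binary search. It is a stand-in for the finite-state transducer prefix search described above, not a feature of berlin, and suggest is a hypothetical name.

```python
import bisect


def suggest(prefix, sorted_names, limit=5):
    """Return up to `limit` names starting with `prefix`.

    `sorted_names` must be sorted; all names sharing a prefix are then
    contiguous, beginning at the insertion point of the prefix itself.
    """
    start = bisect.bisect_left(sorted_names, prefix)
    suggestions = []
    for name in sorted_names[start:start + limit]:
        if not name.startswith(prefix):
            break
        suggestions.append(name)
    return suggestions
```

Each lookup costs a binary search plus at most limit comparisons, so it stays cheap as the user types additional characters.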

License

Prepared by Flax & Teal Limited for ONS Alpha project. Copyright © 2022, Office for National Statistics (https://www.ons.gov.uk)

Released under the MIT license; see LICENSE for details.
