Identify locations and tag them with UN-LOCODEs and ISO-3166-2 subdivisions.

berlin-rs

A Python/Rust microservice to identify locations and tag them with UN-LOCODEs and ISO-3166-2 subdivisions.

Getting started

To test the Rust API locally:

  make run

This will make an API available on port 3001. It serves simple requests of the form:

curl 'http://localhost:3001/berlin/search?q=house+prices+in+londo&state=gb' | jq

Replace localhost with your local endpoint if it differs (jq is used only to format the JSON output).

This will return results of the form:

{
  "time": "32.46ms",
  "query": {
    "raw": "house prices in londo",
    "normalized": "house prices in londo",
    "stop_words": [
      "in"
    ],
    "codes": [],
    "exact_matches": [
      "house"
    ],
    "not_exact_matches": [
      "house prices",
      "prices in",
      "prices",
      "in londo",
      "londo"
    ],
    "state_filter": "gb",
    "limit": 1,
    "levenshtein_distance": 2
  },
  "results": [
    {
      "loc": {
        "encoding": "UN-LOCODE",
        "id": "gb:lon",
        "key": "UN-LOCODE-gb:lon",
        "names": [
          "london"
        ],
        "codes": [
          "lon"
        ],
        "state": [
          "gb",
          "united kingdom of great britain and northern ireland"
        ],
        "subdiv": [
          "lnd",
          "london, city of"
        ]
      },
      "score": 1346,
      "offset": {
        "start": 16,
        "end": 21
      }
    }
  ]
}
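For callers that prefer Python to curl, the request and response above could be handled roughly as follows. This is a minimal sketch using only the standard library. The helper names build_search_url, search and matched_text are illustrative and not part of berlin, and reading offset as character positions into the raw query is an assumption based on the sample response.

```python
import json
import urllib.parse
import urllib.request


def build_search_url(base_url, query, state):
    """Build a /berlin/search URL; q and state are the parameters shown above."""
    params = urllib.parse.urlencode({"q": query, "state": state})
    return f"{base_url}/berlin/search?{params}"


def search(base_url, query, state):
    """Send the request and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(build_search_url(base_url, query, state)) as resp:
        return json.load(resp)


def matched_text(response):
    """Return (location id, matched slice of the raw query) for each result.

    Assumption: offset.start/end index into query.raw, as the sample suggests.
    """
    raw = response["query"]["raw"]
    return [
        (r["loc"]["id"], raw[r["offset"]["start"]:r["offset"]["end"]])
        for r in response["results"]
    ]


# Trimmed copy of the sample response above, so the sketch runs offline.
sample = {
    "query": {"raw": "house prices in londo"},
    "results": [{"loc": {"id": "gb:lon"}, "offset": {"start": 16, "end": 21}}],
}
print(build_search_url("http://localhost:3001", "house prices in londo", "gb"))
print(matched_text(sample))
```

With a server started by make run, search("http://localhost:3001", "house prices in londo", "gb") should return the full dictionary that matched_text expects.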

A Python wheel can also be built using:

  make wheels
  pip install build/wheels/berlin-0.1.0-xyz.whl

where xyz stands for the Python, ABI and platform tags of the wheel built for your machine.

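Afterwards, berlin should be usable from a Python shell or script. Example:

import berlin

# Load the location database from the data directory.
db = berlin.load('../data')
# The arguments appear to be the search text, a state filter and a result limit
# (compare state_filter and limit in the JSON response above).
loc = db.query('manchester population', 'gb', 1)[0]
print("location:", loc.words)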

Description

Berlin is a location search engine which works on an in-memory collection of all UN Locodes, subdivisions and states (countries). Here are the main architectural highlights:

On startup Berlin does a basic linguistic analysis of the locations: split names into words, remove diacritics, transliterate non-ASCII symbols to ASCII. For example, this allows us to find “Las Vegas” when searching for “vegas”.

It employs string interning in order to both optimise memory usage and allow direct lookups for exact matches. If we can resolve (parts of) the search term to an existing interned string, it means that we have a location with this name in the database.
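The startup analysis and the interning-based lookup can be illustrated in a few lines of Python. This is a minimal sketch and not Berlin's Rust implementation: the names normalize, Interner and exact_lookup are hypothetical, the three locations are sample data, and Unicode decomposition only removes diacritics, so it does not transliterate letters such as ß or ø.

```python
import unicodedata


def normalize(name):
    """Lowercase, strip diacritics and split a name into words."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_ish = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return ascii_ish.lower().split()


class Interner:
    """Map each distinct string to a small integer id, stored once."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def intern(self, text):
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def lookup(self, text):
        """Return the id if the string is already interned, else None."""
        return self._ids.get(text)


interner = Interner()
locations_by_word = {}  # interned word id -> set of location names

# Sample data only; the real service loads UN-LOCODEs and subdivisions.
for name in ["Las Vegas", "São Paulo", "London"]:
    for word in normalize(name):
        locations_by_word.setdefault(interner.intern(word), set()).add(name)


def exact_lookup(term):
    """Resolve a search word by direct lookup; empty set if it was never interned."""
    word_id = interner.lookup(term.lower())
    return locations_by_word.get(word_id, set())


print(exact_lookup("vegas"))   # {'Las Vegas'}
print(exact_lookup("sao"))     # {'São Paulo'}
print(exact_lookup("prices"))  # set()
```

A word that was never interned cannot name a location, so the lookup rejects it without scanning any location records.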

When the user submits the search term, Berlin first does a preliminary analysis of it: 1) split it into words and pairs of words; 2) try to identify these as existing locations (that is, resolve them to existing interned strings) and tag the ones that resolve as “exact matches”. This creates many search terms from the original phrase.

Pre-filtering step. Here we do three things: 1) resolve exact matches by direct lookup in the names and codes tables; 2) do a prefix search via a finite-state transducer; 3) do a fuzzy search via a Levenshtein-distance-enabled finite-state transducer.

The pre-filtered results are passed through a string-similarity evaluation algorithm and sorted by score. Results below a threshold are truncated.

A graph is built from the locations found during the previous step in order to link them together hierarchically if possible. This further boosts some locations. For example, if the user searches for “new york UK”, the location in Lincolnshire is boosted and shows up higher than New York City in the USA.

It is also possible to restrict the search to a specific country (this is enabled by default for the UK).
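The query analysis and pre-filtering steps can be sketched as follows. This is an illustration under stated assumptions, not Berlin's code: it uses a plain dynamic-programming edit distance and a linear scan where Berlin uses finite-state transducers, the function names are hypothetical, and the set of known names is sample data ("house" is included only because the sample response lists it as an exact match).

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def candidate_terms(query, stop_words):
    """Single words (minus stop words) plus pairs of adjacent words."""
    words = query.lower().split()
    singles = [w for w in words if w not in stop_words]
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return singles + pairs


def analyse(query, known_names, stop_words, max_distance=2):
    """Split terms into exact / not exact, then pre-filter the latter."""
    exact, not_exact = [], []
    for term in candidate_terms(query, stop_words):
        (exact if term in known_names else not_exact).append(term)
    # Pre-filter: keep names the term is a prefix of, or within max_distance edits.
    fuzzy = {}
    for term in not_exact:
        hits = [n for n in known_names
                if n.startswith(term) or levenshtein(term, n) <= max_distance]
        if hits:
            fuzzy[term] = sorted(hits)
    return exact, not_exact, fuzzy


known = {"house", "london", "manchester"}  # sample data
exact, not_exact, fuzzy = analyse("house prices in londo", known, {"in"})
print(exact)   # ['house']
print(fuzzy)   # {'londo': ['london']}
```

The terms produced for this query are the same ones listed under exact_matches and not_exact_matches in the sample response, although their order may differ.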

Berlin is able to find locations with a high degree of semantic accuracy. Query time is roughly 10-15 ms for every non-matching word (or typo) plus 1 ms for every exact match. A complex query of 8 words usually takes less than 100 ms, and all of the realistic queries in our test suite take less than 50 ms, with a median under 30 ms. Short queries containing an exact match (case-insensitive) complete in under 10 ms.
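The rule of thumb above can be written out as arithmetic. The function below is only a back-of-the-envelope sketch: estimate_latency_ms is a hypothetical name, and counting individual words (rather than all generated search terms) is an assumption.

```python
def estimate_latency_ms(non_matching, exact):
    """Return a (low, high) estimate in milliseconds.

    Uses the figures quoted above: 10-15 ms per non-matching word,
    1 ms per exact match.
    """
    low = 10 * non_matching + 1 * exact
    high = 15 * non_matching + 1 * exact
    return low, high


# "house prices in londo": "prices" and "londo" do not match, "house" does.
print(estimate_latency_ms(2, 1))  # (21, 31)
# An 8-word query with 5 non-matching words and 3 exact matches.
print(estimate_latency_ms(5, 3))  # (53, 78)
```

The first estimate is in the same range as the 32.46ms reported in the sample response, and the second stays under the 100 ms quoted for complex queries.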

The architecture would make it easy to implement as-you-type search suggestions that respond in under 10 milliseconds, if desired.
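As an illustration of what as-you-type suggestions involve, the sketch below answers each keystroke with a prefix lookup. It is not Berlin's implementation: a sorted list with binary search stands in for the finite-state transducer, and suggest and the sample names are hypothetical.

```python
import bisect


def suggest(prefix, sorted_names, limit=5):
    """Return up to `limit` names starting with `prefix` from a sorted list."""
    prefix = prefix.lower()
    # Binary search for the first name that is >= the prefix.
    i = bisect.bisect_left(sorted_names, prefix)
    out = []
    while (i < len(sorted_names) and len(out) < limit
           and sorted_names[i].startswith(prefix)):
        out.append(sorted_names[i])
        i += 1
    return out


# Sample data only; names are assumed to be normalised to lowercase.
names = sorted(["london", "londonderry", "long beach", "manchester", "las vegas"])
for typed in ["l", "lo", "lon", "lond"]:
    print(typed, suggest(typed, names))
```

Each lookup costs one binary search plus the size of the returned list, which is why prefix queries are a natural fit for per-keystroke suggestions.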

License

Prepared by Flax & Teal Limited for ONS Alpha project. Copyright © 2022, Office for National Statistics (https://www.ons.gov.uk)

Released under MIT license, see LICENSE for details.
