A super-fast canonical name lookup service

juditha

A super-fast lookup service for canonical names based on redis and configurable fallback upstream sources (currently Aleph and Wikipedia).

juditha aims to solve the noise/garbage problem that occurs when working with Named Entity Recognition (NER). Given the availability of huge lists of known names, such as company registries or lists of persons of interest, one can canonize NER results against this service to check whether they are known.

The implementation uses a pre-populated redis cache that can fall back to other sources.

quickstart

pip install juditha

start local redis

docker run -p 6379:6379 redis

populate

printf "Jane Doe\nAlice\n" | juditha load

lookup

juditha lookup "jane doe"
"Jane Doe"

For fuzzier matching, lower the threshold (default 0.97):

juditha lookup "doe, jane" --threshold 0.5
"Jane Doe"

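How a threshold decides between a match and no match can be illustrated with a minimal sketch. It assumes a simple character-level similarity from Python's difflib; juditha's actual scoring is not described here and may well differ, and `similarity` and `lookup_sketch` are hypothetical names used only for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (assumption: not juditha's real metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lookup_sketch(query: str, known: Sequence[str], threshold: float = 0.97) -> Optional[str]:
    """Return the best-scoring known name, or None if it scores below the threshold."""
    if not known:
        return None
    best = max(known, key=lambda name: similarity(query, name))
    return best if similarity(query, best) >= threshold else None


known_names = ["Jane Doe", "Alice"]
print(lookup_sketch("jane doe", known_names))                  # identical after lowercasing
print(lookup_sketch("doe, jane", known_names))                 # reordered name scores below 0.97
print(lookup_sketch("doe, jane", known_names, threshold=0.4))  # a lower threshold accepts it
```

A high default keeps false positives out; lowering the threshold trades precision for recall, so it is worth checking a sample of results before using a low value in bulk.
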
data import

from ftm entities

cat entities.ftm.json | juditha load --from-entities
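
For orientation, a FollowTheMoney entity file holds one JSON object per line, with property values stored as lists under `properties`. The sketch below collects the `name` values from such lines. Which properties `juditha load --from-entities` actually reads is an assumption here, `iter_names` is a hypothetical helper, and the sample entities are made up.

```python
import json
from typing import Iterable, Iterator


def iter_names(lines: Iterable[str]) -> Iterator[str]:
    """Yield every `name` value from line-delimited FollowTheMoney entities."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entity = json.loads(line)
        # FtM stores each property as a list of string values
        yield from entity.get("properties", {}).get("name", [])


sample = [
    '{"id": "p1", "schema": "Person", "properties": {"name": ["Jane Doe"]}}',
    '{"id": "c1", "schema": "Company", "properties": {"name": ["ACME Ltd", "ACME Limited"]}}',
    '{"id": "x1", "schema": "Address", "properties": {"full": ["1 Main Street"]}}',
]
print(list(iter_names(sample)))
```

Entities without a `name` property, like the address above, contribute nothing.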

from anywhere

juditha load -i s3://my_bucket/names.txt
juditha load -i https://data.ftm.store/eu_authorities/entities.ftm.json --from-entities

a complete dataset or catalog

Following the nomenklatura specification, a dataset JSON config needs a names.txt or entities.ftm.json entry in its resources.

juditha load-dataset https://data.ftm.store/eu_authorities/index.json
juditha load-catalog https://data.ftm.store/investigraph/catalog.json
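
The resource requirement can be checked before loading with a small sketch like the following. It assumes that the dataset config has a `resources` list whose items carry `name` and `url` keys, as in nomenklatura dataset metadata; `find_loadable_resource`, the preference for names.txt, and the example.org URLs are hypothetical and say nothing about how juditha picks a resource.

```python
from typing import Optional

LOADABLE = ("names.txt", "entities.ftm.json")  # resource names mentioned above


def find_loadable_resource(dataset: dict) -> Optional[dict]:
    """Return a resource named names.txt or entities.ftm.json, else None.

    Assumption: names.txt is preferred when both exist.
    """
    for wanted in LOADABLE:
        for resource in dataset.get("resources", []):
            if resource.get("name") == wanted:
                return resource
    return None


dataset = {
    "name": "example_dataset",  # made-up dataset config
    "resources": [
        {"name": "entities.ftm.json", "url": "https://example.org/entities.ftm.json"},
        {"name": "readme.md", "url": "https://example.org/readme.md"},
    ],
}
print(find_loadable_resource(dataset))
print(find_loadable_resource({"name": "empty", "resources": []}))
```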

use in python applications

from juditha import lookup

assert lookup("jane doe") == "Jane Doe"
assert lookup("doe, jane") is None
assert lookup("doe, jane", threshold=0.5) == "Jane Doe"
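
Applied to NER output, the same call can filter out candidates that are not known names. The sketch below takes the lookup function as a parameter so that it runs without redis; `canonize_all`, `fake_lookup` and the sample data are hypothetical and stand in for a real setup.

```python
from typing import Callable, Dict, Iterable, Optional


def canonize_all(
    candidates: Iterable[str], lookup: Callable[[str], Optional[str]]
) -> Dict[str, str]:
    """Map each candidate with a known canonical name to that name; drop the rest."""
    result = {}
    for candidate in candidates:
        canonical = lookup(candidate)
        if canonical is not None:
            result[candidate] = canonical
    return result


# Stand-in for `juditha.lookup`, so the example runs without a redis instance.
KNOWN = {"jane doe": "Jane Doe", "alice": "Alice"}


def fake_lookup(name: str) -> Optional[str]:
    return KNOWN.get(name.lower())


ner_output = ["jane doe", "Tuesday", "ALICE", "the committee"]
print(canonize_all(ner_output, fake_lookup))
```

In an application, pass `lookup` from juditha instead of `fake_lookup`, for example wrapped with `functools.partial(lookup, threshold=0.9)` to set a threshold.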

run as api

uvicorn --port 8000 juditha.api:app --workers 8

api calls

Send a HEAD request to check whether a name is known:

curl -I "http://localhost:8000/jane%20doe"
HTTP/1.1 200 OK

curl -I "http://localhost:8000/John"
HTTP/1.1 404 Not Found

A GET request returns the canonized name:

curl "http://localhost:8000/doe,%20jane?threshold=0.5"
Jane Doe
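
The same calls can be made from Python with the standard library alone. This is a minimal sketch based only on the curl examples above (name in the path, optional `threshold` query parameter, 200 or 404 status); `build_lookup_url` and `is_known` are hypothetical helpers, not part of juditha.

```python
from typing import Optional
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import Request, urlopen


def build_lookup_url(base_url: str, name: str, threshold: Optional[float] = None) -> str:
    """Build the lookup URL: the name is the path, the threshold an optional query parameter."""
    url = f"{base_url.rstrip('/')}/{quote(name, safe='')}"
    if threshold is not None:
        url += f"?threshold={threshold}"
    return url


def is_known(base_url: str, name: str, threshold: Optional[float] = None) -> bool:
    """Send a HEAD request: 200 means the name is known, 404 means it is not."""
    request = Request(build_lookup_url(base_url, name, threshold), method="HEAD")
    try:
        with urlopen(request, timeout=5) as response:
            return response.status == 200
    except HTTPError as err:
        if err.code == 404:
            return False
        raise  # any other status is a real error


# Only builds the URL; no request is sent here.
print(build_lookup_url("http://localhost:8000", "doe, jane", threshold=0.5))
```

Here the comma is percent-encoded as `%2C`, which a server decodes to the same path as the literal comma in the curl example. With the API running, `is_known("http://localhost:8000", "jane doe")` would perform the HEAD check.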

settings

Set the redis endpoint via an environment variable:

REDIS_URL=redis://localhost:6379

sources

Create a yaml config:

sources:
  - klass: aleph
    config:
      host: https://aleph.investigativedata.org
      # api_key: ...
  - klass: aleph
    config:
      host: https://aleph.occrp.org
      # api_key: ...
  - klass: wikipedia
    config:
      url: https://de.wikipedia.org

Store this as a file (e.g. config.yml) and point to it via the JUDITHA_CONFIG environment variable:

JUDITHA_CONFIG=config.yml juditha lookup "Juditha Dommer"

If a name is not found in redis, juditha queries the fallback sources in the given order. The results are stored in redis for the next call.
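
The cache-then-fallback behaviour described above can be sketched as follows. A plain dict stands in for redis and two stub functions stand in for the Aleph and Wikipedia sources; the lowercased cache key and all names in the block are assumptions for illustration, not juditha's internals.

```python
from typing import Callable, Dict, List, Optional

Source = Callable[[str], Optional[str]]


def lookup_with_fallback(name: str, cache: Dict[str, str], sources: List[Source]) -> Optional[str]:
    """Check the cache first, then each source in order; cache the first hit."""
    key = name.lower()  # assumption: case-insensitive cache key
    if key in cache:
        return cache[key]
    for source in sources:
        result = source(name)
        if result is not None:
            cache[key] = result  # the next call is served from the cache
            return result
    return None


calls = []  # records which stub sources were queried


def aleph_stub(name: str) -> Optional[str]:
    calls.append("aleph")
    return None  # pretend this source does not know the name


def wikipedia_stub(name: str) -> Optional[str]:
    calls.append("wikipedia")
    return "Juditha Dommer" if name.lower() == "juditha dommer" else None


cache: Dict[str, str] = {}
sources = [aleph_stub, wikipedia_stub]
print(lookup_with_fallback("juditha dommer", cache, sources))  # resolved by the second source
print(lookup_with_fallback("juditha dommer", cache, sources))  # served from the cache
print(calls)  # each source was queried only once
```

Because sources are tried in order, placing the fastest or most trusted source first keeps fallback lookups cheap.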

use remote juditha

The juditha client can use a remote api endpoint of a deployed juditha:

JUDITHA=https://juditha.ftm.store juditha lookup "HIMATIC EXPLOTACIONES SL"

from juditha import Juditha

j = Juditha("https://juditha.ftm.store")
assert j.lookup("HIMATIC EXPLOTACIONES SL") is not None

the name

Juditha Dommer was the daughter of a coppersmith and raised seven children, while her husband Johann Pachelbel wrote a canon.
