Skip to main content

Library for semantic annotation of tabular data with the Wikidata knowledge graph

Project description

bbw (boosted by wiki)

PyPI version Binder

  • Annotates tabular data with the entities, types and properties in Wikidata.
  • Easy to use: bbw.annotate().
  • Resolves even tricky spelling mistakes via meta-lookup through SearX.
  • Matches to the up-to-date values in Wikidata without the dump files.
  • Ranked in third place at SemTab2020.

Table of contents

How to use

Import library

from bbw import bbw

The easiest way to annotate the dataframe Y is:

[web_table, url_table, label_table, cpa, cea, cta] = bbw.annotate(Y)

It returns a list of six dataframes. The first three dataframes contain the annotations in the form of HTML-links, URLs and labels of the entities in Wikidata correspondingly. The dataframes have two more rows than Y. These two rows contain the annotations for types and properties. The last three dataframes contain the annotations in the format required by SemTab2020 challenge.

The fastest way to annotate the dataframe Y is:

[cpa_list, cea_list, nomatch] = bbw.contextual_matching(bbw.preprocessing(Y))
[cpa, cea, cta] = bbw.postprocessing(cpa_list, cea_list)

The dataframes cpa, cea and cta contain the annotations in SemTab2020-format. The list nomatch contains the labels which are not matched. The unprocessed and possibly non-unique annotations are in the lists cpa_list and cea_list.

GUI

If you need to annotate only one table, use the simple GUI:

streamlit run bbw_gui.py

Open the browser at http://localhost:8501 and choose a CSV-file. The annotation process starts automatically. It outputs the six tables of the annotate function.

CLI

If you need to annotate a few tables, use the CLI-tool:

python3 bbw_cli.py --amount 100 --offset 0

GNU parallel

If you need to annotate hundreds or thousands of tables, use the script with GNU parallel:

./bbw_parallel.py

Installation

You can use pip to install bbw:

pip install bbw

The latest version can be installed directly from github:

pip install git+https://github.com/UB-Mannheim/bbw

You can test bbw in a virtual environment:

pip install virtualenv
virtualenv testing_bbw
source testing_bbw/bin/activate
python
from bbw import bbw
[web_table, url_table, label_table, cpa, cea, cta] = bbw.annotate(bbw.pd.DataFrame([['0','1'],['Mannheim','Rhine']]))
print(web_table)
deactivate

Install also SearX, because bbw meta-lookups through it.

export PORT=80
docker pull searx/searx
docker run --rm -d -v ${PWD}/searx:/etc/searx -p $PORT:8080 -e BASE_URL=http://localhost:$PORT/ searx/searx

SearX is running on http://localhost:80. bbw sends GET requests to it.

Citing

If you find bbw useful in your work, a proper reference would be:

@inproceedings{2020_bbw,
  author    = {Renat Shigapov and Philipp Zumstein and Jan Kamlah and Lars Oberl{\"a}nder and J{\"o}rg Mechnich and Irene Schumm},
  title     = {bbw: {M}atching {CSV} to {W}ikidata via {M}eta-lookup},
  booktitle = {SemTab@ISWC 2020},
  year = {2020}
}

SemTab2020

The library was designed, implemented and tested during SemTab2020. It received the best scores in the last 4th round at automatically generated dataset:

Task F1-score Precision Rank
CPA 0.995 0.996 2
CTA 0.980 0.980 2
CEA 0.978 0.984 4

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bbw-0.1.1.tar.gz (22.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

bbw-0.1.1-py3-none-any.whl (17.4 kB view details)

Uploaded Python 3

File details

Details for the file bbw-0.1.1.tar.gz.

File metadata

  • Download URL: bbw-0.1.1.tar.gz
  • Upload date:
  • Size: 22.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.52.0 CPython/3.8.6

File hashes

Hashes for bbw-0.1.1.tar.gz
Algorithm Hash digest
SHA256 436455e2194c9ebc3cf951b27f0125ca3ce49ff20d64d93bc092e5c6b13b8e38
MD5 0baba889bbf28d7f359e6eb9d721ebbc
BLAKE2b-256 734f994b57be268974d853742ab911f928467a898e3f05b4f2fffb0d8f647028

See more details on using hashes here.

File details

Details for the file bbw-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: bbw-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 17.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.52.0 CPython/3.8.6

File hashes

Hashes for bbw-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 19c626171522a7d1ae088329be655b9b566c630c7afac961b8b45d90bea240b7
MD5 2c5f86e707345421d358800e2294e3aa
BLAKE2b-256 3e7f8e84145dcaa95e0fc5fc479669fb51c5362a4421283836560f7af0c60e9f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page