A search engine for Open Data.

Find Open Data

Table of Contents:

Introduction
System Overview
Development Guide
Cloud Storage Systems
Crawler Guide

Introduction

This is the source code repository for findopendata.com. The project goal is to make a search engine for Open Data with rich features beyond simple keyword search. The current search methods are:

Keyword search based on metadata
Similar dataset search based on metadata similarity
Joinable table search based on content (i.e., data values) similarity using LSH index
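To make the idea of joinable table search concrete, the sketch below ranks candidate columns by the fraction of a query column's distinct values that they contain. It computes this containment exactly over small in-memory sets. The LSH index mentioned above is what makes a comparable lookup feasible at scale, and it is not reproduced here. The function names, the threshold, and the sample columns are illustrative assumptions, not part of this repository.

```python
def containment(query_values, candidate_values):
    """Fraction of distinct query values that also appear in the candidate."""
    query, candidate = set(query_values), set(candidate_values)
    if not query:
        return 0.0
    return len(query & candidate) / len(query)


def rank_joinable(query_values, columns, threshold=0.5):
    """Return (column name, containment) pairs at or above threshold, best first."""
    scored = [
        (name, containment(query_values, values))
        for name, values in columns.items()
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical columns drawn from three different tables.
columns = {
    "parks.city": ["Toronto", "Ottawa", "Hamilton", "London"],
    "budget.city": ["Toronto", "Ottawa"],
    "weather.station": ["YYZ", "YOW"],
}
results = rank_joinable(["Toronto", "Ottawa", "Hamilton"], columns)
```

Here `parks.city` contains every query value and `budget.city` contains two of the three, so both pass the threshold, while `weather.station` shares no values and is dropped.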

Next steps:

Unionable/similar table search based on content similarity
Time and location-based search based on extracted timestamps and Geo tags
Dataset versioning
API for external data science tools (e.g., Jupyter Notebook, Plot.ly)

This is a work in progress.

System Overview

The Find Open Data system has the following components:

Frontend: a React app, located in frontend.
API Server: a Flask web server, located in apiserver.
LSH Server: a Go web server, located in lshserver.
Crawler: a set of Celery tasks, located in findopendata.

The Frontend, the API Server, and the LSH Server can be deployed to Google App Engine.

We also use two external storage systems for persistence:

A PostgreSQL database for storing dataset registry, metadata, and sketches for content-based search.
A cloud-based storage system for storing dataset files, currently supporting Google Cloud Storage and Azure Blob Storage. A local storage using file system is also available.

Development Guide

To develop locally, you need the following:

PostgreSQL 9.6 or above
RabbitMQ

1. Install PostgreSQL

PostgreSQL (version 9.6 or above) is used by the crawler to register and save the summaries of crawled datasets. It is also used by the API Server as the database backend. If you are using Cloud SQL Postgres, you need to download Cloud SQL Proxy and make it executable.

Once the PostgreSQL database is running, create a database, and use the SQL scripts in sql to create tables:

psql -f sql/create_crawler_tables.sql
psql -f sql/create_metadata_tables.sql
psql -f sql/create_sketch_tables.sql

2. Install RabbitMQ

RabbitMQ is required to manage and queue crawl tasks. On Mac OS X you can install it using Homebrew.

Start the RabbitMQ server once the installation finishes.

3. Python Environment

It is recommended to use virtualenv for Python development and dependencies:

virtualenv -p python3 .venv
source .venv/bin/activate # .venv\Scripts\activate on Windows

python-snappy requires libsnappy. On Ubuntu, install it with sudo apt-get install libsnappy-dev. On Mac OS X, use brew install snappy. On Windows, instead of the python-snappy binary on PyPI, use the unofficial binary maintained by UC Irvine and install it directly, for example (Python 3.7, amd64):

pip install python_snappy-0.5.4-cp37-cp37m-win_amd64.whl

Finally, install this package and other dependencies:

pip install -e .

4. Configuration File

Create a configs.yaml by copying configs-example.yaml, then complete the fields related to PostgreSQL and storage.

If you plan to store all datasets on your local file system, skip the gcp and azure sections, complete only the local section, and set storage.provider to local.

For cloud-based storage systems, see Cloud Storage Systems.

Cloud Storage Systems

Currently we support using Google Cloud Storage and Azure Blob Storage as the dataset storage system.

To use Google Cloud Storage, you need:

A Google Cloud project with Cloud Storage enabled, and a bucket created.
A Google Cloud service account key file (JSON formatted) with read and write access to the Cloud Storage bucket.
Set storage.provider to gcp in configs.yaml.

To use Azure Blob Storage, you need:

An Azure storage account enabled, and a blob storage container created.
A connection string to access the storage account.
Set storage.provider to azure in configs.yaml.
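The three storage.provider values described in this guide (local, gcp, azure) each select one section of configs.yaml. The sketch below shows one way a loader might pick the active section from the parsed configuration. It assumes, purely for illustration, that each provider's settings sit in a sub-section of storage named after the provider. The function name and the root key are hypothetical, so check configs-example.yaml for the real layout.

```python
SUPPORTED_PROVIDERS = ("local", "gcp", "azure")


def active_storage_section(configs):
    """Return (provider, settings) for the provider named in storage.provider.

    Assumes, for illustration only, that each provider's settings live in a
    sub-section of `storage` named after the provider.
    """
    storage = configs.get("storage", {})
    provider = storage.get("provider")
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"storage.provider must be one of {SUPPORTED_PROVIDERS}, got {provider!r}"
        )
    section = storage.get(provider)
    if not section:
        raise ValueError(f"storage.{provider} section is missing or empty")
    return provider, section


# Hypothetical parsed contents of configs.yaml; the key under `local` is made up.
configs = {"storage": {"provider": "local", "local": {"root": "/tmp/findopendata"}}}
provider, settings = active_storage_section(configs)
```

Failing early on an unknown provider or an empty section surfaces configuration mistakes before any crawl task starts.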

Crawler Guide

The crawler has a set of Celery tasks that run in parallel. It uses the RabbitMQ server to manage and queue the tasks.

Setup Crawler

Data Sources (CKAN and Socrata APIs)

The crawler uses PostgreSQL to maintain all data sources. CKAN sources are maintained in the table findopendata.ckan_apis. Socrata Discovery APIs are maintained in the table findopendata.socrata_discovery_apis. The SQL script sql/create_crawler_tables.sql has already created some initial sources for you.

To show the CKAN APIs currently available to the crawler and whether they are enabled:

SELECT * FROM findopendata.ckan_apis;

To add a new CKAN API and enable it:

INSERT INTO findopendata.ckan_apis (endpoint, name, region, enabled) VALUES
('catalog.data.gov', 'US Open Data', 'United States', true);
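The same two statements can be issued from Python with parameterized queries. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server. The table is a simplified version with only the four columns used in the INSERT above, and the helper names and the second data source are hypothetical. With a PostgreSQL driver such as psycopg2, the placeholders would be %s rather than ?.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here. Attaching a second in-memory
# database under the name "findopendata" makes the schema-qualified name work.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS findopendata")
# Simplified table; the real one is created by sql/create_crawler_tables.sql.
conn.execute(
    "CREATE TABLE findopendata.ckan_apis ("
    "endpoint TEXT PRIMARY KEY, name TEXT, region TEXT, enabled BOOLEAN)"
)


def add_ckan_api(conn, endpoint, name, region, enabled=True):
    """Register a CKAN API using a parameterized INSERT."""
    conn.execute(
        "INSERT INTO findopendata.ckan_apis (endpoint, name, region, enabled) "
        "VALUES (?, ?, ?, ?)",
        (endpoint, name, region, enabled),
    )


def enabled_endpoints(conn):
    """List the endpoints of all enabled CKAN APIs."""
    rows = conn.execute(
        "SELECT endpoint FROM findopendata.ckan_apis WHERE enabled ORDER BY endpoint"
    )
    return [row[0] for row in rows]


add_ckan_api(conn, "catalog.data.gov", "US Open Data", "United States")
# Made-up disabled source, included only to show the effect of the filter.
add_ckan_api(conn, "disabled.example.org", "Placeholder source", "Nowhere", enabled=False)
endpoints = enabled_endpoints(conn)
```

Passing values as parameters instead of formatting them into the SQL string avoids quoting problems in endpoint names and regions.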

Socrata App Tokens

Add your Socrata app tokens to the table findopendata.socrata_app_tokens. The app tokens are required for harvesting datasets from Socrata APIs.

For example:

INSERT INTO findopendata.socrata_app_tokens (token) VALUES ('<your app token>');

Run Crawler

Celery workers are processes that fetch crawler tasks from RabbitMQ and execute them. The worker processes must be started before starting any tasks.

For example:

celery -A findopendata worker -l info -Ofair

On Windows there are some issues with using the prefork process pool. Use gevent instead:

celery -A findopendata worker -l info -Ofair -P gevent
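The worker and queue pattern described in this section can be illustrated with the Python standard library alone. The sketch below is not Celery and does not talk to RabbitMQ: a queue.Queue plays the role of the broker and threads play the role of worker processes. The function name and the task payloads are hypothetical.

```python
import queue
import threading


def run_workers(tasks, handler, num_workers=2):
    """Process tasks with a pool of worker threads that pull from a shared queue."""
    task_queue = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            task = task_queue.get()
            if task is None:  # sentinel: no more work for this worker
                task_queue.task_done()
                return
            outcome = handler(task)
            with lock:
                results.append(outcome)
            task_queue.task_done()

    # Workers are started first, mirroring the requirement that worker
    # processes must be running before any tasks are started.
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for task in tasks:
        task_queue.put(task)
    for _ in threads:
        task_queue.put(None)  # one sentinel per worker
    task_queue.join()
    for thread in threads:
        thread.join()
    return results


# Hypothetical task payloads; a real crawler task would download a dataset.
harvested = run_workers(
    ["dataset-a", "dataset-b", "dataset-c"],
    lambda name: f"harvested {name}",
)
```

Because workers finish in no fixed order, the results list is unordered. Sort it or key it by task if the order matters.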

Harvest Datasets

Run harvest_datasets.py to start data harvesting tasks that download datasets from various data sources. Downloaded datasets will be stored in the storage system set in configs.yaml (for example, a Google Cloud Storage bucket), and registered in Postgres tables findopendata.ckan_packages and findopendata.socrata_resources.

Generate Metadata

Run generate_metadata.py to start metadata generation tasks for downloaded and registered datasets in findopendata.ckan_packages and findopendata.socrata_resources tables.

It generates metadata by extracting titles, descriptions, etc. and annotates them with entities for enrichment. The metadata is stored in table findopendata.packages, which is also used by the API server to serve the frontend.

Sketch Dataset Content

Run sketch_dataset_content.py to start tasks for creating sketches (e.g., MinHash, samples, data types, etc.) of dataset content (i.e., data values, columns, and records). The sketches will be used for content-based search such as finding joinable tables.
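As a rough illustration of what a column sketch can contain, the code below builds a MinHash signature, a small sample, and a guessed data type for one column of string values. It is a minimal sketch under stated assumptions, not the code in sketch_dataset_content.py: the use of salted BLAKE2b as the hash family, the signature length of 64, the sample size, and all function names are choices made for this example.

```python
import hashlib


def minhash_signature(values, num_perm=64):
    """Compute a MinHash signature: one minimum hash per salted hash function."""
    distinct = {str(v) for v in values}
    if not distinct:
        raise ValueError("cannot sketch an empty column")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt gives a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(v.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "big",
                )
                for v in distinct
            )
        )
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(1 for a, b in zip(sig_a, sig_b) if a == b) / len(sig_a)


def infer_type(values):
    """Very rough data type guess: 'integer', 'float', or 'string'."""
    def all_parse(cast):
        try:
            for v in values:
                cast(str(v))
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"


def sketch_column(values, sample_size=3):
    """Bundle a MinHash signature, a leading sample, and a type guess."""
    values = list(values)
    return {
        "minhash": minhash_signature(values),
        "sample": values[:sample_size],
        "data_type": infer_type(values),
    }


sketch = sketch_column(["10", "20", "30", "40"])
similarity = estimate_jaccard(sketch["minhash"], minhash_signature(["40", "30", "20", "10"]))
```

Two columns with the same set of distinct values always produce identical signatures, regardless of row order or duplicates, which is what lets signatures be compared instead of the full data.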

Release history

1.0.5 (this version): Feb 8, 2023
1.0.4: Jan 21, 2023
1.0.3: Dec 26, 2019
1.0.2: Dec 20, 2019
1.0.1: Dec 19, 2019
1.0.0: Nov 14, 2019

Download files

Source Distribution

findopendata-1.0.5.tar.gz (36.2 kB)

Uploaded Feb 8, 2023 Source

Built Distribution

findopendata-1.0.5-py3-none-any.whl (39.4 kB)

Uploaded Feb 8, 2023 Python 3

File details

Details for the file findopendata-1.0.5.tar.gz.

File metadata

Download URL: findopendata-1.0.5.tar.gz
Upload date: Feb 8, 2023
Size: 36.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for findopendata-1.0.5.tar.gz
Algorithm	Hash digest
SHA256	`b6c6fe762c28b5bb86262dd389c0b51f890f65102de00fcd84dcf3027806cb4a`
MD5	`5f4e54a987c51549cf852f2c4888f84c`
BLAKE2b-256	`78c65be42649f77c156a9e3253487dff171eac49a8a9fdf5942f4335803e8c23`

File details

Details for the file findopendata-1.0.5-py3-none-any.whl.

File metadata

Download URL: findopendata-1.0.5-py3-none-any.whl
Upload date: Feb 8, 2023
Size: 39.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for findopendata-1.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`927cc8b87fb7cc263ef37e50ee207787136cd00ab61f3bc30846401666ff10aa`
MD5	`fb4e4eecef93cc92ad610c3b687e2659`
BLAKE2b-256	`f71f0bcc0940a07cadaed587d01b079f1072202ad81b059f01dfb80d9c4f4a8d`
