Find Open Data
A search engine for Open Data.
This is the source code repository for findopendata.com. The goal of this project is to build a search engine for Open Data with rich features beyond simple keyword search. The current and planned features are:
- Keyword search based on metadata
- Similar dataset search based on metadata similarity
- Joinable table search based on content (i.e., data values) similarity using LSH index
- Unionable/similar table search based on content similarity
- Time- and location-based search using extracted timestamps and geo-tags
- Dataset versioning
- API for external data science tools (e.g., Jupyter Notebook, Plot.ly)
This is a work in progress.
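As an illustration of the content-based (joinable table) search idea above, here is a minimal MinHash LSH sketch using the datasketch library. This is only an illustration of the technique, not the project's actual indexing code; the table names and column values are made up.

```python
# Illustration only: joinable-column search with MinHash LSH.
from datasketch import MinHash, MinHashLSH

def minhash_column(values, num_perm=128):
    """Build a MinHash sketch from a column's distinct values."""
    m = MinHash(num_perm=num_perm)
    for v in set(values):
        m.update(str(v).encode("utf8"))
    return m

# Index columns from existing tables under string keys.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("table_a.city", minhash_column(["Boston", "Chicago", "Denver"]))
lsh.insert("table_b.location", minhash_column(["Boston", "Denver", "Seattle"]))

# Query with a column from a new table: returns candidate joinable columns.
query = minhash_column(["Boston", "Chicago", "Denver", "Miami"])
print(lsh.query(query))
```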
The Find Open Data system has the following components:
- Frontend: a React app, located in
- API Server: a Flask web server, located in
- LSH Server: a Go web server, located in
- Crawler: a set of Celery tasks, located in
The Frontend, the API Server, and the LSH Server can be deployed to Google App Engine.
We also use two external storage systems for persistence:
- A PostgreSQL database for storing dataset registry, metadata, and sketches for content-based search.
- A cloud-based storage system for storing dataset files, currently supporting Google Cloud Storage and Azure Blob Storage; a local file system backend is also available.
To develop locally, you need the following:
- PostgreSQL 9.6 or above
- RabbitMQ
- Python 3
1. Install PostgreSQL
PostgreSQL (version 9.6 or above) is used by the crawler to register and save the summaries of crawled datasets. It is also used by the API Server as the database backend. If you are using Cloud SQL Postgres, you need to download Cloud SQL Proxy and make it executable.
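For example, with the legacy cloud_sql_proxy binary, making it executable and pointing it at your instance looks roughly like this (the instance connection name and port below are placeholders; check the Cloud SQL Proxy documentation for your version):

```
chmod +x cloud_sql_proxy
./cloud_sql_proxy -instances=my-project:my-region:my-instance=tcp:5432
```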
Once the PostgreSQL database is running, create a database and use the SQL scripts in the sql directory to create the tables:
psql -f sql/create_crawler_tables.sql
psql -f sql/create_metadata_tables.sql
psql -f sql/create_sketch_tables.sql
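The commands above assume psql connects to the right database by default. If not, a more explicit invocation may be needed; the database name below is only an example:

```
createdb findopendata
psql -d findopendata -f sql/create_crawler_tables.sql
psql -d findopendata -f sql/create_metadata_tables.sql
psql -d findopendata -f sql/create_sketch_tables.sql
```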
2. Install RabbitMQ
Run the RabbitMQ server after finishing the installation.
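If you prefer not to install RabbitMQ directly, running it in a container is one convenient alternative (not a project requirement; the image tag is just an example):

```
docker run -d --name rabbitmq -p 5672:5672 rabbitmq:3
```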
3. Python Environment
We use virtualenv for Python development and dependencies:
virtualenv -p python3 pyenv
source pyenv/bin/activate
pip install -r requirements.txt
Installing the Python dependencies requires libsnappy (used by the python-snappy package). On Ubuntu you can install it with sudo apt-get install libsnappy-dev; on Mac OS X, use brew install snappy.
On Windows, instead of the python-snappy binary on PyPI, use the unofficial binary maintained by UC Irvine and install it directly, for example (Python 3.7, amd64):
pip install python_snappy-0.5.4-cp37-cp37m-win_amd64.whl
4. Configuration File
Create your own configs.yaml by copying configs-example.yaml, and complete the fields related to PostgreSQL and storage.
If you plan to store all datasets on your local file system, you can skip the azure (and other cloud storage) sections and only complete the local section, making sure the local storage backend is the one enabled.
For cloud-based storage systems, see Cloud Storage Systems.
Cloud Storage Systems
To use Google Cloud Storage, you need:
- A Google Cloud project with Cloud Storage enabled, and a bucket created.
- A Google Cloud service account key file (JSON formatted) with read and write access to the Cloud Storage bucket.
To use Azure Blob Storage, you need:
- An Azure storage account enabled, and a blob storage container created.
- A connection string to access the storage account.
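Before putting the credentials into configs.yaml, it can help to sanity-check them with the official client libraries. A minimal sketch follows; the key file path, bucket, and container names are placeholders, and the google-cloud-storage and azure-storage-blob packages are assumed to be installed:

```python
# Quick credential checks; not part of the project code.
from google.cloud import storage
from azure.storage.blob import BlobServiceClient

# Google Cloud Storage: JSON service account key + bucket name.
gcs = storage.Client.from_service_account_json("service-account.json")
print(gcs.bucket("my-findopendata-bucket").exists())

# Azure Blob Storage: connection string + container name.
blob_service = BlobServiceClient.from_connection_string("<your connection string>")
print(blob_service.get_container_client("my-container").exists())
```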
The crawler has a set of Celery tasks that run in parallel. It uses the RabbitMQ server to manage and queue the tasks.
Data Sources (CKAN and Socrata APIs)
The crawler uses PostgreSQL to maintain all data sources. CKAN sources are maintained in the table findopendata.ckan_apis, and Socrata Discovery APIs are maintained in a separate table. The SQL script sql/create_crawler_tables.sql has already created some initial sources for you.
To show the CKAN APIs currently available to the crawler and whether they are enabled:
SELECT * FROM findopendata.ckan_apis;
To add a new CKAN API and enable it:
INSERT INTO findopendata.ckan_apis (endpoint, name, region, enabled) VALUES ('catalog.data.gov', 'US Open Data', 'United States', true);
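To disable a source later, flipping the same enabled flag should suffice (the endpoint value is just an example):
UPDATE findopendata.ckan_apis SET enabled = false WHERE endpoint = 'catalog.data.gov';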
Socrata App Tokens
Add your Socrata app tokens to the table findopendata.socrata_app_tokens; the app tokens are required for harvesting datasets from Socrata APIs.
INSERT INTO findopendata.socrata_app_tokens (token) VALUES ('<your app token>');
Celery workers are processes that fetch crawler tasks from RabbitMQ and execute them. The worker processes must be started before starting any tasks.
celery -A findopendata worker -l info -Ofair
On Windows there are known issues with the prefork process pool; use the gevent pool instead:
celery -A findopendata worker -l info -Ofair -P gevent
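Depending on your machine, you may also want to control the number of worker processes; Celery's standard --concurrency option applies here (the value is just an example):
celery -A findopendata worker -l info -Ofair --concurrency=4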
Run harvest_datasets.py to start data harvesting tasks that download datasets from various data sources. Downloaded datasets will be stored on a Google Cloud Storage bucket (or other storage backend set in configs.yaml) and registered in the PostgreSQL database.
Run generate_metadata.py to start metadata generation tasks for downloaded and registered datasets. It generates metadata by extracting titles, descriptions, etc., and annotates them with entities for enrichment.
The metadata is stored in table
findopendata.packages, which is
also used by the API server to serve the frontend.
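A quick way to verify that metadata generation ran is to peek at that table with any SQL client:
SELECT * FROM findopendata.packages LIMIT 5;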
Sketch Dataset Content
Run sketch_dataset_content.py to start tasks for creating sketches (e.g., samples, data types, etc.) of dataset content (i.e., data values, columns, and records). The sketches will be used for content-based search such as finding joinable tables.
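Putting the crawler steps together, a typical cycle kicks off the three task groups in order (assuming the Celery workers from above are running and the scripts are invoked from the repository root inside the virtualenv):

```
python harvest_datasets.py          # download and register datasets
python generate_metadata.py         # extract and enrich metadata
python sketch_dataset_content.py    # build content sketches for content-based search
```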