
datasette-metasearch


Build a search index across content from multiple SQLite database tables and run faceted searches against it using Datasette.

Motivation

Different datasets may contain overlapping data without conforming to the same schema, and some datasets contain more information than others that we want to display when searching across them. We might also want to see statistics or facets across those datasets (such as how many records fall in a particular year, or were made by a particular person). datasette-metasearch enables this pattern with a config file: rather than building a pipeline to transform each dataset into a common format and then building a bespoke query interface on top of it, we specify the fields we want to index and search, and how to transform them into the common format.

The motivation for this was to join government spending datasets so they can be easily queried.

Example

A live example of this plugin is running at https://datasette.io/-/beta - configured using this YAML file.

Read more about how this example works in Building a search engine for datasette.io.

Installation

Install this tool like so:

$ pip install datasette-metasearch

Usage

Run the indexer using the datasette-metasearch command-line tool:

$ datasette-metasearch index dogsheep.db config.yml

The config.yml file contains details of the databases and document types that should be indexed:

NOTE: the database storing the search index must be different from the ones containing the data to be indexed

twitter.db:
    tweets:
        sql: |-
            select
                tweets.id as key,
                'Tweet by @' || users.screen_name as title,
                tweets.created_at as timestamp,
                tweets.full_text as search_1
            from tweets join users on tweets.user = users.id
    users:
        sql: |-
            select
                id as key,
                name || ' @' || screen_name as title,
                created_at as timestamp,
                description as search_1
            from users

This will create a search_index table in the dogsheep.db database populated by data from those SQL queries.

By default the search index that this tool creates will be configured for Porter stemming. This means that searches for words like run will match documents containing runs or running.
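Porter stemming can be demonstrated directly against SQLite's FTS5 extension. This is a minimal sketch, assuming your Python build of SQLite includes FTS5; the table and text are illustrative:

```python
import sqlite3

# Minimal demonstration of the porter tokenizer the default index uses
conn = sqlite3.connect(":memory:")
conn.execute("create virtual table docs using fts5(body, tokenize='porter')")
conn.executemany(
    "insert into docs (body) values (?)",
    [("She runs every morning",), ("Running is fun",), ("A walk in the park",)],
)

# Searching for 'run' matches the stemmed forms 'runs' and 'running'
rows = conn.execute("select body from docs where docs match 'run'").fetchall()
print(len(rows))  # 2
```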

If you don't want to use Porter stemming, use the --tokenize none option:

$ datasette-metasearch index dogsheep.db config.yml --tokenize none

You can pass other SQLite tokenize arguments here; see the SQLite FTS tokenizers documentation.
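For example, to use SQLite's unicode61 tokenizer instead (an illustrative command; the value is passed through to SQLite):

$ datasette-metasearch index dogsheep.db config.yml --tokenize unicode61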

Columns

The columns that can be returned by your queries are:

  • key - a unique (within that type) primary key
  • title - the title for the item
  • timestamp - an ISO8601 timestamp, e.g. 2020-09-02T21:00:21
  • search_1 - a larger chunk of text to be included in the search index
  • category - an integer category ID, see below
  • is_public - an integer (0 or 1, defaults to 0 if not set) specifying if this is public or not

Public records are things like your public tweets, blog posts and GitHub commits.
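For example, an indexing query that populates every one of these columns might look like this (a sketch using a hypothetical blog.db database; the table and column names are illustrative):

```yaml
blog.db:
    posts:
        sql: |-
            select
                id as key,
                title as title,
                created_at as timestamp,
                body as search_1,
                1 as category,
                is_published as is_public
            from posts
```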

Datasette plugin

Run datasette install datasette-metasearch (or use pip install datasette-metasearch in the same environment as Datasette) to install the datasette-metasearch Datasette plugin.

Once installed, a custom search interface will be made available at /-/beta. You can use this interface to execute searches.

The Datasette plugin has some configuration options. You can set these by adding the following to your metadata.json configuration file:

{
    "plugins": {
        "datasette-metasearch": {
            "database": "beta",
            "config_file": "datasette-metasearch.yml",
            "template_debug": true
        }
    }
}

The configuration settings for the plugin are:

  • database - the database file that contains your search index. If the file is beta.db you should set database to beta.
  • config_file - the YAML file containing your datasette-metasearch configuration.
  • template_debug - set this to true to enable debugging output if errors occur in your custom templates, see below.

Custom results display

Each indexed item type can define custom display HTML as part of the config.yml file. It can do this using a display key containing a fragment of Jinja template, and optionally a display_sql key with extra SQL to execute to fetch the data to display.

Here's how to define a custom display template for a tweet:

twitter.db:
    tweets:
        sql: |-
            select
                tweets.id as key,
                'Tweet by @' || users.screen_name as title,
                tweets.created_at as timestamp,
                tweets.full_text as search_1
            from tweets join users on tweets.user = users.id
        display: |-
            <p>{{ title }} - tweeted at {{ timestamp }}</p>
            <blockquote>{{ search_1 }}</blockquote>

This example reuses the values that were stored in the search_index table when the indexing query was run.

To load in extra values to display in the template, use a display_sql query like this:

twitter.db:
    tweets:
        sql: |-
            select
                tweets.id as key,
                'Tweet by @' || users.screen_name as title,
                tweets.created_at as timestamp,
                tweets.full_text as search_1
            from tweets join users on tweets.user = users.id
        display_sql: |-
            select
                users.screen_name,
                tweets.full_text,
                tweets.created_at
            from
                tweets join users on tweets.user = users.id
            where
                tweets.id = :key
        display: |-
            <p>{{ display.screen_name }} - tweeted at {{ display.created_at }}</p>
            <blockquote>{{ display.full_text }}</blockquote>

The display_sql query will be executed for every search result, passing the key value from the search_index table as the :key parameter and the user's search term as the :q parameter.

This performs well because many small queries are efficient in SQLite.
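The parameter binding works like standard SQLite named parameters. Here is a minimal sketch of the per-result lookup, simplified from what the plugin actually does, using Python's sqlite3 module and illustrative data:

```python
import sqlite3

# Set up a tiny in-memory copy of the tweets/users schema from the example
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table users (id integer primary key, screen_name text);
create table tweets (id integer primary key, user integer,
                     full_text text, created_at text);
insert into users values (1, 'example');
insert into tweets values (42, 1, 'hello world', '2020-09-02T21:00:21');
""")

display_sql = """
select users.screen_name, tweets.full_text, tweets.created_at
from tweets join users on tweets.user = users.id
where tweets.id = :key
"""

# :key comes from the search_index row; :q (the search term) is also
# available to the query even when, as here, it is unused
row = conn.execute(display_sql, {"key": 42, "q": "hello"}).fetchone()
print(row)  # ('example', 'hello world', '2020-09-02T21:00:21')
```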

If an error occurs while rendering one of your templates the search results page will return a 500 error. You can use the template_debug configuration setting described above to instead output debugging information for the search results item that experienced the error.

Displaying maps

This plugin will eventually include a number of useful shortcuts for rendering interesting content.

The first available shortcut is for displaying maps. Make your custom content output something like this:

<div
    data-map-latitude="{{ display.latitude }}"
    data-map-longitude="{{ display.longitude }}"
    style="display: none; float: right; width: 250px; height: 200px; background-color: #ccc;"
></div>

JavaScript on the page will look for any elements with data-map-latitude and data-map-longitude and, if it finds any, will load Leaflet and convert those elements into maps centered on that location. The default zoom level will be 12, or you can set a data-map-zoom attribute to customize this.
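For example, to zoom out to a wider view, add the data-map-zoom attribute to the same element:

<div
    data-map-latitude="{{ display.latitude }}"
    data-map-longitude="{{ display.longitude }}"
    data-map-zoom="9"
    style="display: none; float: right; width: 250px; height: 200px; background-color: #ccc;"
></div>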

Development

To set up this plugin locally, first checkout the code. Then create a new virtual environment:

cd datasette-metasearch
python3 -m venv venv
source venv/bin/activate

Or if you are using pipenv:

pipenv shell

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest
