Utilities for building and querying an Unsplash-style OpenSearch index

unsplash-lite-dataset-api

Utilities for exporting Unsplash-style photo metadata from Postgres into an OpenSearch index and querying it programmatically.

Features

  • Environment-driven configuration helpers for Postgres and OpenSearch clients.
  • Document extraction utilities that assemble rich photo documents ready for indexing.
  • Index management helpers with synonym-aware analyzers and bulk ingestion support.
  • Query helpers for end-user search flows, including color filters and keyword boosting.
  • A CLI (unsplash-lite-dataset-api) for end-to-end ingestion using your configured environment.
  • Optional tools for generating large synonym lists from the NLTK WordNet corpus.

Installation

pip install .

The package requires Python 3.9 or later. Installing in editable mode during development is also supported:

pip install -e ".[dev]"

The [dev] extra installs pytest for running the included tests.

Configuration

Set the following environment variables (a .env file, if present, is loaded automatically):

  • PG_HOST, PG_PORT, PG_DB, PG_USER, PG_PASSWORD
  • OPENSEARCH_HOST, OPENSEARCH_PORT, OPENSEARCH_USE_SSL, OPENSEARCH_VERIFY_CERTS, OPENSEARCH_REGION
  • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, optional AWS_SESSION_TOKEN
  • Optional: OPENSEARCH_CONNECT_TIMEOUT
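
As a rough illustration of how the Postgres variables above might be consumed, the sketch below reads them from a mapping such as os.environ. PostgresSettings and read_postgres_settings are hypothetical names used only for this example, and the default port is an assumption; the package's own load_postgres_config may behave differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PostgresSettings:
    """Hypothetical container; not the package's own config type."""

    host: str
    port: int
    database: str
    user: str
    password: str


def read_postgres_settings(env: Mapping[str, str] = os.environ) -> PostgresSettings:
    """Build settings from PG_* variables, failing loudly on missing ones."""
    required = ("PG_HOST", "PG_DB", "PG_USER", "PG_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return PostgresSettings(
        host=env["PG_HOST"],
        port=int(env.get("PG_PORT", "5432")),  # assumed default: standard Postgres port
        database=env["PG_DB"],
        user=env["PG_USER"],
        password=env["PG_PASSWORD"],
    )
```

Passing the mapping explicitly keeps the function easy to test without touching the real process environment.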

Command-line usage

The unsplash-lite-dataset-api CLI provides subcommands for all major operations.

Indexing

Generate (or supply) a synonyms file and run the indexer:

unsplash-lite-dataset-api index \
  --synonyms-path ./synonyms.txt \
  --index-name unsplash_photos \
  --batch-size 500
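
The README does not spell out what --batch-size controls; presumably it is the number of documents sent per bulk request. Under that assumption, the batching step can be sketched as follows. The batched helper is a hypothetical name for this example, not part of the package.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most batch_size items (illustrative helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice consumes up to batch_size items without loading the whole input.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because the helper works on any iterable, documents streamed from a database cursor would not need to be held in memory all at once.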

Searching

Search the index:

unsplash-lite-dataset-api search \
  --index-name unsplash_photos \
  --query-text "blue ocean sunset" \
  --size 10

Extracting documents

Extract photo documents from Postgres:

unsplash-lite-dataset-api extract --output photos.json

Generating synonyms

Generate a synonyms file from WordNet:

unsplash-lite-dataset-api synonyms --output ./synonyms.txt --include-hyponyms

Index management

Create an empty index:

unsplash-lite-dataset-api create-index --synonyms-path ./synonyms.txt
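
For orientation, a synonym-aware index definition in OpenSearch generally looks something like the sketch below. The filter name, analyzer name, and the description field are assumptions made for this example; the body that create-index actually sends is not documented here and may differ.

```python
from typing import Any, Dict, List


def build_index_body(synonyms: List[str]) -> Dict[str, Any]:
    """Return an index body with a search-time synonym analyzer (illustrative only)."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Each entry is a rule such as "sea, ocean".
                    "photo_synonyms": {"type": "synonym_graph", "synonyms": synonyms},
                },
                "analyzer": {
                    "photo_search": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "photo_synonyms"],
                    },
                },
            }
        },
        "mappings": {
            "properties": {
                # Hypothetical field: synonyms are expanded at query time only.
                "description": {
                    "type": "text",
                    "analyzer": "standard",
                    "search_analyzer": "photo_search",
                },
            }
        },
    }
```

With an opensearch-py client, a body like this could be passed as client.indices.create(index="unsplash_photos", body=build_index_body(synonyms)).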

Delete an index:

unsplash-lite-dataset-api delete-index --index-name unsplash_photos

For backwards compatibility, you can still run:

python -m main_index

This now delegates to the CLI's index command, using the synonyms.txt file located next to the script.

Library usage

from unsplash_lite_dataset_api import (
    load_postgres_config,
    load_opensearch_config,
    create_pg_connection,
    create_opensearch_client,
    generate_documents,
    load_synonyms_from_file,
    build_index,
)

pg_cfg = load_postgres_config()
os_cfg = load_opensearch_config()

with create_pg_connection(pg_cfg) as pg_conn:
    os_client = create_opensearch_client(os_cfg)
    synonyms = load_synonyms_from_file("./synonyms.txt")
    build_index(
        client=os_client,
        conn=pg_conn,
        index_name="unsplash_photos",
        synonyms=synonyms,
    )
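
The synonyms file format is not described in this README. Assuming the common Solr-style layout (one comma-separated group per line, with # for comments), reading such a file could be sketched as below. parse_synonym_lines is a hypothetical helper for this example and is not the package's load_synonyms_from_file.

```python
from typing import Iterable, List


def parse_synonym_lines(lines: Iterable[str]) -> List[str]:
    """Turn raw lines into normalized synonym rules (assumed Solr-style format)."""
    rules: List[str] = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        terms = [term.strip().lower() for term in line.split(",") if term.strip()]
        # A single term has nothing to be a synonym of, so it is dropped.
        if len(terms) > 1:
            rules.append(", ".join(terms))
    return rules
```

A file could then be read with parse_synonym_lines(open("synonyms.txt", encoding="utf-8")).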

For searching:

from unsplash_lite_dataset_api import create_opensearch_client, load_opensearch_config, search_images

client = create_opensearch_client(load_opensearch_config())
results = search_images(
    client,
    index_name="unsplash_photos",
    query_text="blue ocean sunset",
    size=10,
)
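
To give a feel for what keyword boosting and a color filter look like in the OpenSearch query DSL, here is a minimal sketch of a request body. The field names (keywords, description, colors), the boost factor, and the build_search_body helper are all assumptions for illustration; the query that search_images builds internally may be quite different.

```python
from typing import Any, Dict, Optional


def build_search_body(
    query_text: str, size: int = 10, color: Optional[str] = None
) -> Dict[str, Any]:
    """Build a bool query with boosted keywords and an optional color filter."""
    bool_query: Dict[str, Any] = {
        "must": [
            {
                "multi_match": {
                    "query": query_text,
                    # "^3" weights keyword matches three times as heavily (assumed boost).
                    "fields": ["keywords^3", "description"],
                }
            }
        ]
    }
    if color:
        # Filters restrict results without affecting relevance scores.
        bool_query["filter"] = [{"term": {"colors": color}}]
    return {"size": size, "query": {"bool": bool_query}}
```

With an opensearch-py client, such a body could be sent as client.search(index="unsplash_photos", body=build_search_body("blue ocean sunset", color="blue")).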

Synonym generation

Use the WordNet helpers to build a synonyms file when you do not already have one:

from pathlib import Path
from unsplash_lite_dataset_api import generate_wordnet_synonyms_file

target = Path("./synonyms.txt")
generate_wordnet_synonyms_file(target)

Ensure the NLTK wordnet and omw-1.4 corpora are installed locally. If they are missing, the helper raises a detailed WordnetInitializationError describing how to fix the environment.

Testing

Run the unit tests with:

pytest

The tests cover the search query builder and synonym loader utilities.
