Utilities for building and querying an Unsplash-style OpenSearch index

unsplash-lite-dataset-api

Utilities for exporting Unsplash-style photo metadata from Postgres into an OpenSearch index and querying it programmatically.

Features

  • Environment-driven configuration helpers for Postgres and OpenSearch clients.
  • Document extraction utilities that assemble rich photo documents ready for indexing.
  • Index management helpers with synonym-aware analyzers and bulk ingestion support.
  • Query helpers for end-user search flows, including color filters and keyword boosting.
  • A CLI (unsplash-lite-dataset-api) for end-to-end ingestion using your configured environment.
  • Optional tools for generating large synonym lists from the NLTK WordNet corpus.

Installation

pip install .

The package requires Python 3.9 or later. Installing in editable mode during development is also supported:

pip install -e ".[dev]"

The [dev] extra installs pytest for running the included tests.

Configuration

Set the following environment variables (values in a .env file are picked up automatically):

  • PG_HOST, PG_PORT, PG_DB, PG_USER, PG_PASSWORD
  • OPENSEARCH_HOST, OPENSEARCH_PORT, OPENSEARCH_USE_SSL, OPENSEARCH_VERIFY_CERTS, OPENSEARCH_REGION
  • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_SESSION_TOKEN
  • Optional: OPENSEARCH_CONNECT_TIMEOUT
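To illustrate how environment-driven configuration of this kind usually works, here is a minimal sketch that collects the OPENSEARCH_* variables into a dictionary. It is not the package's own loader: the helper names read_opensearch_env and _as_bool, the accepted boolean spellings, and every default value are assumptions made for the example.

```python
import os
from typing import Mapping, Optional


def _as_bool(value: Optional[str], default: bool) -> bool:
    """Interpret common truthy strings; fall back to the default when unset."""
    if value is None or value.strip() == "":
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def read_opensearch_env(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the OPENSEARCH_* variables into a plain dict (hypothetical helper)."""
    return {
        # All defaults below are assumptions, not documented package behaviour.
        "host": env.get("OPENSEARCH_HOST", "localhost"),
        "port": int(env.get("OPENSEARCH_PORT", "9200")),
        "use_ssl": _as_bool(env.get("OPENSEARCH_USE_SSL"), True),
        "verify_certs": _as_bool(env.get("OPENSEARCH_VERIFY_CERTS"), True),
        "region": env.get("OPENSEARCH_REGION"),
        "connect_timeout": float(env.get("OPENSEARCH_CONNECT_TIMEOUT", "10")),
    }
```

Passing a mapping explicitly, rather than relying on os.environ, makes the parsing easy to check in isolation. In real use, load_opensearch_config from the package is the supported entry point.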

Command-line usage

The unsplash-lite-dataset-api CLI provides a subcommand for each major operation: index, search, extract, synonyms, create-index, and delete-index.

Indexing

Generate (or supply) a synonyms file and run the indexer:

unsplash-lite-dataset-api index \
  --synonyms-path ./synonyms.txt \
  --index-name unsplash_photos \
  --batch-size 500
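The --batch-size option suggests that documents are sent to OpenSearch in fixed-size groups rather than one at a time. The sketch below shows one common way to split a stream of documents into such groups. The batched helper is hypothetical and may differ from what the indexer does internally.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls the next batch_size items without loading the whole stream.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

With 1,200 documents and a batch size of 500, this yields groups of 500, 500, and 200. Larger batches mean fewer round trips but larger request bodies.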

Searching

Search the index:

unsplash-lite-dataset-api search \
  --index-name unsplash_photos \
  --query-text "blue ocean sunset" \
  --size 10

For pagination, use --from to specify the offset:

unsplash-lite-dataset-api search \
  --index-name unsplash_photos \
  --query-text "blue ocean sunset" \
  --size 10 \
  --from 20
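Assuming the offset counts results from zero, as OpenSearch's own from parameter does, the value for a given page can be derived from the page number and page size. The helper below is a hypothetical illustration of that arithmetic, not part of the package.

```python
def page_offset(page: int, size: int) -> int:
    """Return the zero-based result offset for a one-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must both be at least 1")
    return (page - 1) * size
```

Under that assumption, --size 10 --from 20 in the command above corresponds to page 3, since page_offset(3, 10) is 20.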

Extracting documents

Extract photo documents from Postgres:

unsplash-lite-dataset-api extract --output photos.json

Generating synonyms

Generate a synonyms file from WordNet:

unsplash-lite-dataset-api synonyms --output ./synonyms.txt --include-hyponyms

Index management

Create an empty index:

unsplash-lite-dataset-api create-index --synonyms-path ./synonyms.txt
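For readers unfamiliar with synonym-aware analyzers, the sketch below builds an OpenSearch index body that wires a list of synonym rules into a custom analyzer. It is only an illustration of the general technique: the function name build_index_body, the filter and analyzer names, and the description field are assumptions, and the settings that create-index actually applies may differ.

```python
from typing import List


def build_index_body(synonyms: List[str]) -> dict:
    """Return index settings and mappings with a synonym filter (hypothetical)."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Each rule uses the Solr-style format, e.g. "sea, ocean".
                    "photo_synonyms": {"type": "synonym", "synonyms": synonyms},
                },
                "analyzer": {
                    "photo_text": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first so rules match regardless of case.
                        "filter": ["lowercase", "photo_synonyms"],
                    },
                },
            },
        },
        "mappings": {
            "properties": {
                # Field name is an assumption for the example.
                "description": {"type": "text", "analyzer": "photo_text"},
            },
        },
    }
```

With the opensearch-py client, a body like this could be passed to client.indices.create(index="unsplash_photos", body=build_index_body(synonyms)).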

Delete an index:

unsplash-lite-dataset-api delete-index --index-name unsplash_photos

For backwards compatibility, you can still run:

python -m main_index

This entry point now delegates to the CLI's index command, using the synonyms.txt file located next to the script.

Library usage

from unsplash_lite_dataset_api import (
    load_postgres_config,
    load_opensearch_config,
    create_pg_connection,
    create_opensearch_client,
    generate_documents,
    load_synonyms_from_file,
    build_index,
)

# Both loaders read the environment variables listed under Configuration.
pg_cfg = load_postgres_config()
os_cfg = load_opensearch_config()

with create_pg_connection(pg_cfg) as pg_conn:
    os_client = create_opensearch_client(os_cfg)
    synonyms = load_synonyms_from_file("./synonyms.txt")
    # Read photo documents over pg_conn and index them into unsplash_photos.
    build_index(
        client=os_client,
        conn=pg_conn,
        index_name="unsplash_photos",
        synonyms=synonyms,
    )

For searching:

from unsplash_lite_dataset_api import (
    create_opensearch_client,
    load_opensearch_config,
    search_images,
)

client = create_opensearch_client(load_opensearch_config())
results = search_images(
    client,
    index_name="unsplash_photos",
    query_text="blue ocean sunset",
    size=10,
    from_=20,  # For pagination
)
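To walk through several pages of results, the size and from_ arguments can be advanced in a loop. The sketch below keeps that loop independent of the package by taking a fetch_page callable. Both iter_pages and fetch_page are hypothetical names. Because the return shape of search_images is not described here, the callable is assumed to return a plain list of hits.

```python
from typing import Callable, Iterator, List


def iter_pages(
    fetch_page: Callable[[int, int], List[dict]],
    size: int = 10,
    max_pages: int = 5,
) -> Iterator[List[dict]]:
    """Yield successive pages until one comes back empty or short."""
    for page in range(max_pages):
        # The offset passed as from_ grows by one page size per iteration.
        hits = fetch_page(page * size, size)
        if not hits:
            break
        yield hits
        if len(hits) < size:
            # A short page means there is nothing left to fetch.
            break
```

In practice, fetch_page would wrap a call such as search_images(client, index_name="unsplash_photos", query_text="blue ocean sunset", size=size, from_=offset) and extract the hit list from whatever that call returns. The max_pages cap guards against very deep pagination, which OpenSearch limits by default.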

Synonym generation

Use the WordNet helpers to build a synonyms file when you do not already have one:

from pathlib import Path
from unsplash_lite_dataset_api import generate_wordnet_synonyms_file

target = Path("./synonyms.txt")
generate_wordnet_synonyms_file(target)

Ensure the NLTK wordnet and omw-1.4 corpora are installed locally. If they are missing, the helper raises a detailed WordnetInitializationError describing how to fix the environment.
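One way to check for the corpora up front, and to download any that are missing, is sketched below using NLTK's nltk.data.find and nltk.download functions. The helper names missing_corpora and ensure_corpora are assumptions for this example and are not part of the package. NLTK is imported lazily so the check itself can be exercised without it.

```python
from typing import Callable, Iterable, List

REQUIRED_CORPORA = ("wordnet", "omw-1.4")


def missing_corpora(names: Iterable[str], find: Callable[[str], object]) -> List[str]:
    """Return the corpus names for which the lookup function raises LookupError."""
    missing = []
    for name in names:
        try:
            find(f"corpora/{name}")
        except LookupError:
            missing.append(name)
    return missing


def ensure_corpora(names: Iterable[str] = REQUIRED_CORPORA) -> None:
    """Download any required corpora that NLTK cannot find locally."""
    import nltk  # imported here so the module loads even without NLTK installed

    for name in missing_corpora(names, nltk.data.find):
        nltk.download(name)
```

Calling ensure_corpora() once before generate_wordnet_synonyms_file should avoid the WordnetInitializationError on a fresh machine, provided the machine has network access for the download.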

Testing

Run the unit tests with:

pytest

The tests cover the search query builder and synonym loader utilities.
