Utilities for building and querying an Unsplash-style OpenSearch index
unsplash-lite-dataset-api
Utilities for exporting Unsplash-style photo metadata from Postgres into an OpenSearch index and querying it programmatically.
Features
- Environment-driven configuration helpers for Postgres and OpenSearch clients.
- Document extraction utilities that assemble rich photo documents ready for indexing.
- Index management helpers with synonym-aware analyzers and bulk ingestion support.
- Query helpers for end-user search flows, including color filters and keyword boosting.
- A CLI (files-unsplash-index) for end-to-end ingestion using your configured environments.
- Optional tools for generating large synonym lists from the NLTK WordNet corpus.
Installation
pip install .
The package requires Python 3.9 or later. Installing in editable mode during development is also supported:
pip install -e .[dev]
The [dev] extra installs pytest for running the included tests.
Configuration
Set the following environment variables (a .env file is supported automatically):
- PG_HOST, PG_PORT, PG_DB, PG_USER, PG_PASSWORD
- OPENSEARCH_HOST, OPENSEARCH_PORT, OPENSEARCH_USE_SSL, OPENSEARCH_VERIFY_CERTS, OPENSEARCH_REGION
- AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, optional AWS_SESSION_TOKEN
- Optional: OPENSEARCH_CONNECT_TIMEOUT
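As a rough illustration of how the OpenSearch variables above might be consumed, the sketch below reads them from the environment into a small dataclass. The `OpenSearchSettings` class, the `read_opensearch_settings` function, and every default value are assumptions made for this example; the package's own `load_opensearch_config` may parse and default these variables differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OpenSearchSettings:
    """Hypothetical container; not the package's actual config type."""

    host: str
    port: int
    use_ssl: bool
    verify_certs: bool
    region: Optional[str]
    connect_timeout: Optional[float]


def _as_bool(value: str) -> bool:
    # Treat common truthy spellings as True; anything else is False.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def read_opensearch_settings(env: Mapping[str, str] = os.environ) -> OpenSearchSettings:
    """Build settings from a mapping of environment variables."""
    timeout = env.get("OPENSEARCH_CONNECT_TIMEOUT")
    return OpenSearchSettings(
        host=env["OPENSEARCH_HOST"],  # required: raises KeyError if missing
        port=int(env.get("OPENSEARCH_PORT", "9200")),  # default is an assumption
        use_ssl=_as_bool(env.get("OPENSEARCH_USE_SSL", "true")),
        verify_certs=_as_bool(env.get("OPENSEARCH_VERIFY_CERTS", "true")),
        region=env.get("OPENSEARCH_REGION"),
        connect_timeout=float(timeout) if timeout else None,
    )
```

Passing the mapping as an argument, rather than reading `os.environ` directly inside the function, keeps the parsing easy to exercise in tests without touching the real environment.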
Command-line usage
The unsplash-lite-dataset-api CLI provides subcommands for all major operations.
Indexing
Generate (or supply) a synonyms file and run the indexer:
unsplash-lite-dataset-api index \
    --synonyms-path ./synonyms.txt \
    --index-name unsplash_photos \
    --batch-size 500
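The --batch-size flag suggests that documents are sent to OpenSearch in groups rather than one at a time. How the indexer batches internally is not documented here, so the following is only a minimal sketch of the general idea: the `batched` helper is a hypothetical name, not part of this package.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls the next batch_size items without loading everything.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because the helper consumes its input lazily, it works with a generator of documents streamed from a database cursor as well as with an in-memory list.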
Searching
Search the index:
unsplash-lite-dataset-api search \
    --index-name unsplash_photos \
    --query-text "blue ocean sunset" \
    --size 10
For pagination, use --from to specify the offset:
unsplash-lite-dataset-api search \
    --index-name unsplash_photos \
    --query-text "blue ocean sunset" \
    --size 10 \
    --from 20
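If you think in page numbers rather than offsets, the conversion is simple arithmetic. The sketch below assumes that --from is a zero-based offset, as the `from` parameter is in OpenSearch itself; `page_to_offset` is a hypothetical helper, not something the package provides.

```python
def page_to_offset(page: int, size: int) -> int:
    """Convert a 1-based page number into a zero-based result offset."""
    if page < 1:
        raise ValueError("page must be at least 1")
    if size < 1:
        raise ValueError("size must be at least 1")
    return (page - 1) * size
```

Under that assumption, the command above (size 10, offset 20) corresponds to the third page of results.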
Extracting documents
Extract photo documents from Postgres:
unsplash-lite-dataset-api extract --output photos.json
Generating synonyms
Generate a synonyms file from WordNet:
unsplash-lite-dataset-api synonyms --output ./synonyms.txt --include-hyponyms
Index management
Create an empty index:
unsplash-lite-dataset-api create-index --synonyms-path ./synonyms.txt
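To make "synonym-aware analyzers" more concrete, here is a sketch of the kind of index body that wires a synonym token filter into a custom analyzer, using standard OpenSearch analysis settings. The filter name `photo_synonyms`, the analyzer name `photo_text`, the `description` field, and the `build_index_body` function are all assumptions for illustration; the settings that create-index actually applies may differ.

```python
from typing import Any, Dict, List


def build_index_body(synonyms: List[str]) -> Dict[str, Any]:
    """Return an index body whose text analyzer expands the given synonyms."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Each entry is one synonym rule, e.g. "sea, ocean".
                    "photo_synonyms": {"type": "synonym", "synonyms": synonyms},
                },
                "analyzer": {
                    "photo_text": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first so rules match regardless of case.
                        "filter": ["lowercase", "photo_synonyms"],
                    },
                },
            },
        },
        "mappings": {
            "properties": {
                # Hypothetical field name; the real mapping is not shown here.
                "description": {"type": "text", "analyzer": "photo_text"},
            },
        },
    }
```

A body like this could be passed to an OpenSearch client's index creation call.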
Delete an index:
unsplash-lite-dataset-api delete-index --index-name unsplash_photos
For backwards compatibility, you can still run:
python -m main_index
which now delegates to the CLI's index command using synonyms.txt located next to the script.
Library usage
from unsplash_lite_dataset_api import (
    load_postgres_config,
    load_opensearch_config,
    create_pg_connection,
    create_opensearch_client,
    generate_documents,
    load_synonyms_from_file,
    build_index,
)

pg_cfg = load_postgres_config()
os_cfg = load_opensearch_config()

with create_pg_connection(pg_cfg) as pg_conn:
    os_client = create_opensearch_client(os_cfg)
    synonyms = load_synonyms_from_file("./synonyms.txt")
    build_index(
        client=os_client,
        conn=pg_conn,
        index_name="unsplash_photos",
        synonyms=synonyms,
    )
For searching:
from unsplash_lite_dataset_api import create_opensearch_client, load_opensearch_config, search_images

client = create_opensearch_client(load_opensearch_config())
results = search_images(
    client,
    index_name="unsplash_photos",
    query_text="blue ocean sunset",
    size=10,
    from_=20,  # For pagination
)
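The feature list mentions color filters and keyword boosting. The query that `search_images` builds is not shown in this README, so the sketch below only illustrates how such a request body could look in the OpenSearch query DSL. The field names `keywords`, `description`, and `colors`, the boost factor, and the `build_search_body` function are assumptions, not the package's actual schema or API.

```python
from typing import Any, Dict, Optional


def build_search_body(
    query_text: str,
    size: int = 10,
    from_: int = 0,
    color: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a bool query with boosted keyword matching and an optional color filter."""
    bool_query: Dict[str, Any] = {
        "must": [
            {
                "multi_match": {
                    "query": query_text,
                    # "^3" weights matches in the keywords field three times higher.
                    "fields": ["keywords^3", "description"],
                }
            }
        ]
    }
    if color is not None:
        # Filters restrict the result set without affecting relevance scores.
        bool_query["filter"] = [{"term": {"colors": color}}]
    return {"query": {"bool": bool_query}, "size": size, "from": from_}
```

Placing the color constraint in the `filter` clause rather than in `must` keeps it out of scoring, so ranking depends only on the text match.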
Synonym generation
Use the WordNet helpers to build a synonyms file when you do not already have one:
from pathlib import Path
from unsplash_lite_dataset_api import generate_wordnet_synonyms_file
target = Path("./synonyms.txt")
generate_wordnet_synonyms_file(target)
Ensure the NLTK wordnet and omw-1.4 corpora are installed locally. If they are missing, the helper raises a detailed WordnetInitializationError describing how to fix the environment.
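The layout of the generated synonyms file is not described in this README. Assuming it follows the common Solr-style convention that OpenSearch synonym filters accept (one comma-separated rule per line, with `#` starting a comment), a loader could be sketched as follows. `parse_synonym_lines` is a hypothetical helper and may not match what `load_synonyms_from_file` actually does.

```python
from typing import Iterable, List


def parse_synonym_lines(lines: Iterable[str]) -> List[str]:
    """Return cleaned synonym rules, skipping blank lines and # comments."""
    rules: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Normalize spacing around commas: "sea ,ocean" becomes "sea, ocean".
        terms = [term.strip() for term in line.split(",") if term.strip()]
        if len(terms) > 1:  # a rule needs at least two terms to be useful
            rules.append(", ".join(terms))
    return rules
```

To read rules from disk under the same assumption, open the file and pass the file object directly, since iterating over it yields lines.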
Testing
Run the unit tests with:
pytest
The tests cover the search query builder and synonym loader utilities.
File details
Details for the file unsplash_lite_dataset_api-0.1.1.tar.gz.
File metadata
- Download URL: unsplash_lite_dataset_api-0.1.1.tar.gz
- Upload date:
- Size: 14.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8e1d056b5ad1602a566b6e754cf966941eebeec293108c25e62fb2a41b260603 |
| MD5 | e45634cfd32dc8e233de235ba468c946 |
| BLAKE2b-256 | a79228a8e63006dce4020491247cb3f70422df4117a289dd38fdb67472461326 |
File details
Details for the file unsplash_lite_dataset_api-0.1.1-py3-none-any.whl.
File metadata
- Download URL: unsplash_lite_dataset_api-0.1.1-py3-none-any.whl
- Upload date:
- Size: 15.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9357eab612a00ad576a2b9f2c83812cb697908d02b50359f8dec383d2834c00c |
| MD5 | 2600d59b36667bf771e8d3bad9d8ba68 |
| BLAKE2b-256 | 27f7ea25925f8c7ff6d4122bf18524bec3d17c92f12d0bef90c72d2c24d04f7c |