Semantic related-article indexer and generator for static blogs.

Blog similarity index

blogsimi builds semantically related article recommendations for static blogs. It walks your rendered site, turns the content into embeddings, stores them in PostgreSQL via pgvector, and then exports a recommendation JSON that you can drop straight into Jekyll or any static site generator. The design and rationale are described in a blog post.

Features

  • Reads the rendered HTML, keeps only the elements matching the configured content ids or classes (so navigation and other boilerplate are dropped), and chunks the text into embedding-friendly blocks.
  • Supports Ollama (default) and OpenAI embedding providers; switching providers is a config change and requires a rebuild of the index.
  • Persists embeddings, metadata, and recommendations in PostgreSQL with pgvector distance queries. Pages are only re-indexed when the content has changed.
  • A small CLI with four subcommands: initdb, resetdb, index, and genrel.
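
The chunking step mentioned above can be illustrated with a minimal sketch. It assumes a sliding window over tokens with a fixed overlap, using the max_tokens and overlap_tokens defaults from the configuration; whitespace splitting stands in for a real tokenizer, and the function names chunk_tokens and chunk_text are hypothetical, not part of blogsimi's API.

```python
def chunk_tokens(tokens, max_tokens=800, overlap_tokens=100):
    """Split a token list into windows of at most max_tokens,
    where consecutive windows share overlap_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")
    step = max_tokens - overlap_tokens
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once the current window already reaches the end of the input.
        if start + max_tokens >= len(tokens):
            break
        start += step
    return chunks


def chunk_text(text, max_tokens=800, overlap_tokens=100):
    # Assumption: whitespace-separated words approximate tokens.
    words = text.split()
    return [" ".join(c) for c in chunk_tokens(words, max_tokens, overlap_tokens)]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.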

Installation

The project can be installed from PyPI:

pip install blogsimi                # from PyPI
# or
pip install .                       # from a local checkout

Configuration

Configuration lives in ~/.config/blogsimilarity.cfg by default (overridable with --config). The file is JSON and mirrors the defaults baked into the package:

{
  "site_root": "_site",
  "data_out": "_data/related.json",
  "exclude_globs": ["tags/**", "drafts/**", "private/**", "admin/**"],
  "content_ids": ["content"],
  "neighbors": {
    "ksample": 16,
    "k": 8,
    "temperature": 0.7,
    "pin_top": true,
    "seed": null,
    "seealso": 4
  },
  "chunk": {
    "max_tokens": 800,
    "overlap_tokens": 100
  },
  "embedding": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "ollama_url": "http://127.0.0.1:11434/api/embeddings",
    "openai_api_base": "https://api.openai.com/v1/embeddings",
    "openai_api_key_env": "OPENAI_API_KEY"
  },
  "db": {
    "host": "127.0.0.1",
    "port": 5432,
    "user": "blog",
    "password": "blog",
    "dbname": "blog"
  },
  "strip_image_hosts": null
}
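
As an illustration of how a user file might be layered over built-in defaults, here is a minimal sketch. It assumes that a partial config is merged recursively into the defaults; whether blogsimi does this or expects a complete file is not stated here. DEFAULTS holds only a subset of the keys shown above, and deep_merge and load_config are hypothetical names.

```python
import copy
import json
from pathlib import Path

# Subset of the defaults listed above, for illustration only.
DEFAULTS = {
    "site_root": "_site",
    "data_out": "_data/related.json",
    "chunk": {"max_tokens": 800, "overlap_tokens": 100},
    "embedding": {"provider": "ollama", "model": "nomic-embed-text"},
}


def deep_merge(base, override):
    """Return a new dict: override values win, nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path=None):
    """Load a JSON config file and layer it over DEFAULTS."""
    cfg_path = Path(path) if path else Path.home() / ".config" / "blogsimilarity.cfg"
    if not cfg_path.exists():
        return copy.deepcopy(DEFAULTS)
    with cfg_path.open(encoding="utf-8") as fh:
        return deep_merge(DEFAULTS, json.load(fh))
```

With this approach a config file containing only {"chunk": {"max_tokens": 400}} would keep overlap_tokens at 100 and leave every other default untouched.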

When using the OpenAI provider, set OPENAI_API_KEY (or whichever environment variable you name in openai_api_key_env). Your PostgreSQL instance must have the pgvector extension installed, and the extension has to be enabled manually in the target database, which typically requires superuser privileges:

-- as a PostgreSQL superuser
CREATE ROLE blog LOGIN PASSWORD 'blog';
CREATE DATABASE blog OWNER blog;
\c blog
CREATE EXTENSION IF NOT EXISTS vector;  -- requires superuser or appropriate privileges

CLI Usage

All functionality is exposed via the blogsimi command:

  • blogsimi initdb – create the required tables and infer the embedding dimension from your provider. The pgvector extension (vector) must already be enabled in the database.
  • blogsimi resetdb – drop and recreate the tables (useful when switching embedding dimensions).
  • blogsimi index [--page PATH] – walk the rendered site (defaults to site_root), compute embeddings where content changed, and persist them.
  • blogsimi genrel [--out PATH] – produce the recommendation JSON ready for your static site.
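
The "only where content changed" behaviour of the index step can be sketched as follows. This is a minimal illustration that assumes change detection compares a hash of the extracted text against a previously stored hash; the actual mechanism in blogsimi may differ, and content_hash and pages_to_reindex are hypothetical names.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a page's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pages_to_reindex(pages, stored_hashes):
    """Return the paths whose current text differs from the stored fingerprint.

    pages: dict mapping page path -> extracted text
    stored_hashes: dict mapping page path -> hash saved at the last index run
    """
    changed = []
    for path, text in pages.items():
        # New pages have no stored hash, so they always count as changed.
        if stored_hashes.get(path) != content_hash(text):
            changed.append(path)
    return sorted(changed)
```

Skipping unchanged pages matters mostly because embedding calls are the slow (and, with a hosted provider, billable) part of a run.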

A typical run after rendering your blog might look like:

blogsimi index --page _site
blogsimi genrel --out _data/related.json

Development

The released package is available on PyPI and can be installed with:

pip install blogsimi

The repository uses a src/ layout. For local development, install in editable mode:

pip install -e .
PYTHONPATH=src python -m blogsimi.cli --help

License

This project is released under the MIT License.
