Semantic related-article indexer and generator for static blogs.

Blog similarity index

blogsimi builds semantically related article recommendations for static blogs. It walks your rendered site, turns the content into embeddings, stores them in PostgreSQL via pgvector, and then exports a recommendations JSON file that you can drop straight into Jekyll or any other static site generator. The design and rationale are described in a blog post.

Features

  • Reads rendered HTML, strips boilerplate by keeping only the configured content ids or classes, and chunks the result into embedding-friendly blocks.
  • Supports Ollama (default) and OpenAI embedding providers; switching providers is a config change, but it requires rebuilding the index.
  • Persists embeddings, metadata, and recommendations in PostgreSQL and uses pgvector distance queries. Pages are only re-indexed when their content has changed.
  • Exposes everything through a very simple CLI.
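
The "only re-indexed when the content has changed" behaviour can be pictured as a digest comparison. The following is a minimal sketch, not blogsimi's actual code: it assumes change detection is done by comparing a SHA-256 digest of the extracted text against a stored digest, and the names content_digest and pages_to_reindex are hypothetical.

```python
import hashlib


def content_digest(text: str) -> str:
    """Return a stable SHA-256 hex digest of the extracted page text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pages_to_reindex(pages: dict[str, str], stored: dict[str, str]) -> list[str]:
    """List page paths whose current digest differs from the stored one.

    pages maps path -> extracted text; stored maps path -> digest saved
    during the previous run (assumption: one digest per page).
    New pages have no stored digest and are therefore always included.
    """
    return sorted(
        path
        for path, text in pages.items()
        if stored.get(path) != content_digest(text)
    )


if __name__ == "__main__":
    previous = {"a.html": content_digest("hello")}
    current = {"a.html": "hello", "b.html": "new post"}
    print(pages_to_reindex(current, previous))
```

Pages that come back from such a check are the only ones that need new embeddings, which keeps repeated runs cheap when a build touches few articles.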

Installation

The project can be installed from PyPI:

pip install blogsimi                # from PyPI
# or
pip install .                       # from a local checkout

Configuration

Configuration lives in ~/.config/blogsimilarity.cfg by default (overridable with --config). The file is JSON and mirrors the defaults baked into the package:

{
  "site_root": "_site",
  "data_out": "_data/related.json",
  "exclude_globs": ["tags/**", "drafts/**", "private/**", "admin/**"],
  "content_ids": ["content"],
  "neighbors": {
    "ksample": 16,
    "k": 8,
    "temperature": 0.7,
    "pin_top": true,
    "seed": null,
    "seealso": 4
  },
  "chunk": {
    "max_tokens": 800,
    "overlap_tokens": 100
  },
  "embedding": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "ollama_url": "http://127.0.0.1:11434/api/embeddings",
    "openai_api_base": "https://api.openai.com/v1/embeddings",
    "openai_api_key_env": "OPENAI_API_KEY"
  },
  "db": {
    "host": "127.0.0.1",
    "port": 5432,
    "user": "blog",
    "password": "blog",
    "dbname": "blog"
  },
  "strip_image_hosts": null
}
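
To illustrate how chunk.max_tokens and chunk.overlap_tokens interact, here is a minimal sliding-window sketch. It assumes a pre-tokenized list (for example whitespace-separated words as a stand-in for model tokens); how blogsimi actually tokenizes is not described here, and chunk_tokens is a hypothetical helper.

```python
def chunk_tokens(tokens, max_tokens=800, overlap_tokens=100):
    """Split tokens into windows of at most max_tokens items.

    Each window after the first starts overlap_tokens before the end of
    the previous one, so neighbouring chunks share some context.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")

    step = max_tokens - overlap_tokens
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once the current window has reached the end of the input.
        if start + max_tokens >= len(tokens):
            break
        start += step
    return chunks


if __name__ == "__main__":
    words = ("lorem ipsum " * 1000).split()  # 2000 stand-in tokens
    print([len(c) for c in chunk_tokens(words)])
```

With the defaults shown in the configuration, consecutive windows advance by 700 tokens and share 100.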

Set OPENAI_API_KEY (or whichever environment variable you name in openai_api_key_env) when using the OpenAI provider. Ensure your PostgreSQL instance has the pgvector extension installed. The extension must then be enabled manually, as a superuser, in the target database:

-- as a PostgreSQL superuser
CREATE ROLE blog LOGIN PASSWORD 'blog';
CREATE DATABASE blog OWNER blog;
\c blog
CREATE EXTENSION IF NOT EXISTS vector;  -- requires superuser or appropriate privileges
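
Once the extension is enabled, a nearest-neighbour lookup in pgvector is an ORDER BY on a distance operator. The sketch below only builds the SQL and parameters and does not connect to a database. The table name page_embeddings and the columns path and embedding are hypothetical, the %s placeholders follow psycopg conventions, and the choice of the cosine distance operator (<=>) is an assumption; the schema and metric blogsimi actually uses are not stated here.

```python
def to_vector_literal(values):
    """Format numbers in pgvector's text input form, e.g. [0.1,0.2,0.3]."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


def nearest_pages_query(query_embedding, limit=8):
    """Build a parameterized pgvector nearest-neighbour query.

    Returns (sql, params). Table and column names are illustrative only.
    """
    vec = to_vector_literal(query_embedding)
    sql = (
        "SELECT path, embedding <=> %s::vector AS distance "
        "FROM page_embeddings "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    return sql, (vec, vec, limit)


if __name__ == "__main__":
    sql, params = nearest_pages_query([0.1, 0.2, 0.3], limit=8)
    print(sql)
    print(params)
```

pgvector also offers <-> for Euclidean distance and <#> for negative inner product; swapping the operator in the sketch is enough to try a different metric.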

CLI Usage

All functionality is exposed via the blogsimi command:

  • blogsimi initdb – create the required tables and infer the embedding dimension from your provider. Note that the vector extension must already be enabled in the database.
  • blogsimi resetdb – drop and recreate the tables (useful when switching embedding dimensions).
  • blogsimi index [--page PATH] – walk the rendered site (defaults to site_root), compute embeddings where content changed, and persist them.
  • blogsimi genrel [--out PATH] – produce the recommendation JSON ready for your static site.

A typical run after rendering your blog might look like:

blogsimi index --page _site
blogsimi genrel --out _data/related.json
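
If the two steps should run from a build script rather than by hand, a thin wrapper is enough. This is a sketch under the assumption that blogsimi is on PATH and the site has already been rendered; build_commands and run_pipeline are hypothetical helper names, and only the subcommands and flags documented above are used.

```python
import subprocess


def build_commands(site_root="_site", data_out="_data/related.json"):
    """Return the two blogsimi invocations of a typical run, in order."""
    return [
        ["blogsimi", "index", "--page", site_root],
        ["blogsimi", "genrel", "--out", data_out],
    ]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in sequence, stopping at the first failure.

    check=True makes subprocess.run raise CalledProcessError on a
    non-zero exit status, so genrel never runs after a failed index.
    """
    for cmd in commands:
        runner(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(build_commands())
```

The runner parameter exists only so the sequence can be exercised without invoking the real CLI.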

Development

The package is available on PyPI and can be installed with:

pip install blogsimi

The repository uses a src/ layout. For local development, install in editable mode:

pip install -e .
PYTHONPATH=src python -m blogsimi.cli --help

License

This project is released under the MIT License.
