Semantic related-article indexer and generator for static blogs.
Blog similarity index
blogsimi builds semantically related article recommendations for static blogs.
It walks your rendered site, turns the content into embeddings, stores them in
PostgreSQL via pgvector, and then exports
a recommendation JSON that you can drop straight into Jekyll or any static site
generator. The design and rationale are described in a blog post.
Features
- Extracts rendered HTML, strips boilerplate (extracts only specific content ids or classes) and chunks content into embedding-friendly blocks.
- Supports Ollama (default) and OpenAI embedding providers; switching providers is a config change and requires a rebuild of the index.
- Persists embeddings, metadata, and recommendations in PostgreSQL and finds neighbors with pgvector distance queries. Pages are only re-indexed when their content has changed.
- Exposes all functionality through a single, simple CLI.
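The chunking step can be pictured as a sliding window over the page's tokens. The following is a minimal sketch, not blogsimi's actual implementation: the function names `chunk_tokens` and `chunk_text` are hypothetical, and whitespace splitting stands in for whatever tokenizer the package really uses. The default values mirror the `max_tokens` and `overlap_tokens` settings from the configuration.

```python
def chunk_tokens(tokens, max_tokens=800, overlap_tokens=100):
    """Split a token list into windows of at most max_tokens items.

    Consecutive windows share overlap_tokens items so that text cut at a
    boundary still appears with some context in the next chunk.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")

    step = max_tokens - overlap_tokens
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end
        start += step
    return chunks


def chunk_text(text, max_tokens=800, overlap_tokens=100):
    # Assumption: whitespace-separated words approximate model tokens.
    words = text.split()
    return [" ".join(c) for c in chunk_tokens(words, max_tokens, overlap_tokens)]
```

Each returned chunk would then be embedded separately; how blogsimi aggregates chunk embeddings per page is not described here.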
Installation
The project can be installed from PyPI:
pip install blogsimi # from PyPI
# or
pip install . # from a local checkout
Configuration
Configuration lives in ~/.config/blogsimilarity.cfg by default (overridable
with --config). The file is JSON and mirrors the defaults baked into the package:
{
  "site_root": "_site",
  "data_out": "_data/related.json",
  "exclude_globs": ["tags/**", "drafts/**", "private/**", "admin/**"],
  "content_ids": ["content"],
  "neighbors": {
    "ksample": 16,
    "k": 8,
    "temperature": 0.7,
    "pin_top": true,
    "seed": null,
    "seealso": 4
  },
  "chunk": {
    "max_tokens": 800,
    "overlap_tokens": 100
  },
  "embedding": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "ollama_url": "http://127.0.0.1:11434/api/embeddings",
    "openai_api_base": "https://api.openai.com/v1/embeddings",
    "openai_api_key_env": "OPENAI_API_KEY"
  },
  "db": {
    "host": "127.0.0.1",
    "port": 5432,
    "user": "blog",
    "password": "blog",
    "dbname": "blog"
  },
  "strip_image_hosts": null
}
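Because the file mirrors the built-in defaults, a user file plausibly only needs to list the keys it wants to change. The sketch below shows one way such a layering could work; it is an assumption, not a description of how blogsimi loads its configuration. `DEFAULTS` is an abbreviated copy of the values above, and `deep_merge` and `load_config` are hypothetical names.

```python
import copy
import json
from pathlib import Path

# Abbreviated copy of the documented defaults (illustration only).
DEFAULTS = {
    "site_root": "_site",
    "chunk": {"max_tokens": 800, "overlap_tokens": 100},
    "embedding": {"provider": "ollama", "model": "nomic-embed-text"},
}


def deep_merge(base, override):
    """Return a new dict with override layered over base, recursing into
    nested dicts so a partial section does not wipe out its siblings."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def load_config(path="~/.config/blogsimilarity.cfg", defaults=DEFAULTS):
    """Read the JSON config file if it exists and layer it over defaults."""
    cfg_path = Path(path).expanduser()
    if not cfg_path.exists():
        return copy.deepcopy(defaults)
    with cfg_path.open(encoding="utf-8") as fh:
        return deep_merge(defaults, json.load(fh))
```

With this approach, a file containing only `{"embedding": {"provider": "openai"}}` would switch the provider while keeping every other default.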
Set OPENAI_API_KEY (or the environment variable you configure) when using the
OpenAI provider. Ensure your PostgreSQL instance has the pgvector extension
installed. You have to manually enable the extension as a superuser in the database:
-- as a PostgreSQL superuser
CREATE ROLE blog LOGIN PASSWORD 'blog';
CREATE DATABASE blog OWNER blog;
\c blog
CREATE EXTENSION IF NOT EXISTS vector; -- requires superuser or appropriate privileges
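Once the extension is enabled, nearest-neighbor lookups can use pgvector's distance operators, for example `<=>` for cosine distance. The sketch below builds such a query; it is not the query blogsimi runs. The table name `chunks`, the columns `path` and `embedding`, and both function names are hypothetical.

```python
def to_vector_literal(embedding):
    """Format a sequence of numbers in pgvector's text form, e.g. [1.0,0.5]."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def nearest_neighbors_sql(table="chunks", column="embedding"):
    """Build a parameterized cosine-distance query.

    Identifiers cannot be bound as query parameters, so they are checked
    before being interpolated. Values stay as %s placeholders for the driver.
    """
    for name in (table, column):
        if not name.isidentifier():
            raise ValueError(f"unsafe SQL identifier: {name!r}")
    return (
        f"SELECT path, {column} <=> %s::vector AS distance "
        f"FROM {table} "
        "WHERE path <> %s "
        "ORDER BY distance "
        "LIMIT %s"
    )


# Usage with a DB-API driver such as psycopg (not executed here):
#   cur.execute(nearest_neighbors_sql(),
#               (to_vector_literal(query_vec), current_path, 16))
```

Passing the vector as a text literal cast to `vector` avoids needing a driver-specific type adapter, at the cost of some parsing overhead on the server.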
CLI Usage
All functionality is exposed via the blogsimi command:
- blogsimi initdb – create the required tables and infer the embedding dimension from your provider. Note that the VECTOR extension must already be enabled.
- blogsimi resetdb – drop and recreate the tables (useful when switching embedding dimensions).
- blogsimi index [--page PATH] – walk the rendered site (defaults to site_root), compute embeddings where content changed, and persist them.
- blogsimi genrel [--out PATH] – produce the recommendation JSON ready for your static site.
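The "compute embeddings where content changed" behavior of the index step can be approximated by comparing a content digest against the one stored from the previous run. This is a sketch under assumptions: the README does not say how blogsimi detects changes, and `content_digest` and `pages_to_reindex` are hypothetical names.

```python
import hashlib


def content_digest(text):
    """Hash the extracted text, ignoring differences in whitespace only."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def pages_to_reindex(pages, stored_digests):
    """Return (path, digest) pairs for pages that are new or have changed.

    pages maps a page path to its extracted text; stored_digests maps a
    page path to the digest recorded when it was last indexed.
    """
    changed = []
    for path, text in pages.items():
        digest = content_digest(text)
        if stored_digests.get(path) != digest:
            changed.append((path, digest))
    return changed
```

Only the returned pages would need a call to the embedding provider, which keeps repeated runs over a mostly unchanged site cheap.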
A typical run after rendering your blog might look like:
blogsimi index --page _site
blogsimi genrel --out _data/related.json
Development
The package is available on PyPI and can be installed with
pip install blogsimi
The repository uses a src/ layout. For local development, install in editable mode:
pip install -e .
PYTHONPATH=src python -m blogsimi.cli --help
License
This project is released under the MIT License.