RAG-based article tag generator using local embeddings and FAISS

Article Tagger

RAG-based article tag generator with hybrid scoring, active learning, and full agent integration.

Input an article, get the most relevant tags from your tag library — with confidence scores. Supports Chinese (Traditional/Simplified) natively.

100% accuracy on 30 test articles across 611 tags.

Install

# Core (CLI only — lightweight, ~500MB with model)
pip install article-tagger

# Full (API server + Web UI + export + visualization)
pip install article-tagger[all]

# Pick what you need
pip install article-tagger[api]       # REST API + Web UI server
pip install article-tagger[llm]       # Claude API reranker
pip install article-tagger[mcp]       # MCP server
pip install article-tagger[viz]       # t-SNE/UMAP visualization

# npm (agent integration)
npm install article-tagger-sdk        # TypeScript SDK
npx article-tagger-mcp                # MCP server for Claude Code / Cursor

Quick Start

# Zero-config demo (creates sample tags, builds index, tags an article)
bash quickstart.sh

Or step by step:

# 1. Build index from tag library
article-tagger build-index --tags 51標籤庫.json

# 2. Tag an article
article-tagger tag --text "她頭上有角,背後長著翅膀,尾巴尖尖的..."

# 3. Interactive mode
article-tagger interactive

# 4. Pre-load model for faster queries
article-tagger warmup

How It Works

Three-layer hybrid scoring:

  1. Embedding search — sentence-transformers (multilingual) + FAISS vector similarity
  2. Keyword boost — exact/partial match across Simplified/Traditional Chinese variants
  3. Composite patterns — multi-feature detection (e.g., 角+翅膀 → 惡魔娘)
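
The three layers above can be sketched as a single scoring function. This is a minimal illustration, not the package's actual code: the tag dictionary layout (`vec`, `keywords`, `pattern`), the boost values, and the function names are all assumptions, and the real embedding search uses sentence-transformers and FAISS rather than a hand-rolled cosine.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_scores(text, text_vec, tags, keyword_boost=0.2, pattern_boost=0.3):
    """Rank tags by embedding similarity plus keyword and composite-pattern boosts.

    `tags` maps a tag name to a dict with hypothetical keys:
      "vec"      - the tag's embedding
      "keywords" - strings that earn a boost if any appears in the text
      "pattern"  - features that earn a boost only if all appear in the text
    """
    scores = {}
    for name, tag in tags.items():
        score = cosine(text_vec, tag["vec"])                    # layer 1: embedding
        if any(kw in text for kw in tag.get("keywords", [])):   # layer 2: keyword
            score += keyword_boost
        pattern = tag.get("pattern", [])
        if pattern and all(feat in text for feat in pattern):   # layer 3: composite
            score += pattern_boost
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy 2-D embeddings stand in for real model output.
tags = {
    "惡魔娘": {"vec": [1.0, 0.0], "keywords": ["惡魔"], "pattern": ["角", "翅膀"]},
    "天使": {"vec": [0.0, 1.0], "keywords": ["天使"]},
}
ranked = hybrid_scores("她頭上有角,背後長著翅膀", [0.6, 0.8], tags)
```

In this toy case the composite pattern lifts 惡魔娘 above a tag that is closer in embedding space, which is the kind of correction the third layer is meant to provide.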

Plus:

  • Active Learning — EMA-weighted feedback loop with dedup, converges over time
  • Tag hierarchy — child tag matched → ancestors auto-boosted (with cycle detection)
  • LLM reranker — optional Claude API second pass for precision
  • Text augmentation — S/T Chinese variants + concept expansion in embedding space
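
As an illustration of the tag hierarchy bullet, the sketch below walks child-to-parent links and stops when it meets a tag it has already visited. The `parent_of` mapping, the `decay` factor, and the parent tag names are hypothetical; the package's own boosting rule may differ.

```python
def ancestors(tag, parent_of):
    """Follow child -> parent links, stopping if a cycle is detected."""
    seen = {tag}
    chain = []
    current = parent_of.get(tag)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain

def boost_ancestors(scores, parent_of, decay=0.5):
    """Give each ancestor of a matched tag a decayed share of the child's score."""
    boosted = dict(scores)
    for tag, score in scores.items():
        share = score
        for ancestor in ancestors(tag, parent_of):
            share *= decay  # each level up receives a smaller share
            boosted[ancestor] = max(boosted.get(ancestor, 0.0), share)
    return boosted

# The last link deliberately closes a cycle to show that the walk still terminates.
parent_of = {"惡魔娘": "人外", "人外": "奇幻", "奇幻": "惡魔娘"}
boosted = boost_ancestors({"惡魔娘": 0.8}, parent_of)
```

Taking the maximum rather than summing keeps an ancestor with many matched children from outranking the children themselves; that is a design choice of this sketch, not a documented behavior.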

Five Interfaces

| Interface | Command | Use Case |
| --- | --- | --- |
| CLI | article-tagger <cmd> | Human workflow, scripting |
| Agent CLI | atag <cmd> | Pure JSON, agent chaining |
| REST API | article-tagger serve | Web apps, microservices |
| MCP Server | article-tagger mcp-serve | Claude Code, Cursor |
| Web UI | localhost:8000/ui | Visual dashboard (Gradio) |

CLI Commands

| Command | Description |
| --- | --- |
| build-index | Build FAISS index from tag file (CSV/JSON) |
| tag | Tag single article (--rerank for LLM reranking) |
| tag-batch | Batch tag a directory |
| interactive | Interactive REPL with live feedback |
| watch | Watch directory, auto-tag new files |
| discover | New: find new tag candidates from articles |
| warmup | New: pre-load model + index into memory |
| analytics | Usage stats dashboard |
| quality | Tag library quality report |
| visualize | t-SNE/UMAP embedding visualization |
| cooccur | Tag co-occurrence analysis |
| weights | Active Learning weights |
| enrich | LLM auto-fill tag descriptions/categories |
| eval | Evaluate tagging accuracy (precision, recall, F1, MAP, NDCG) |
| export | Multi-format export (CSV/Markdown/JSONL/Notion) |
| serve | Start REST API + Web UI |
| mcp-serve | Start MCP Server |
| state * | State management (export/import/incremental sync) |
| model * | Model management (list/switch/compare) |
| profile * | Config profiles (strict/broad/fast/custom) |
| daemon * | Daemon mode (start/stop/status/tag) |

Agent-Native CLI (atag)

Pure JSON output, designed for agent chaining:

# Tag
atag tag "article text..."
atag tag -f article.txt

# Batch
atag batch -d ./articles/

# Feedback loop
atag feedback "標籤名" true

# Pipe-friendly
echo "article" | atag tag | jq '.tags[0].tag_name'
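
The feedback command above feeds the EMA-weighted Active Learning loop described under How It Works. A minimal sketch of such an update follows; the class name, the `alpha` and `initial` values, and the dedup fingerprint are assumptions made for illustration, not the package's internals.

```python
import hashlib

class FeedbackWeights:
    """Per-tag weight updated by an exponential moving average of feedback."""

    def __init__(self, alpha=0.3, initial=0.5):
        self.alpha = alpha        # how strongly new feedback moves the weight
        self.initial = initial    # neutral starting weight for unseen tags
        self.weights = {}
        self._seen = set()        # fingerprints of feedback already applied

    def add(self, article, tag, correct):
        """Apply one feedback item; return False if it is an exact duplicate."""
        raw = f"{article}\x00{tag}\x00{correct}".encode("utf-8")
        key = hashlib.sha256(raw).hexdigest()
        if key in self._seen:
            return False
        self._seen.add(key)
        target = 1.0 if correct else 0.0
        old = self.weights.get(tag, self.initial)
        # EMA step: move a fraction alpha of the way toward the target.
        self.weights[tag] = (1 - self.alpha) * old + self.alpha * target
        return True

fw = FeedbackWeights(alpha=0.5)
fw.add("article one", "標籤名", True)   # weight moves from 0.5 to 0.75
fw.add("article one", "標籤名", True)   # duplicate, ignored
fw.add("article two", "標籤名", True)   # weight moves from 0.75 to 0.875
```

Because each step covers a fixed fraction of the remaining distance, consistent feedback pulls the weight toward 0 or 1 without ever overshooting, which is one way to read "converges over time".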

REST API

27 endpoints. Start the server with article-tagger serve.

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/tag | Tag single article |
| POST | /api/tag-batch | Batch tag (up to 100) |
| GET/POST/PUT/DELETE | /api/tags[/{id}] | Tag CRUD |
| POST | /api/build-index | Upload tag file, rebuild index |
| POST | /api/feedback | Submit feedback (with dedup) |
| GET | /api/feedback | Get feedback + stats |
| POST | /api/feedback/undo | Undo last feedback |
| DELETE | /api/feedback/{id} | Delete feedback |
| GET | /api/history | Tagging history |
| GET | /api/history/search | Search history |
| DELETE | /api/history/{id} | Delete history |
| GET | /api/export | Export history (CSV/JSON) |
| GET | /api/analytics | Analytics dashboard |
| GET | /api/quality | Tag quality report |
| GET | /api/cooccurrence | Co-occurrence pairs |
| GET | /api/cooccurrence/suggest | Tag suggestions |
| GET | /api/weights | Active Learning weights |
| GET | /api/profiles | Config profiles |
| GET | /api/health | Health check |
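
For callers without the SDK, a request to POST /api/tag can be built with the standard library alone. The sketch below is hedged: the JSON field names `text` and `top_k` are assumptions (the request schema is not documented here), authentication via TAGGER_API_KEY is not covered, and the helper names are hypothetical.

```python
import json
import urllib.request

def build_tag_request(text, top_k=3, base_url="http://localhost:8000"):
    """Build (but do not send) a POST /api/tag request.

    The body fields "text" and "top_k" are assumed, not confirmed by the API docs.
    """
    body = json.dumps({"text": text, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/tag",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def tag_article(text, top_k=3, base_url="http://localhost:8000", timeout=10):
    """Send the request to a running server and decode the JSON response."""
    request = build_tag_request(text, top_k, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting construction from sending makes the request easy to inspect or log before a server is available; `tag_article` only works once `article-tagger serve` is running.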

MCP Server (24 tools)

For Claude Code, Cursor, or any MCP-compatible agent:

{
  "mcpServers": {
    "article-tagger": {
      "command": "python",
      "args": ["-m", "article_tagger.mcp_server"]
    }
  }
}

Or via npm:

{
  "mcpServers": {
    "article-tagger": {
      "command": "npx",
      "args": ["article-tagger-mcp"]
    }
  }
}

Tools: tag_article, build_index, search_tags, list_tags, create_tag, update_tag, delete_tag, add_feedback, undo_feedback, delete_feedback, get_stats, get_weights, search_history, delete_history, export_history, get_analytics, get_quality_report, suggest_related_tags, discover_tags, export_state, import_state, switch_model, list_models, get_pipeline_info, list_profiles

TypeScript SDK

import { ArticleTaggerClient } from "article-tagger-sdk";

const client = new ArticleTaggerClient({ baseUrl: "http://localhost:8000" });

// Tag
const result = await client.tag("article text...", 5);
console.log(result.tags);

// Feedback
await client.addFeedback("article", "tag_name", true);

// CRUD
await client.createTag({ name: "新標籤", description: "描述", category: "分類" });
await client.updateTag("123", { description: "更新描述" });
await client.deleteTag("123");

// History & Analytics
const history = await client.searchHistory("keyword");
const analytics = await client.getAnalytics(30);

// Co-occurrence
const suggestions = await client.suggestTags(["標籤A", "標籤B"]);

Tag Discovery

Semi-automatic new tag candidate extraction:

# Scan articles for frequently mentioned concepts not in tag library
article-tagger discover ./articles/ --min-freq 3 --top 10

# JSON output for automation
article-tagger discover ./articles/ --json

Reports frequency, confidence score, source files, and similar existing tags.
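
The frequency and source-file parts of that report can be approximated as below. This is a simplified sketch under stated assumptions: terms are taken as already extracted per file, the function and key names are hypothetical, and the confidence score and similar-tag lookup are left out.

```python
from collections import Counter, defaultdict

def discover_candidates(articles, existing_tags, min_freq=3, top=10):
    """Return terms that recur across articles but are missing from the tag library.

    `articles` maps a file name to a list of terms already extracted from it.
    """
    freq = Counter()
    sources = defaultdict(set)
    known = set(existing_tags)
    for filename, terms in articles.items():
        for term in terms:
            if term not in known:
                freq[term] += 1
                sources[term].add(filename)
    candidates = [
        {"term": term, "frequency": count, "sources": sorted(sources[term])}
        for term, count in freq.most_common()   # most frequent first
        if count >= min_freq
    ]
    return candidates[:top]

articles = {
    "a.txt": ["dragon", "elf", "dragon"],
    "b.txt": ["dragon", "elf"],
    "c.txt": ["elf", "orc"],
}
found = discover_candidates(articles, existing_tags=["elf"], min_freq=3, top=10)
```

Here `min_freq` and `top` mirror the `--min-freq` and `--top` flags shown above; how the real command extracts terms from Chinese text is not specified in this document.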

Daemon Mode

Keep model in memory for millisecond-level tagging:

article-tagger daemon start
article-tagger daemon tag --text "..."   # ~10ms vs ~3s cold start
article-tagger daemon stop
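
The speed-up comes from loading the model once and answering many requests from the same process. The sketch below shows that pattern over a socket using newline-delimited JSON; the wire format, the function names, and the `tag_fn` stand-in for a loaded model are assumptions, and the real daemon's protocol may look quite different.

```python
import json
import socket
import threading

def serve_connection(conn, tag_fn):
    """Answer newline-delimited JSON requests on one connection until it closes.

    `tag_fn` stands in for an already-loaded model; it is called per request,
    so the expensive load happens only once, before serving starts.
    """
    with conn, conn.makefile("rb") as reader, conn.makefile("wb") as writer:
        for line in reader:
            request = json.loads(line)
            reply = {"tags": tag_fn(request["text"])}
            writer.write(json.dumps(reply).encode("utf-8") + b"\n")
            writer.flush()

def request_tags(conn, text):
    """Send one request and read back one reply line."""
    conn.sendall(json.dumps({"text": text}).encode("utf-8") + b"\n")
    with conn.makefile("rb") as reader:
        return json.loads(reader.readline())

# A connected socket pair keeps the demo self-contained; a daemon would
# instead listen on a socket path and accept clients in a loop.
server_end, client_end = socket.socketpair()
worker = threading.Thread(
    target=serve_connection,
    args=(server_end, lambda text: [text.upper()]),
    daemon=True,
)
worker.start()
reply = request_tags(client_end, "abc")
client_end.close()
worker.join(timeout=5)
```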

State Migration

Transfer everything (index, feedback, history, weights) between machines:

# Full export/import
article-tagger state export -o bundle.tar.gz
article-tagger state import -f bundle.tar.gz

# Incremental sync
article-tagger state export-incr -o incr.tar.gz --since 2026-03-01T00:00:00
article-tagger state merge -f incr.tar.gz
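
Conceptually, an incremental export keeps only records changed since a cutoff, and a merge folds them into the existing state. The following sketch assumes each record carries an `id` and an ISO-8601 `updated_at` field and that the newer record wins on conflict; those field names and the conflict rule are illustrative guesses, not the bundle's documented format.

```python
from datetime import datetime

def export_since(records, since):
    """Keep records whose `updated_at` is at or after the `since` timestamp."""
    cutoff = datetime.fromisoformat(since)
    return [r for r in records if datetime.fromisoformat(r["updated_at"]) >= cutoff]

def merge_records(base, incoming):
    """Merge by `id`; when both sides hold a record, the newer `updated_at` wins."""
    merged = {r["id"]: r for r in base}
    for record in incoming:
        current = merged.get(record["id"])
        if current is None or (
            datetime.fromisoformat(record["updated_at"])
            > datetime.fromisoformat(current["updated_at"])
        ):
            merged[record["id"]] = record
    return sorted(merged.values(), key=lambda r: r["id"])

source = [
    {"id": 1, "updated_at": "2026-02-20T09:00:00", "value": "old"},
    {"id": 2, "updated_at": "2026-03-05T12:00:00", "value": "changed"},
    {"id": 3, "updated_at": "2026-03-10T08:30:00", "value": "new"},
]
target = [
    {"id": 1, "updated_at": "2026-02-20T09:00:00", "value": "old"},
    {"id": 2, "updated_at": "2026-02-25T10:00:00", "value": "stale"},
]
incr = export_since(source, "2026-03-01T00:00:00")
synced = merge_records(target, incr)
```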

Evaluation

# Create benchmark template
article-tagger eval --create-template benchmark.json

# Run evaluation (precision@k, recall@k, F1, MAP, NDCG)
article-tagger eval -b benchmark.json --top-k 5
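
For reference, the metrics named above have standard definitions for a ranked tag list with binary relevance. The sketch below implements those textbook formulas; it is not taken from the package's evaluation module, and the function names are chosen for illustration. MAP is the mean of `average_precision` over all benchmark articles.

```python
import math

def precision_at_k(predicted, relevant, k):
    """Fraction of the top-k predictions that are relevant."""
    hits = sum(1 for tag in predicted[:k] if tag in relevant)
    return hits / k if k else 0.0

def recall_at_k(predicted, relevant, k):
    """Fraction of the relevant tags found in the top-k predictions."""
    hits = sum(1 for tag in predicted[:k] if tag in relevant)
    return hits / len(relevant) if relevant else 0.0

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def average_precision(predicted, relevant):
    """Mean of precision values taken at each rank where a relevant tag appears."""
    hits, total = 0, 0.0
    for rank, tag in enumerate(predicted, start=1):
        if tag in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def ndcg_at_k(predicted, relevant, k):
    """DCG of the top-k list divided by the DCG of an ideal ordering."""
    dcg = sum(
        1 / math.log2(rank + 1)
        for rank, tag in enumerate(predicted[:k], start=1)
        if tag in relevant
    )
    ideal = sum(1 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

predicted = ["a", "b", "c", "d", "e"]   # ranked output, assumed free of duplicates
relevant = {"a", "c"}                   # ground-truth tags for the article
p = precision_at_k(predicted, relevant, 5)
r = recall_at_k(predicted, relevant, 5)
```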

Configuration

All settings via TAGGER_ prefixed environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| TAGGER_MODEL_NAME | paraphrase-multilingual-MiniLM-L12-v2 | Embedding model |
| TAGGER_TOP_K_RETURN | 3 | Default tags returned |
| TAGGER_SIMILARITY_THRESHOLD | 0.1 | Minimum score |
| TAGGER_MAX_TEXT_LENGTH | 100000 | Max article size (chars) |
| TAGGER_ENABLE_HIERARCHY | true | Tag hierarchy |
| TAGGER_ENABLE_SYNONYMS | true | Synonym expansion |
| TAGGER_ENABLE_RERANKER | false | LLM reranker |
| TAGGER_ENABLE_CACHE | true | Result + embedding cache |
| TAGGER_ANTHROPIC_API_KEY | | For LLM reranker/enricher |
| TAGGER_API_KEY | | REST API auth |
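
To show how prefixed environment variables map onto typed settings, here is a standard-library sketch covering four of the variables with their documented defaults. The package itself lists pydantic-settings for this job; the dataclass, the helper names, and the set of strings accepted as "true" are assumptions of this example.

```python
import os
from dataclasses import dataclass

def _env_bool(name, default):
    """Parse a boolean variable; the accepted spellings are an assumption."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}

@dataclass(frozen=True)
class TaggerSettings:
    model_name: str
    top_k_return: int
    similarity_threshold: float
    enable_reranker: bool

def load_settings():
    """Read a few TAGGER_ variables, falling back to the documented defaults."""
    return TaggerSettings(
        model_name=os.environ.get(
            "TAGGER_MODEL_NAME", "paraphrase-multilingual-MiniLM-L12-v2"
        ),
        top_k_return=int(os.environ.get("TAGGER_TOP_K_RETURN", "3")),
        similarity_threshold=float(os.environ.get("TAGGER_SIMILARITY_THRESHOLD", "0.1")),
        enable_reranker=_env_bool("TAGGER_ENABLE_RERANKER", False),
    )
```

Setting `TAGGER_TOP_K_RETURN=5` in the shell before calling `load_settings()` would yield `top_k_return == 5`, while unset variables keep the defaults from the table.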

Project Structure

src/article_tagger/
├── tagger.py           # Core facade — orchestrates all components
├── embedder.py         # sentence-transformers wrapper
├── indexer.py          # FAISS index (Flat/IVF auto-select)
├── booster.py          # Keyword matching + composite patterns
├── text_augmenter.py   # Tag text enrichment (S/T Chinese, concept expansion)
├── tag_discovery.py    # New tag candidate extraction
├── hierarchy.py        # Parent/child resolution + cycle detection
├── active_learning.py  # EMA-based feedback weights + dedup
├── reranker.py         # LLM reranking (Claude API)
├── cooccurrence.py     # Tag co-occurrence graph
├── pipeline.py         # Pre/post processor plugins
├── cache.py            # LRU cache + persistent embedding cache
├── store.py            # Feedback + history (JSON)
├── tag_loader.py       # CSV/JSON tag loader
├── tag_quality.py      # Duplicate/orphan/isolated detection
├── tag_enricher.py     # LLM tag library enrichment
├── evaluation.py       # Benchmark framework
├── analytics.py        # Usage analytics
├── profiles.py         # Config profiles
├── exporter.py         # Multi-format export
├── visualizer.py       # t-SNE/UMAP visualization
├── state_manager.py    # State export/import
├── config.py           # pydantic-settings
├── models.py           # Data models
├── cli.py              # Typer CLI (26 commands)
├── agent_cli.py        # Agent CLI — pure JSON (atag)
├── api.py              # FastAPI (27 endpoints)
├── mcp_server.py       # MCP Server (24 tools)
├── ui.py               # Gradio Web UI (6 tabs)
├── repl.py             # Interactive REPL
├── daemon.py           # Unix socket daemon
├── watcher.py          # Directory watch mode
└── middleware.py       # API middleware
packages/
├── sdk/                # TypeScript SDK (17 methods)
├── mcp-server/         # TypeScript MCP Server (11 proxied tools)
└── tool-schemas/       # OpenAI + Anthropic tool formats

Testing

# 206 tests covering all modules
pytest tests/ -v

# Skip model-loading tests (faster, avoids torch issues on some platforms)
SKIP_MODEL_TESTS=1 pytest tests/ -q

Docker

docker compose up --build
# API at localhost:8000, Web UI at localhost:8000/ui

License

MIT
