RAG-based article tag generator using local embeddings and FAISS
Article Tagger
RAG-based article tag generator with hybrid scoring, active learning, and full agent integration.
Input an article, get the most relevant tags from your tag library — with confidence scores. Supports Chinese (Traditional/Simplified) natively.
100% accuracy on 30 test articles across 611 tags.
Install
# Python (core engine + CLI)
pip install -e .
# npm (agent integration)
npm install article-tagger-sdk # TypeScript SDK
npx article-tagger-mcp # MCP server for Claude Code / Cursor
Quick Start
# 1. Build index from tag library
article-tagger build-index --tags 51標籤庫.json
# 2. Tag an article
article-tagger tag --text "她頭上有角,背後長著翅膀,尾巴尖尖的..."
# 3. Interactive mode
article-tagger interactive
# 4. Pre-load model for faster queries
article-tagger warmup
How It Works
Three-layer hybrid scoring:
- Embedding search — sentence-transformers (multilingual) + FAISS vector similarity
- Keyword boost — exact/partial match across Simplified/Traditional Chinese variants
- Composite patterns — multi-feature detection (e.g., 角+翅膀 → 惡魔娘)
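The three layers above can be sketched in plain Python. This is a minimal illustration, not the project's code: the boost weights, the `tag` dictionary layout, and the function names are assumptions, and a hand-rolled cosine similarity stands in for the sentence-transformers + FAISS lookup. In the example, 角 means horns, 翅膀 means wings, and 惡魔娘 is the "demon girl" tag.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_score(article_text, article_vec, tag, keyword_boost=0.2, pattern_boost=0.3):
    """Combine the three layers into one score (weights are illustrative)."""
    # Layer 1: embedding similarity (stand-in for the FAISS search).
    score = cosine(article_vec, tag["vector"])
    # Layer 2: keyword boost if any listed variant appears in the text.
    if any(kw in article_text for kw in tag.get("keywords", [])):
        score += keyword_boost
    # Layer 3: composite pattern, applied only when every feature is present.
    features = tag.get("pattern", [])
    if features and all(f in article_text for f in features):
        score += pattern_boost
    return score

# Hypothetical tag entry with Traditional and Simplified keyword variants.
tag = {
    "name": "惡魔娘",
    "vector": [1.0, 0.0],
    "keywords": ["惡魔", "恶魔"],
    "pattern": ["角", "翅膀"],
}
text = "她頭上有角,背後長著翅膀"
print(round(hybrid_score(text, [1.0, 0.0], tag), 2))  # prints 1.3
```

Here the keyword layer does not fire (neither variant of 惡魔 appears in the text), but the composite pattern does, so the tag still ranks above a pure embedding match.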
Plus:
- Active Learning — EMA-weighted feedback loop with dedup, converges over time
- Tag hierarchy — child tag matched → ancestors auto-boosted (with cycle detection)
- LLM reranker — optional Claude API second pass for precision
- Text augmentation — S/T Chinese variants + concept expansion in embedding space
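Two of the items above lend themselves to a short sketch: an exponential moving average (EMA) update for feedback weights, and an ancestor walk that stops on cycles. The function names, the smoothing factor `alpha`, and the `parent_of` mapping are assumptions chosen for illustration; the project's actual update rule and data layout may differ, and feedback dedup is not shown.

```python
def ema_update(weight, accepted, alpha=0.1):
    """Move a tag's weight toward 1.0 (accepted) or 0.0 (rejected)."""
    target = 1.0 if accepted else 0.0
    return (1 - alpha) * weight + alpha * target

def ancestors(tag, parent_of):
    """Follow child -> parent links, stopping if a tag repeats (cycle)."""
    seen, chain = {tag}, []
    current = parent_of.get(tag)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain

# Repeated positive feedback pulls the weight toward 1.0 in shrinking steps.
w = 0.5
for _ in range(3):
    w = ema_update(w, accepted=True)

# A cyclic hierarchy terminates instead of looping forever.
print(ancestors("a", {"a": "b", "b": "a"}))  # prints ['b']
```

Because each update only moves the weight a fraction `alpha` of the way to its target, a single contradictory vote cannot undo a long history of consistent feedback.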
Five Interfaces
| Interface | Command | Use Case |
|---|---|---|
| CLI | article-tagger <cmd> | Human workflow, scripting |
| Agent CLI | atag <cmd> | Pure JSON, agent chaining |
| REST API | article-tagger serve | Web apps, microservices |
| MCP Server | article-tagger mcp-serve | Claude Code, Cursor |
| Web UI | localhost:8000/ui | Visual dashboard (Gradio) |
CLI Commands
| Command | Description |
|---|---|
| build-index | Build FAISS index from tag file (CSV/JSON) |
| tag | Tag single article (--rerank for LLM reranking) |
| tag-batch | Batch tag a directory |
| interactive | Interactive REPL with live feedback |
| watch | Watch directory, auto-tag new files |
| discover | New: find new tag candidates from articles |
| warmup | New: pre-load model + index into memory |
| analytics | Usage stats dashboard |
| quality | Tag library quality report |
| visualize | t-SNE/UMAP embedding visualization |
| cooccur | Tag co-occurrence analysis |
| weights | Active Learning weights |
| enrich | LLM auto-fill tag descriptions/categories |
| eval | Evaluate tagging accuracy (precision, recall, F1, MAP) |
| export | Multi-format export (CSV/Markdown/JSONL/Notion) |
| serve | Start REST API + Web UI |
| mcp-serve | Start MCP Server |
| state * | State management (export/import/incremental sync) |
| model * | Model management (list/switch/compare) |
| profile * | Config profiles (strict/broad/fast/custom) |
| daemon * | Daemon mode (start/stop/status/tag) |
Agent-Native CLI (atag)
Pure JSON output, designed for agent chaining:
# Tag
atag tag "article text..."
atag tag -f article.txt
# Batch
atag batch -d ./articles/
# Feedback loop
atag feedback "標籤名" true
# Pipe-friendly
echo "article" | atag tag | jq '.tags[0].tag_name'
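The same chaining can be done from Python. The sketch below assumes only what the jq example implies, namely that the output is a JSON object with a `tags` array whose items carry a `tag_name` field; the helper names `tag_names` and `tag_file` are hypothetical, and any other fields in the real output are ignored.

```python
import json
import subprocess

def tag_names(atag_json, limit=3):
    """Extract tag names from atag's JSON output (shape inferred from the jq example)."""
    payload = json.loads(atag_json)
    return [t["tag_name"] for t in payload.get("tags", [])[:limit]]

def tag_file(path):
    """Run `atag tag -f <path>` and return the parsed tag names."""
    result = subprocess.run(
        ["atag", "tag", "-f", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if atag exits non-zero
    )
    return tag_names(result.stdout)

# Parsing a hand-written sample, so this runs without atag installed.
sample = '{"tags": [{"tag_name": "a"}, {"tag_name": "b"}]}'
print(tag_names(sample))  # prints ['a', 'b']
```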
REST API
27 endpoints. Start with article-tagger serve.
| Method | Path | Description |
|---|---|---|
| POST | /api/tag | Tag single article |
| POST | /api/tag-batch | Batch tag (up to 100) |
| GET/POST/PUT/DELETE | /api/tags[/{id}] | Tag CRUD |
| POST | /api/build-index | Upload tag file, rebuild index |
| POST | /api/feedback | Submit feedback (with dedup) |
| GET | /api/feedback | Get feedback + stats |
| POST | /api/feedback/undo | Undo last feedback |
| DELETE | /api/feedback/{id} | Delete feedback |
| GET | /api/history | Tagging history |
| GET | /api/history/search | Search history |
| DELETE | /api/history/{id} | Delete history |
| GET | /api/export | Export history (CSV/JSON) |
| GET | /api/analytics | Analytics dashboard |
| GET | /api/quality | Tag quality report |
| GET | /api/cooccurrence | Co-occurrence pairs |
| GET | /api/cooccurrence/suggest | Tag suggestions |
| GET | /api/weights | Active Learning weights |
| GET | /api/profiles | Config profiles |
| GET | /api/health | Health check |
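A call to `POST /api/tag` might look like the following standard-library sketch. The request body field names `text` and `top_k` are assumptions (this README does not document the schema), as are the helper names; check the server's generated API docs for the real shape. Authentication via `TAGGER_API_KEY` is not shown because the header it expects is not documented here.

```python
import json
import urllib.request

def build_tag_request(text, top_k=3, base_url="http://localhost:8000"):
    """Build a POST /api/tag request; body field names are assumptions."""
    body = json.dumps({"text": text, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/tag",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def tag_article(text, top_k=3):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_tag_request(text, top_k)) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Building the request does not touch the network.
req = build_tag_request("article text...", top_k=5)
print(req.get_method(), req.full_url)  # prints POST http://localhost:8000/api/tag
```

Splitting request construction from sending keeps the payload easy to inspect and unit-test without starting `article-tagger serve`.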
MCP Server (24 tools)
For Claude Code, Cursor, or any MCP-compatible agent:
{
  "mcpServers": {
    "article-tagger": {
      "command": "python",
      "args": ["-m", "article_tagger.mcp_server"]
    }
  }
}
Or via npm:
{
  "mcpServers": {
    "article-tagger": {
      "command": "npx",
      "args": ["article-tagger-mcp"]
    }
  }
}
Tools: tag_article, build_index, search_tags, list_tags, create_tag, update_tag, delete_tag, add_feedback, undo_feedback, delete_feedback, get_stats, get_weights, search_history, delete_history, export_history, get_analytics, get_quality_report, suggest_related_tags, discover_tags, export_state, import_state, switch_model, list_models, get_pipeline_info, list_profiles
TypeScript SDK
import { ArticleTaggerClient } from "article-tagger-sdk";
const client = new ArticleTaggerClient({ baseUrl: "http://localhost:8000" });
// Tag
const result = await client.tag("article text...", 5);
console.log(result.tags);
// Feedback
await client.addFeedback("article", "tag_name", true);
// CRUD
await client.createTag({ name: "新標籤", description: "描述", category: "分類" });
await client.updateTag("123", { description: "更新描述" });
await client.deleteTag("123");
// History & Analytics
const history = await client.searchHistory("keyword");
const analytics = await client.getAnalytics(30);
// Co-occurrence
const suggestions = await client.suggestTags(["標籤A", "標籤B"]);
Tag Discovery
Semi-automatic new tag candidate extraction:
# Scan articles for frequently mentioned concepts not in tag library
article-tagger discover ./articles/ --min-freq 3 --top 10
# JSON output for automation
article-tagger discover ./articles/ --json
Reports frequency, confidence score, source files, and similar existing tags.
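The frequency filter behind `--min-freq` and `--top` can be illustrated with a small sketch. This is not the project's extraction logic: it assumes whitespace-separated terms (which does not hold for unsegmented Chinese text), counts how many articles mention each term, and omits confidence scores and similar-tag lookup. The function name and parameters are hypothetical.

```python
from collections import Counter

def discover_candidates(articles, known_tags, min_freq=3, top=10):
    """Return terms found in at least `min_freq` articles that are not tags yet."""
    known = set(known_tags)
    doc_freq = Counter()
    for text in articles:
        # Count each term once per article (document frequency).
        doc_freq.update(set(text.split()) - known)
    frequent = [(term, n) for term, n in doc_freq.items() if n >= min_freq]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(frequent, key=lambda item: (-item[1], item[0]))[:top]

articles = ["dragon knight", "dragon mage", "dragon knight elf"]
print(discover_candidates(articles, known_tags={"elf"}, min_freq=2))
# prints [('dragon', 3), ('knight', 2)]
```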
Daemon Mode
Keep model in memory for millisecond-level tagging:
article-tagger daemon start
article-tagger daemon tag --text "..." # ~10ms vs ~3s cold start
article-tagger daemon stop
State Migration
Transfer everything (index, feedback, history, weights) between machines:
# Full export/import
article-tagger state export -o bundle.tar.gz
article-tagger state import -f bundle.tar.gz
# Incremental sync
article-tagger state export-incr -o incr.tar.gz --since 2026-03-01T00:00:00
article-tagger state merge -f incr.tar.gz
Evaluation
# Create benchmark template
article-tagger eval --create-template benchmark.json
# Run evaluation (precision@k, recall@k, F1, MAP, NDCG)
article-tagger eval -b benchmark.json --top-k 5
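For reference, the ranking metrics named above follow standard definitions, sketched below for a single article. The function names are illustrative and the project's own implementation may differ in edge-case handling. MAP is the mean of `average_precision` over all benchmark articles; NDCG is omitted for brevity.

```python
def precision_at_k(predicted, relevant, k):
    """Fraction of the top-k predictions that are relevant."""
    hits = sum(1 for tag in predicted[:k] if tag in relevant)
    return hits / k if k else 0.0

def recall_at_k(predicted, relevant, k):
    """Fraction of the relevant tags found in the top-k predictions."""
    hits = sum(1 for tag in predicted[:k] if tag in relevant)
    return hits / len(relevant) if relevant else 0.0

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def average_precision(predicted, relevant):
    """Mean of precision values at each rank where a relevant tag appears."""
    hits, running = 0, 0.0
    for rank, tag in enumerate(predicted, start=1):
        if tag in relevant:
            hits += 1
            running += hits / rank
    return running / len(relevant) if relevant else 0.0

predicted = ["a", "x", "b"]   # ranked output for one article
relevant = {"a", "b"}         # ground-truth tags from the benchmark
p = precision_at_k(predicted, relevant, 3)
r = recall_at_k(predicted, relevant, 3)
print(round(p, 3), round(r, 3), round(f1(p, r), 3))  # prints 0.667 1.0 0.8
```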
Configuration
All settings via TAGGER_ prefixed environment variables:
| Variable | Default | Description |
|---|---|---|
| TAGGER_MODEL_NAME | paraphrase-multilingual-MiniLM-L12-v2 | Embedding model |
| TAGGER_TOP_K_RETURN | 3 | Default tags returned |
| TAGGER_SIMILARITY_THRESHOLD | 0.1 | Minimum score |
| TAGGER_MAX_TEXT_LENGTH | 100000 | Max article size (chars) |
| TAGGER_ENABLE_HIERARCHY | true | Tag hierarchy |
| TAGGER_ENABLE_SYNONYMS | true | Synonym expansion |
| TAGGER_ENABLE_RERANKER | false | LLM reranker |
| TAGGER_ENABLE_CACHE | true | Result + embedding cache |
| TAGGER_ANTHROPIC_API_KEY | (none) | For LLM reranker/enricher |
| TAGGER_API_KEY | (none) | REST API auth |
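To show how prefixed environment variables overlay defaults, here is a standard-library sketch. The project itself uses pydantic-settings (see `config.py` under Project Structure); this stand-in only mimics the idea for a few of the documented variables, and the `load_settings` helper and its accepted boolean spellings are assumptions.

```python
import os

# A subset of the documented defaults, keyed without the TAGGER_ prefix.
DEFAULTS = {
    "MODEL_NAME": "paraphrase-multilingual-MiniLM-L12-v2",
    "TOP_K_RETURN": 3,
    "SIMILARITY_THRESHOLD": 0.1,
    "ENABLE_RERANKER": False,
}

def load_settings(environ=os.environ, prefix="TAGGER_"):
    """Overlay TAGGER_* environment variables on the documented defaults."""
    settings = dict(DEFAULTS)
    for key, default in DEFAULTS.items():
        raw = environ.get(prefix + key)
        if raw is None:
            continue
        if isinstance(default, bool):
            # Check bool before int/float: bool is a subclass of int.
            settings[key] = raw.strip().lower() in ("1", "true", "yes", "on")
        else:
            settings[key] = type(default)(raw)
    return settings

env = {"TAGGER_TOP_K_RETURN": "5", "TAGGER_ENABLE_RERANKER": "true"}
cfg = load_settings(env)
print(cfg["TOP_K_RETURN"], cfg["ENABLE_RERANKER"])  # prints 5 True
```

In a shell, the equivalent override would be `TAGGER_TOP_K_RETURN=5 article-tagger tag --text "..."`.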
Project Structure
src/article_tagger/
├── tagger.py # Core facade — orchestrates all components
├── embedder.py # sentence-transformers wrapper
├── indexer.py # FAISS index (Flat/IVF auto-select)
├── booster.py # Keyword matching + composite patterns
├── text_augmenter.py # Tag text enrichment (S/T Chinese, concept expansion)
├── tag_discovery.py # New tag candidate extraction
├── hierarchy.py # Parent/child resolution + cycle detection
├── active_learning.py # EMA-based feedback weights + dedup
├── reranker.py # LLM reranking (Claude API)
├── cooccurrence.py # Tag co-occurrence graph
├── pipeline.py # Pre/post processor plugins
├── cache.py # LRU cache + persistent embedding cache
├── store.py # Feedback + history (JSON)
├── tag_loader.py # CSV/JSON tag loader
├── tag_quality.py # Duplicate/orphan/isolated detection
├── tag_enricher.py # LLM tag library enrichment
├── evaluation.py # Benchmark framework
├── analytics.py # Usage analytics
├── profiles.py # Config profiles
├── exporter.py # Multi-format export
├── visualizer.py # t-SNE/UMAP visualization
├── state_manager.py # State export/import
├── config.py # pydantic-settings
├── models.py # Data models
├── cli.py # Typer CLI (26 commands)
├── agent_cli.py # Agent CLI — pure JSON (atag)
├── api.py # FastAPI (27 endpoints)
├── mcp_server.py # MCP Server (24 tools)
├── ui.py # Gradio Web UI (6 tabs)
├── repl.py # Interactive REPL
├── daemon.py # Unix socket daemon
├── watcher.py # Directory watch mode
└── middleware.py # API middleware
packages/
├── sdk/ # TypeScript SDK (17 methods)
├── mcp-server/ # TypeScript MCP Server (11 proxied tools)
└── tool-schemas/ # OpenAI + Anthropic tool formats
Testing
# 206 tests covering all modules
pytest tests/ -v
# Skip model-loading tests (faster, avoids torch issues on some platforms)
SKIP_MODEL_TESTS=1 pytest tests/ -q
Docker
docker compose up --build
# API at localhost:8000, Web UI at localhost:8000/ui
License
MIT