Local IAB Content Taxonomy 2.x -> 3.0 mapper with vectors, SCD, OpenRTB/VAST exporters.
IAB Content Taxonomy Mapper (Local CLI)
Map IAB Content Taxonomy 2.x labels/codes to IAB 3.0 locally with a deterministic→fuzzy→(optional) local-embeddings pipeline. Outputs are IAB-3.0–compatible IDs suitable for OpenRTB/VAST, with optional vector attributes (Channel, Type, Format, Language, Source, Environment) and SCD awareness.
No external APIs. Runs fully local. LLMs are not required. You can enable local embeddings for tougher matches.
✨ Features
- Deterministic alias/exact matching → fuzzy string matching → optional local embeddings (Sentence-Transformers) for near-misses
- Emits IAB 3.0 IDs (not just labels) and configurable `cattax` for OpenRTB conformance
- Multi-category output per input; vector attributes support
- SCD (Sensitive Content) flag visibility and optional exclusion (`--drop-scd`)
- Exports CSV or JSON; includes OpenRTB and VAST CONTENTCAT helpers
- Local-only, reproducible, versioned catalogs
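The staged matching described in the feature list can be pictured with a minimal sketch. It is not the package's implementation: the catalog entries, the alias table, the ID `3-9`, and the function names are hypothetical, and the standard library's `difflib` stands in for whatever fuzzy matcher the tool actually uses. The embeddings stage is left out.

```python
from difflib import SequenceMatcher

# Hypothetical miniature 3.0 catalog and alias table (illustrative IDs only).
CATALOG_3X = {"3-5-2": "Food & Drink > Cooking", "3-9": "Sports"}
ALIASES = {"cooking how-to": "3-5-2"}


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so lookups are stable."""
    return " ".join(label.lower().split())


def match_label(label: str, fuzzy_cut: float = 0.92):
    """Return (id, confidence, source) or None, trying exact before fuzzy."""
    key = normalize(label)
    # Stage 1: deterministic alias / exact label lookup.
    if key in ALIASES:
        return ALIASES[key], 1.0, "exact"
    for node_id, node_label in CATALOG_3X.items():
        if normalize(node_label) == key:
            return node_id, 1.0, "exact"
    # Stage 2: fuzzy string similarity against every catalog label.
    best_id, best_score = None, 0.0
    for node_id, node_label in CATALOG_3X.items():
        score = SequenceMatcher(None, key, normalize(node_label)).ratio()
        if score > best_score:
            best_id, best_score = node_id, score
    if best_id is not None and best_score >= fuzzy_cut:
        return best_id, round(best_score, 2), "fuzzy"
    # Stage 3 (local embeddings) would run here when enabled; omitted in this sketch.
    return None
```

Running the cheap, deterministic stage first keeps results reproducible for labels that already have a known alias, and the threshold on the fuzzy stage decides when a near-miss is rejected rather than guessed.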
🔧 Install
1) Clone / unpack
unzip iab-mapper.zip && cd iab-mapper
2) Python env & install
python -m venv .venv && source .venv/bin/activate
pip install -e .
# Optional (enable local embeddings / KNN search)
pip install -e ".[emb]"
If you need fully offline installs, pre-bundle the Sentence-Transformers model in your image/host and point to it via `--emb-model` (local path).
📁 Project Layout
iab-mapper/
  pyproject.toml
  sample_2x_codes.csv
  iab_mapper/
    __init__.py
    cli.py
    pipeline.py
    matching.py
    normalize.py
    embeddings.py
    io_utils.py
    data/
      iab_2x.json
      iab_3x.json
      synonyms_2x.json
      synonyms_3x.json
      vectors_channel.json
      vectors_type.json
      vectors_format.json
      vectors_language.json
      vectors_source.json
      vectors_environment.json
Replace the stub data/*.json with your full IAB catalogs (include id, label, path, and scd on 3.0 nodes).
🚀 Quick Start
# map the sample CSV without embeddings (deterministic + fuzzy matching only)
mixpeek-iab-mapper sample_2x_codes.csv -o mapped.json
# enable local embeddings (improves recall on free-text labels)
mixpeek-iab-mapper sample_2x_codes.csv -o mapped.json --use-embeddings
The output contains for each input row:
- `out_ids` → IAB 3.0 IDs (topics + any vector IDs)
- `openrtb` → `{"content":{"cat":[...],"cattax":"<enum>"}}` (configurable via `--cattax`)
- `vast_contentcat` → `"id1","id2",...`
- Topic confidences, sources (`"exact"`/`"fuzzy"`/`"embed"`/`"override"`), SCD flags, and chosen vectors.
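To make the relationship between these fields concrete, here is a minimal sketch of how `openrtb` and `vast_contentcat` could be derived from `out_ids`. The function name `build_exports` is hypothetical and is not taken from the package; the output shapes follow the field descriptions above.

```python
def build_exports(out_ids, cattax="2"):
    """Derive the OpenRTB fragment and the VAST CONTENTCAT string from a list of IDs."""
    # OpenRTB: IDs go into content.cat, with the taxonomy enum alongside.
    openrtb = {"content": {"cat": list(out_ids), "cattax": cattax}}
    # VAST: a comma-separated list of double-quoted IDs.
    vast_contentcat = ",".join(f'"{i}"' for i in out_ids)
    return openrtb, vast_contentcat


openrtb, vast = build_exports(["3-5-2", "1026", "1068"])
```

Both exports are built from the same ordered ID list, so a downstream consumer sees identical categories whichever format it reads.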
📥 Input Formats
CSV
- Required columns: `label`
- Optional columns: `code` (2.x), `channel`, `type`, `format`, `language`, `source`, `environment`
Example:
code,label,channel,type,format,language,source,environment
1-4,Sports,editorial,article,video,en,professional,ctv
, Cooking how-to ,editorial,article,video,en,professional,web
JSON
- List of objects with the same fields as CSV.
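As an illustration of the input contract, the sketch below reads either format into a uniform list of rows and rejects rows without a `label`. The function name `read_rows` and the whitespace trimming are assumptions for this example, not documented behavior of the CLI.

```python
import csv
import io
import json

# Optional columns named in the input format description.
OPTIONAL = ("code", "channel", "type", "format", "language", "source", "environment")


def read_rows(text: str, fmt: str):
    """Parse CSV or JSON text into dicts with 'label' plus all optional fields."""
    if fmt == "json":
        raw = json.loads(text)
    else:
        raw = list(csv.DictReader(io.StringIO(text)))
    rows = []
    for i, rec in enumerate(raw, start=1):
        label = str(rec.get("label") or "").strip()
        if not label:
            raise ValueError(f"row {i}: missing required 'label'")
        row = {"label": label}
        for col in OPTIONAL:
            # Missing optional fields become empty strings.
            row[col] = str(rec.get(col) or "").strip()
        rows.append(row)
    return rows
```

With the example CSV above, the second row would come out with an empty `code` and the label `Cooking how-to`.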
📤 Output Formats
CSV
- Includes compact JSON strings for complex fields (e.g., `topic_ids`, `openrtb`).
JSON
- List of records. Example snippet:
{
  "in_code": "2-12",
  "in_label": "Food & Drink",
  "out_ids": ["3-5-2", "1026", "1068"],
  "out_labels": ["Food & Drink > Cooking"],
  "topic_ids": ["3-5-2"],
  "topic_confidence": [0.89],
  "topic_sources": ["fuzzy"],
  "topic_scd": [false],
  "vectors": {"channel":"editorial","type":"article","format":"video","language":"en","source":"professional","environment":"ctv"},
  "cattax": "2",
  "openrtb": {"content":{"cat":["3-5-2","1026","1068"],"cattax":"2"}},
  "vast_contentcat": "\"3-5-2\",\"1026\",\"1068\""
}
⚙️ Useful Flags
# thresholds
--fuzzy-cut 0.92 # 0..1 (higher = stricter)
--use-embeddings # enable local embeddings
--emb-model all-MiniLM-L6-v2
--emb-cut 0.80 # cosine similarity cut
--max-topics 3 # max topic categories per row
--drop-scd # exclude SCD nodes from results
--cattax 2 # set OpenRTB content.cattax enum for Content Taxonomy
--overrides overrides.json # JSON overrides applied before matching
--unmapped-out misses.json # write rows with no topic_ids to file
🧩 Vectors (Orthogonal Attributes)
Pass via columns or pre-fill in your CSV:
- Channel (`vectors_channel.json`): e.g., `editorial`, `ugc`
- Type (`vectors_type.json`): e.g., `article`, `podcast`, `livestream`
- Format (`vectors_format.json`): e.g., `video`, `text`, `audio`
- Language (`vectors_language.json`): e.g., `en`, `es`, `de`
- Source (`vectors_source.json`): e.g., `professional`, `brand`, `news`
- Environment (`vectors_environment.json`): e.g., `ctv`, `web`, `app`
Each value maps to a stable IAB 3.0 ID that is appended to the cat array.
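A minimal sketch of that lookup-and-append step follows. The two catalog tables and the value-to-ID pairings are hypothetical (the real mappings live in the `vectors_*.json` files), as are the function names.

```python
# Hypothetical vector catalogs: attribute -> {value: stable 3.0 ID}.
VECTOR_CATALOGS = {
    "channel": {"editorial": "1026"},
    "format": {"video": "1068"},
}


def vector_ids(row, catalogs=VECTOR_CATALOGS):
    """Look up each vector attribute on the row and collect the IDs that resolve."""
    ids = []
    for attr, table in catalogs.items():
        value = (row.get(attr) or "").strip().lower()
        if value in table:
            ids.append(table[value])  # unknown values are skipped, not guessed
    return ids


def build_cat(topic_ids, row):
    """Topic IDs first, then any vector IDs, mirroring the order in out_ids."""
    return list(topic_ids) + vector_ids(row)
```

Values with no entry in a catalog contribute nothing, so the resulting array only ever contains IDs that the catalogs define.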
✅ IAB 3.0 Conformance Notes
- Emits IDs for `content.cat` and sets `"cattax":"<enum>"`.
- Supports multiple categories per content (topic IDs + vectors).
- Strict ID validation: only IDs present in your 3.0 catalog are emitted.
- SCD-aware: show SCD flags and optionally exclude (`--drop-scd`).
This tool is not affiliated with IAB. It is an independent utility for compatibility with IAB Content Taxonomy.
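The strict ID validation and SCD handling in the conformance notes can be sketched as a single filtering pass over candidate topics. The function name, the candidate tuple shape, and the catalog layout (ID mapped to a node dict with an `scd` flag) are assumptions for illustration; the ID `9-1` is made up.

```python
def filter_topics(candidates, catalog, drop_scd=False, max_topics=3):
    """Keep the best-scoring candidates that exist in the catalog.

    candidates: list of (id, confidence) pairs
    catalog:    dict mapping 3.0 ID -> node dict containing an 'scd' flag
    """
    kept = []
    for node_id, conf in sorted(candidates, key=lambda c: c[1], reverse=True):
        node = catalog.get(node_id)
        if node is None:
            continue  # strict validation: never emit an ID the catalog lacks
        scd = bool(node.get("scd", False))
        if drop_scd and scd:
            continue  # optional exclusion of sensitive-content nodes
        kept.append({"id": node_id, "confidence": conf, "scd": scd})
    return kept[:max_topics]
```

Keeping the `scd` flag on every surviving topic means the flag stays visible in the output even when exclusion is turned off.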
🔬 Evaluation (recommended)
Create a small gold set for your domain and run periodic checks:
# (pseudo) compare mapped.json to gold.json for accuracy & unmapped rates
python scripts/eval.py mapped.json gold.json
Gate releases on accuracy deltas so behavior stays stable for audits.
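Since `scripts/eval.py` is only referenced as pseudo-code, here is one possible shape for the comparison. It assumes a gold file that is a list of records keyed by `in_label` with the expected `topic_ids`; that format and the function name are hypothetical.

```python
def evaluate(mapped, gold):
    """Compare mapped records to gold records and report accuracy and unmapped rate.

    Both arguments are lists of dicts with 'in_label' and 'topic_ids'.
    A row counts as correct only if its topic ID set equals the gold set.
    """
    expected = {g["in_label"]: set(g["topic_ids"]) for g in gold}
    total = correct = unmapped = 0
    for rec in mapped:
        if rec["in_label"] not in expected:
            continue  # rows without a gold answer are not scored
        total += 1
        got = set(rec.get("topic_ids", []))
        if not got:
            unmapped += 1
        elif got == expected[rec["in_label"]]:
            correct += 1
    if total == 0:
        return {"n": 0, "accuracy": 0.0, "unmapped_rate": 0.0}
    return {"n": total, "accuracy": correct / total, "unmapped_rate": unmapped / total}
```

Load both files with `json.load` and pass the resulting lists in; storing the returned numbers per release makes it straightforward to compare accuracy between versions.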
🛠️ Updating Catalogs
Replace the stub JSONs in iab_mapper/data/ with your official datasets:
- `iab_2x.json` → include `code`, `label`
- `iab_3x.json` → include `id`, `label`, `path[]`, `scd`
- `synonyms_*.json` → org-specific aliases
- `vectors_*.json` → official vector catalogs mapping values to stable 3.0 IDs
Commit with a version bump and note taxonomy_version in your release notes.
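Before committing a replaced catalog, a quick structural check can catch missing fields. The sketch below assumes `iab_3x.json` is a JSON list of node objects with the fields named above; that file layout and the function name are assumptions, not something the package documents.

```python
# Required fields on each 3.0 node and the JSON types this sketch expects.
REQUIRED_3X = {"id": str, "label": str, "path": list, "scd": bool}


def validate_3x(nodes):
    """Return a list of human-readable problems; an empty list means the catalog passed."""
    problems = []
    seen = set()
    for i, node in enumerate(nodes):
        for field, typ in REQUIRED_3X.items():
            if not isinstance(node.get(field), typ):
                problems.append(f"node {i}: '{field}' missing or not {typ.__name__}")
        node_id = node.get("id")
        if isinstance(node_id, str):
            if node_id in seen:
                problems.append(f"node {i}: duplicate id {node_id!r}")
            seen.add(node_id)
    return problems
```

Call it on the parsed file, for example `validate_3x(json.load(open("iab_mapper/data/iab_3x.json")))`, and fail the commit if the list is non-empty.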
🧯 Troubleshooting
- No matches: lower `--fuzzy-cut` or enable `--use-embeddings`.
- Weird matches: raise thresholds; add synonyms into `synonyms_*.json`.
- Offline: pre-bundle ST model; set `--emb-model` to a local folder path.
- CSV issues: ensure UTF-8 and header row (`label` required).
- Unmapped: inspect `--unmapped-out` and add overrides/synonyms as needed.
📦 Example Commands
# Strict fuzzy only
mixpeek-iab-mapper sample_2x_codes.csv -o mapped.csv --fuzzy-cut 0.95
# Embeddings on, drop SCD, max 2 topics, custom cattax, collect unmapped
mixpeek-iab-mapper sample_2x_codes.csv -o mapped.json --use-embeddings --drop-scd --max-topics 2 --cattax 2 --unmapped-out misses.json
📜 License
TBD by Mixpeek. Include IAB attribution in your deployed UI/footer:
“IAB is a registered trademark of the Interactive Advertising Bureau. This tool is an independent utility built by Mixpeek for interoperability with IAB Content Taxonomy standards.”