Local IAB Content Taxonomy 2.x -> 3.0 mapper with vectors, SCD, OpenRTB/VAST exporters.

IAB Content Taxonomy Mapper (Local CLI)

Map IAB Content Taxonomy 2.x labels/codes to IAB 3.0 locally with a deterministic→fuzzy→(optional) local-embeddings pipeline. Outputs are IAB-3.0–compatible IDs suitable for OpenRTB/VAST, with optional vector attributes (Channel, Type, Format, Language, Source, Environment) and SCD awareness.

No external APIs. Runs fully local. LLMs are not required. You can enable local embeddings for tougher matches.


✨ Features

  • Deterministic alias/exact matching → fuzzy string matching → optional local embeddings (Sentence-Transformers) for near-misses
  • Emits IAB 3.0 IDs (not just labels) and configurable cattax for OpenRTB conformance
  • Multi-category output per input; vector attributes support
  • SCD (Sensitive Content) flag visibility and optional exclusion (--drop-scd)
  • Exports CSV or JSON; includes OpenRTB and VAST CONTENTCAT helpers
  • Local-only, reproducible, versioned catalogs
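
The matching cascade described above can be sketched in a few lines. This is a minimal illustration, not the package's actual code: CATALOG, ALIASES, normalize and match_label are hypothetical names, the catalog entries are made up, and the standard library's difflib stands in for whatever fuzzy matcher the tool really uses. The embeddings stage is left as a comment.

```python
from difflib import SequenceMatcher

# Hypothetical miniature 3.0 catalog and alias table (not real IAB data).
CATALOG = {"3-5-2": "Cooking", "3-9": "Sports"}
ALIASES = {"cookery": "3-5-2"}


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so comparisons are stable."""
    return " ".join(label.lower().split())


def match_label(label, fuzzy_cut=0.92):
    """Return (id, confidence, source), or None when nothing clears the cut."""
    key = normalize(label)

    # Stage 1: deterministic alias / exact label match.
    if key in ALIASES:
        return ALIASES[key], 1.0, "exact"
    for cid, name in CATALOG.items():
        if normalize(name) == key:
            return cid, 1.0, "exact"

    # Stage 2: fuzzy string similarity against every catalog label.
    best_id, best_score = None, 0.0
    for cid, name in CATALOG.items():
        score = SequenceMatcher(None, key, normalize(name)).ratio()
        if score > best_score:
            best_id, best_score = cid, score
    if best_id is not None and best_score >= fuzzy_cut:
        return best_id, round(best_score, 2), "fuzzy"

    # Stage 3 (optional): an embeddings lookup would go here.
    return None


print(match_label(" Sports "))
print(match_label("Sport", fuzzy_cut=0.8))
print(match_label("Sport"))
```

Running the cheap deterministic stage first keeps results reproducible; the fuzzy stage is only consulted for labels that have no exact or alias hit.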

🔧 Install

1) Clone / unpack

unzip iab-mapper.zip && cd iab-mapper

2) Python env & install

python -m venv .venv && source .venv/bin/activate
pip install -e .
# Optional (enable local embeddings / KNN search)
pip install -e ".[emb]"

If you need fully offline installs, pre-bundle the Sentence-Transformers model in your image/host and point to it via --emb-model (local path).


📁 Project Layout

iab-mapper/
  pyproject.toml
  sample_2x_codes.csv
  iab_mapper/
    __init__.py
    cli.py
    pipeline.py
    matching.py
    normalize.py
    embeddings.py
    io_utils.py
    data/
      iab_2x.json
      iab_3x.json
      synonyms_2x.json
      synonyms_3x.json
      vectors_channel.json
      vectors_type.json
      vectors_format.json
      vectors_language.json
      vectors_source.json
      vectors_environment.json

Replace the stub data/*.json with your full IAB catalogs (include id, label, path, and scd on 3.0 nodes).


🚀 Quick Start

# map the sample CSV with deterministic and fuzzy matching only (embeddings off)
mixpeek-iab-mapper sample_2x_codes.csv -o mapped.json

# enable local embeddings (improves recall on free-text labels)
mixpeek-iab-mapper sample_2x_codes.csv -o mapped.json --use-embeddings

For each input row, the output contains:

  • out_ids: IAB 3.0 IDs (topics + any vector IDs)
  • openrtb: {"content":{"cat":[...],"cattax":"<enum>"}} (configurable via --cattax)
  • vast_contentcat: "id1","id2",...
  • Topic confidences, sources ("exact"/"fuzzy"/"embed"/"override"), SCD flags, and chosen vectors.
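
To make the relationship between these fields concrete, here is a small sketch of how the openrtb object and the vast_contentcat string could be derived from a list of output IDs. The function name build_exports is hypothetical; only the output shapes are taken from the description above.

```python
import json


def build_exports(out_ids, cattax="2"):
    """Derive the OpenRTB fragment and the VAST CONTENTCAT string from IDs."""
    openrtb = {"content": {"cat": list(out_ids), "cattax": cattax}}
    # VAST CONTENTCAT: each ID wrapped in double quotes, comma-separated.
    vast_contentcat = ",".join(f'"{i}"' for i in out_ids)
    return openrtb, vast_contentcat


openrtb, vast = build_exports(["3-5-2", "1026", "1068"])
print(json.dumps(openrtb, separators=(",", ":")))
print(vast)
```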

📥 Input Formats

CSV

  • Required columns: label
  • Optional columns: code (2.x), channel, type, format, language, source, environment

Example:

code,label,channel,type,format,language,source,environment
1-4,Sports,editorial,article,video,en,professional,ctv
, Cooking how-to ,editorial,article,video,en,professional,web

JSON

  • List of objects with the same fields as CSV.
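
If you want to sanity-check an input file before running the mapper, a reader along these lines may help. It is a sketch under assumptions: read_rows is a hypothetical helper, and trimming whitespace around values is an illustrative choice, not a statement about what the tool does internally.

```python
import csv
import io

SAMPLE = """code,label,channel,type,format,language,source,environment
1-4,Sports,editorial,article,video,en,professional,ctv
, Cooking how-to ,editorial,article,video,en,professional,web
"""


def read_rows(fh):
    """Read CSV rows, require a 'label' column, and trim every value."""
    reader = csv.DictReader(fh)
    if "label" not in (reader.fieldnames or []):
        raise ValueError("input CSV needs a 'label' column")
    rows = []
    for row in reader:
        # Skip the None key that DictReader uses for surplus columns.
        clean = {k: (v or "").strip() for k, v in row.items() if k is not None}
        if clean["label"]:  # ignore rows whose label is empty
            rows.append(clean)
    return rows


rows = read_rows(io.StringIO(SAMPLE))
for r in rows:
    print(repr(r["code"]), repr(r["label"]))
```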

📤 Output Formats

CSV

  • Includes compact JSON strings for complex fields (e.g., topic_ids, openrtb).
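
Because complex fields arrive as JSON strings inside CSV cells, a consumer has to decode them after reading the file. The sketch below assumes the column names topic_ids and openrtb mentioned above; load_mapped_csv is a hypothetical helper, and the round trip through an in-memory buffer is only there to keep the example self-contained.

```python
import csv
import io
import json


def load_mapped_csv(fh, json_columns=("topic_ids", "openrtb")):
    """Read a mapped CSV and decode the columns that hold JSON strings."""
    records = []
    for row in csv.DictReader(fh):
        for col in json_columns:
            if row.get(col):
                row[col] = json.loads(row[col])
        records.append(row)
    return records


# Build a one-row CSV in memory, then read it back.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["in_label", "topic_ids", "openrtb"])
writer.writeheader()
writer.writerow({
    "in_label": "Food & Drink",
    "topic_ids": json.dumps(["3-5-2"]),
    "openrtb": json.dumps({"content": {"cat": ["3-5-2"], "cattax": "2"}}),
})
buf.seek(0)
records = load_mapped_csv(buf)
print(records[0]["topic_ids"], records[0]["openrtb"]["content"]["cattax"])
```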

JSON

  • List of records. Example snippet:
{
  "in_code": "2-12",
  "in_label": "Food & Drink",
  "out_ids": ["3-5-2", "1026", "1068"],
  "out_labels": ["Food & Drink > Cooking"],
  "topic_ids": ["3-5-2"],
  "topic_confidence": [0.89],
  "topic_sources": ["fuzzy"],
  "topic_scd": [false],
  "vectors": {"channel":"editorial","type":"article","format":"video","language":"en","source":"professional","environment":"ctv"},
  "cattax": "2",
  "openrtb": {"content":{"cat":["3-5-2","1026","1068"],"cattax":"2"}},
  "vast_contentcat": "\"3-5-2\",\"1026\",\"1068\""
}

⚙️ Useful Flags

# thresholds
--fuzzy-cut 0.92          # 0..1 (higher = stricter)
--use-embeddings          # enable local embeddings
--emb-model all-MiniLM-L6-v2
--emb-cut 0.80            # cosine similarity cut
--max-topics 3            # max topic categories per row
--drop-scd                # exclude SCD nodes from results
--cattax 2                # set OpenRTB content.cattax enum for Content Taxonomy
--overrides overrides.json   # JSON overrides applied before matching
--unmapped-out misses.json   # write rows with no topic_ids to file
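
As a rough mental model of how a similarity cut and --max-topics interact, consider the sketch below. select_topics is a hypothetical function and the candidate scores are invented; the tool's real selection logic may differ.

```python
def select_topics(candidates, cut=0.80, max_topics=3):
    """Keep (id, score) pairs at or above the cut, best first, capped in count."""
    kept = [c for c in candidates if c[1] >= cut]
    kept.sort(key=lambda c: c[1], reverse=True)
    return kept[:max_topics]


candidates = [("a", 0.95), ("b", 0.79), ("c", 0.85), ("d", 0.81), ("e", 0.99)]
print(select_topics(candidates))                # cut 0.80, at most 3 topics
print(select_topics(candidates, cut=0.9))       # stricter cut, fewer survivors
print(select_topics(candidates, max_topics=1))  # only the best candidate
```

Raising the cut trades recall for precision, while lowering --max-topics only trims the tail of candidates that already passed.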

🧩 Vectors (Orthogonal Attributes)

Supply these as columns in your input CSV (or as fields in JSON input):

  • Channel (vectors_channel.json): e.g., editorial, ugc
  • Type (vectors_type.json): e.g., article, podcast, livestream
  • Format (vectors_format.json): e.g., video, text, audio
  • Language (vectors_language.json): e.g., en, es, de
  • Source (vectors_source.json): e.g., professional, brand, news
  • Environment (vectors_environment.json): e.g., ctv, web, app

Each value maps to a stable IAB 3.0 ID that is appended to the cat array.
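
The append step can be pictured as follows. The lookup tables here are hypothetical stand-ins for the vectors_*.json files (the pairing of "editorial" with 1026 and "video" with 1068 is an assumption made for illustration), and build_cat is not a function from the package.

```python
# Hypothetical value -> ID tables; the real ones live in vectors_*.json.
VECTOR_IDS = {
    "channel": {"editorial": "1026"},
    "format": {"video": "1068"},
}
VECTOR_FIELDS = ("channel", "type", "format", "language", "source", "environment")


def build_cat(topic_ids, row):
    """Return topic IDs followed by the IDs of any recognised vector values."""
    cat = list(topic_ids)
    for field in VECTOR_FIELDS:
        value = (row.get(field) or "").strip().lower()
        vid = VECTOR_IDS.get(field, {}).get(value)
        if vid and vid not in cat:  # skip unknown values and duplicates
            cat.append(vid)
    return cat


row = {"channel": "editorial", "format": "video", "language": "en"}
print(build_cat(["3-5-2"], row))
```

Values with no entry in the tables (here, language "en") are simply skipped rather than guessed.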


✅ IAB 3.0 Conformance Notes

  • Emits IDs for content.cat and sets "cattax":"<enum>".
  • Supports multiple categories per content (topic IDs + vectors).
  • Strict ID validation: only IDs present in your 3.0 catalog are emitted.
  • SCD-aware: show SCD flags and optionally exclude (--drop-scd).
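
A minimal sketch of the last two points, strict ID validation and SCD exclusion, might look like this. The catalog entries and the filter_ids helper are hypothetical; only the id and scd field names come from this README.

```python
# Hypothetical 3.0 catalog nodes; real entries come from iab_3x.json.
CATALOG_3X = [
    {"id": "3-5-2", "label": "Cooking", "path": ["Food & Drink", "Cooking"], "scd": False},
    {"id": "9-9", "label": "Example sensitive topic", "path": ["Example"], "scd": True},
]


def filter_ids(ids, catalog, drop_scd=False):
    """Keep only IDs present in the catalog; optionally drop SCD nodes."""
    by_id = {node["id"]: node for node in catalog}
    out = []
    for i in ids:
        node = by_id.get(i)
        if node is None:
            continue  # unknown ID: never emitted
        if drop_scd and node.get("scd"):
            continue  # sensitive node excluded on request
        out.append(i)
    return out


print(filter_ids(["3-5-2", "9-9", "bogus"], CATALOG_3X))
print(filter_ids(["3-5-2", "9-9", "bogus"], CATALOG_3X, drop_scd=True))
```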

This tool is not affiliated with IAB. It is an independent utility for compatibility with IAB Content Taxonomy.


🔬 Evaluation (recommended)

Create a small gold set for your domain and run periodic checks:

# (pseudo) compare mapped.json to gold.json for accuracy & unmapped rates
python scripts/eval.py mapped.json gold.json

Gate releases on accuracy deltas so behavior stays stable for audits.
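
No evaluation script is shown in the project layout, so the following is one possible shape for it rather than the package's own. It assumes gold.json is a list of objects with in_label and topic_ids (an assumed format), and it counts a row as correct only when the predicted set of topic IDs equals the gold set.

```python
def evaluate(mapped, gold):
    """Compare mapped records to a gold set keyed by in_label."""
    gold_by_label = {g["in_label"]: set(g["topic_ids"]) for g in gold}
    scored = correct = unmapped = 0
    for rec in mapped:
        expected = gold_by_label.get(rec["in_label"])
        if expected is None:
            continue  # no gold answer for this row
        scored += 1
        predicted = set(rec.get("topic_ids") or [])
        if not predicted:
            unmapped += 1
        elif predicted == expected:
            correct += 1
    if scored == 0:
        return {"n": 0, "accuracy": 0.0, "unmapped_rate": 0.0}
    return {
        "n": scored,
        "accuracy": correct / scored,
        "unmapped_rate": unmapped / scored,
    }


mapped = [
    {"in_label": "Food & Drink", "topic_ids": ["3-5-2"]},
    {"in_label": "Sports", "topic_ids": []},
]
gold = [
    {"in_label": "Food & Drink", "topic_ids": ["3-5-2"]},
    {"in_label": "Sports", "topic_ids": ["3-9"]},
]
print(evaluate(mapped, gold))
```

In practice you would load both lists with json.load and fail the build when accuracy drops below the previous release's figure.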


🛠️ Updating Catalogs

Replace the stub JSONs in iab_mapper/data/ with your official datasets:

  • iab_2x.json → include code, label
  • iab_3x.json → include id, label, path[], scd
  • synonyms_*.json → org-specific aliases
  • vectors_*.json → official vector catalogs mapping values to stable 3.0 IDs

Commit with a version bump and note taxonomy_version in your release notes.
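
Before committing a new catalog, a quick structural check can catch missing fields early. The sketch assumes iab_3x.json is a flat list of node objects, which this README does not confirm; validate_3x is a hypothetical helper.

```python
REQUIRED_3X = ("id", "label", "path", "scd")


def validate_3x(nodes):
    """Return a list of human-readable problems found in 3.0 catalog nodes."""
    problems = []
    seen = set()
    for idx, node in enumerate(nodes):
        missing = [k for k in REQUIRED_3X if k not in node]
        if missing:
            problems.append(f"node {idx}: missing {', '.join(missing)}")
        nid = node.get("id")
        if nid is not None:
            if nid in seen:
                problems.append(f"node {idx}: duplicate id {nid}")
            seen.add(nid)
    return problems


nodes = [
    {"id": "3-5-2", "label": "Cooking", "path": ["Food & Drink", "Cooking"], "scd": False},
    {"id": "3-5-2", "label": "Cooking (copy)", "path": [], "scd": False},
    {"id": "3-9", "label": "Sports"},
]
for problem in validate_3x(nodes):
    print(problem)
```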


🧯 Troubleshooting

  • No matches: lower --fuzzy-cut or enable --use-embeddings.
  • Weird matches: raise thresholds; add synonyms into synonyms_*.json.
  • Offline: pre-bundle ST model; set --emb-model to a local folder path.
  • CSV issues: ensure UTF-8 and header row (label required).
  • Unmapped: inspect --unmapped-out and add overrides/synonyms as needed.

📦 Example Commands

# Strict fuzzy only
mixpeek-iab-mapper sample_2x_codes.csv -o mapped.csv --fuzzy-cut 0.95

# Embeddings on, drop SCD, max 2 topics, custom cattax, collect unmapped
mixpeek-iab-mapper sample_2x_codes.csv -o mapped.json --use-embeddings --drop-scd --max-topics 2 --cattax 2 --unmapped-out misses.json

📜 License

TBD by Mixpeek. Include IAB attribution in your deployed UI/footer:

“IAB is a registered trademark of the Interactive Advertising Bureau. This tool is an independent utility built by Mixpeek for interoperability with IAB Content Taxonomy standards.”
