Extract field-level metadata from any Looker instance via the API

Project description

looker-fields

The compiled truth about every field in your Looker instance — with an output schema you own.

Hand-rolling Looker metadata pipelines is a tax. Parsing raw .lkml files gives wrong answers because it skips what Looker resolves at compile time (include: resolution, refinements, view aliasing). The official SDK gives you raw API JSON, not analysis-ready rows. And every team rewrites the same field flattener, missing the same edge cases and hitting the same duplication bug.

looker-fields extracts every field — dimensions, measures, dimension groups, filters, parameters — across every model and explore, with correct model attribution and cross-explore visibility. The output schema is yours: it's a YAML manifest you can edit, override, regenerate.

pip install looker-fields
looker-fields extract -o all_fields.jsonl     # 12K+ fields in seconds, zero dupes

What you get

One row per (project, model, explore, field) with 49 columns covering:

Identity — fully-qualified name, view, original view (after from: aliasing), source LookML file
Classification — category (dimension/measure/filter/parameter), type, is_numeric, is_timeframe, primary_key
Display — label, label_short, group_label, hidden, value_format, value_format_name
LookML source — sql expression (if you have see_lookml), source_file_path, scope
Quality signals — times_used (dead-field detection), total_times_used, tags
Cross-explore visibility — seen_in_model_count, seen_in_explore_count, seen_models[], seen_explores[] — answers "where is this field actually used?"
Explore context — explore label, description, connection, base view
Provenance — extracted_at timestamp, schema_version

Sample row (JSONL):

{"project_name":"thelook","model_name":"thelook","explore_name":"order_items","field_name":"users.email","category":"dimension","field_type":"string","label":"Users Email","view_name":"users","original_view":"users","sql":"${TABLE}.email","source_file_path":"thelook/views/users.view.lkml","primary_key":false,"sortable":true,"can_filter":true,"times_used":1234,"seen_in_explore_count":5,"seen_models":["thelook"],"seen_explores":["thelook::events","thelook::order_items","thelook::orders","thelook::sessions","thelook::users"]}

Use cases

You want to...	Use the column(s)...
Find dead fields nobody uses	`times_used = 0`
Map field lineage across explores	`seen_explores[]`
Audit which fields expose PII	`tags`, `description`, regex on `sql`
Feed a data catalog / metric registry	join on `(model_name, explore_name, field_name)`
Detect when a LookML refactor changed something	diff JSONL snapshots across runs
Audit silent refinement drift across the instance (v0.2.1+)	`definition_variant_count > 1`
Trace a field across `from:` join aliases (v0.2.1+)	group by `(original_view, leaf_name)` or use `definition_appearances_count`
Track Looker API drift after an upgrade	`looker-fields refresh-schema`
Build a BI cost model	aggregate `total_times_used` by `view_name`
Push fresh metadata to BigQuery for governance	`looker-fields extract --format bq ...`
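
As a rough illustration of the first row in the table, the sketch below scans JSONL rows for `times_used = 0`. The three sample rows are hand-written stand-ins for an extract file, and `load_rows` and `dead_fields` are hypothetical helper names, not part of looker-fields.

```python
import json

# Three hand-written rows standing in for lines of an extract file.
SAMPLE_JSONL = """\
{"model_name": "thelook", "explore_name": "order_items", "field_name": "users.email", "times_used": 1234}
{"model_name": "thelook", "explore_name": "order_items", "field_name": "users.fax", "times_used": 0}
{"model_name": "thelook", "explore_name": "orders", "field_name": "orders.legacy_flag", "times_used": 0}
"""


def load_rows(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def dead_fields(rows):
    """Return the (model, explore, field) key of every row with times_used == 0.

    A missing times_used is treated as 0 here, which is an assumption of
    this sketch rather than documented behavior.
    """
    return [
        (r["model_name"], r["explore_name"], r["field_name"])
        for r in rows
        if r.get("times_used", 0) == 0
    ]


rows = load_rows(SAMPLE_JSONL)
for key in dead_fields(rows):
    print("unused:", key)
```

To run this against a real extract, replace `SAMPLE_JSONL` with the contents of the output file, for example `open("all_fields.jsonl").read()`.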

Why this is different

Approach	Resolves `include:`	Correct model attribution	Cross-explore visibility	Schema you own
Parse raw `.lkml` files	❌	❌	❌	manual
Drive the official `looker_sdk` directly	✅	⚠️ (default)	❌	none — raw API
Build your own flattener	✅	⚠️ (easy to mess up)	❌	yours, but you wrote it
`looker-fields`	✅	✅ (by construction)	✅	YAML manifest, codegen'd

The duplication bug that breaks naive pipelines: an explore can be defined in model_A and also surfaced in model_B via include:. Naive code keys by (project, explore, field), so the same key matches rows from both models and any join on it multiplies rows. looker-fields keys by (project, model, explore, field), where model is always the extraction loop's iteration variable, never the API response's nullable explore.model_name. With that key, duplication is impossible by construction.
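
The keying argument can be sketched in a few lines. This is an illustration under stated assumptions, not the package's actual extraction code: `FAKE_INSTANCE`, `extract_rows`, and `collisions` are hypothetical names, and the nested dict only imitates the shape of an API response.

```python
from collections import Counter

# Hypothetical stand-in for API responses: the "orders" explore is surfaced
# under two models, and the response's own model_name may be None.
FAKE_INSTANCE = {
    "model_a": [{"name": "orders", "model_name": "model_a", "fields": ["users.email"]}],
    "model_b": [{"name": "orders", "model_name": None, "fields": ["users.email"]}],
}


def extract_rows(instance, project="demo"):
    """Stamp each row with the loop's model name, never the response's."""
    rows = []
    for model_name, explores in instance.items():  # loop variable is ground truth
        for explore in explores:
            for field_name in explore["fields"]:
                rows.append({
                    "project": project,
                    "model": model_name,  # not explore["model_name"], which can be None
                    "explore": explore["name"],
                    "field": field_name,
                })
    return rows


def collisions(rows, key_columns):
    """Return every key that identifies more than one row."""
    counts = Counter(tuple(r[c] for c in key_columns) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}


rows = extract_rows(FAKE_INSTANCE)
naive = collisions(rows, ("project", "explore", "field"))
full = collisions(rows, ("project", "model", "explore", "field"))
print(naive)  # {('demo', 'orders', 'users.email'): 2}
print(full)   # {}
```

Any key that appears in `naive` would match two rows in a join, which is where the row multiplication comes from. The four-column key has no such collisions in this example.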

Field Identity Semantics (v0.2.1+)

Three distinct identity flavors. Conflating them silently misleads on heavily-refined LookML.

Identity flavor	Tuple	Answers	Column(s)
Appearance	`(project, model, explore, field_name)`	"Where is this field visible in the catalog?"	row grain — 1 row per tuple, never collapsed
Definition	`(original_view, leaf_name)` + content hash	"What LookML source produced this field?"	`definition_hash`, `definition_variant_count`, `definition_appearances_count`
Logical	`field_name` alone	"Is this 'the same field' across the instance?"	`seen_in_model_count`, `seen_in_explore_count`, `seen_models[]`, `seen_explores[]`

The v0.2.0 row grain was correct — every appearance is preserved 1:1, no rows dropped. But the seen_in_* summary columns are keyed by field_name alone. When a refinement in model_B adds a pii tag or replaces the SQL on users.email, both model_A.users.email and model_B.users.email rows stamp seen_in_explore_count=2 — implying uniform definition. That was the silent drift. v0.2.1 makes it queryable.
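
To make the Definition flavor concrete, here is a minimal sketch of how variant and appearance counts could be derived from extracted rows. It is an assumption-laden illustration: which attributes feed the content hash (`sql` and `tags` here), the use of SHA-256, and the helper names `definition_hash` and `annotate` are all choices made for this example, not a description of the package's internals.

```python
import hashlib
import json
from collections import defaultdict

# Hand-written rows: model_b refines users.email with a pii tag, and
# customers.email is the same view reached through a from: alias.
SAMPLE_ROWS = [
    {"model_name": "model_a", "field_name": "users.email", "original_view": "users",
     "sql": "${TABLE}.email", "tags": []},
    {"model_name": "model_b", "field_name": "users.email", "original_view": "users",
     "sql": "${TABLE}.email", "tags": ["pii"]},
    {"model_name": "model_a", "field_name": "customers.email", "original_view": "users",
     "sql": "${TABLE}.email", "tags": []},
]


def definition_hash(row, content_keys=("sql", "tags")):
    """Hash the attributes assumed to define a field's content."""
    payload = json.dumps({k: row.get(k) for k in content_keys}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def annotate(rows):
    """Group by (original_view, leaf_name) and attach per-group counts."""
    groups = defaultdict(list)
    for row in rows:
        leaf_name = row["field_name"].rpartition(".")[2]
        groups[(row["original_view"], leaf_name)].append(row)

    annotated = []
    for members in groups.values():
        distinct_hashes = {definition_hash(m) for m in members}
        for m in members:
            annotated.append({
                **m,
                "definition_hash": definition_hash(m),
                "definition_variant_count": len(distinct_hashes),
                "definition_appearances_count": len(members),
            })
    return annotated


annotated = annotate(SAMPLE_ROWS)
for r in annotated:
    print(r["model_name"], r["field_name"],
          r["definition_variant_count"], r["definition_appearances_count"])
```

In this sample all three rows share the definition tuple `("users", "email")`, so each reports two variants (with and without the pii tag) and three appearances, even though a count keyed by `field_name` alone would see `customers.email` as a separate field.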

One-query drift audit

SELECT field_name,
       seen_in_explore_count,        -- old logical answer
       definition_variant_count,     -- new content-drift answer
       definition_appearances_count  -- new cross-alias lineage answer
FROM read_json_auto('extract.jsonl')
WHERE definition_variant_count > 1
ORDER BY definition_variant_count DESC, seen_in_explore_count DESC;

Empirical baseline on one real 12,731-field instance: 9.6% of rows had definition_variant_count > 1 (silent refinement drift); 40.2% had definition_appearances_count > 1 (cross-alias semantics seen_in_* couldn't surface).

What we can't tell you (honest limits)

The Looker API does not expose extends_chain, included_via, or any refinement-chain attribution — only the composed result. So definition_hash will split rows whose semantic content actually differs, but cannot tell you which refinement or include caused the divergence. For that, parse the LookML repo directly.

Install

pip install looker-fields

Or for development:

git clone https://github.com/luutuankiet/looker-fields-extraction.git
cd looker-fields-extraction
pip install -e ".[dev]"

Setup

Create .env:

LOOKER_BASE_URL=https://your-instance.cloud.looker.com
LOOKER_CLIENT_ID=your_client_id
LOOKER_CLIENT_SECRET=your_client_secret

API credentials: Looker → Admin → Users → your user → "Edit Keys" → "New API3 Key".

Quickstart

# Show what your instance has
looker-fields info

# Extract everything (JSONL is the default; -o is short for --output)
looker-fields extract -o all_fields.jsonl

# Single model / explore
looker-fields extract --model my_model --explore my_explore -o slice.jsonl

# Round-trip verify a specific explore (re-fetches, diffs, exits 0/1)
looker-fields verify my_model my_explore -o all_fields.jsonl

# Push to BigQuery
looker-fields extract --format bq -o my_project.my_dataset.fields

# Dump one explore's raw API JSON for offline debugging
looker-fields dump my_model my_explore -o raw.json

The manifest is your contract

Most metadata tools force you to live with their output schema. This one inverts that: the output is defined by src/looker_fields/manifest/fields.yaml, which ships as a bundled default that you can override entirely.

# manifest/fields.yaml (excerpt)
columns:
  - name: model_name
    type: str
    api_source: context.model_name   # extraction-loop ground truth (never null)
    default: ''
    description: Always from explore context — the fix for duplication

  - name: times_used
    type: int
    api_source: field.times_used
    default: 0
    description: Count of query usage. Valuable for identifying dead fields

Want to add a column? Edit the YAML.
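
One way to picture how a manifest column drives the output is the sketch below. It mirrors the two columns from the excerpt as plain Python data (to avoid a YAML dependency) and resolves each `api_source` as a dotted path, falling back to `default`. The `resolve` and `project` helpers are hypothetical; the package's real mapper, projection.project_field, may work differently.

```python
# The two columns from the manifest excerpt, written as plain Python data.
COLUMNS = [
    {"name": "model_name", "type": "str", "api_source": "context.model_name", "default": ""},
    {"name": "times_used", "type": "int", "api_source": "field.times_used", "default": 0},
]

CASTS = {"str": str, "int": int}


def resolve(path, sources):
    """Walk a dotted path such as 'field.times_used'; None if any hop is missing."""
    current = sources
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def project(columns, sources):
    """Build one output row: resolve each api_source, else use the column default."""
    row = {}
    for col in columns:
        value = resolve(col["api_source"], sources)
        if value is None:
            value = col["default"]
        row[col["name"]] = CASTS[col["type"]](value)
    return row


# The field payload has no times_used, so the manifest default applies.
sources = {"context": {"model_name": "thelook"}, "field": {"name": "users.email"}}
row = project(COLUMNS, sources)
print(row)  # {'model_name': 'thelook', 'times_used': 0}
```

Under this reading, adding a column means adding one more entry with a name, a type, a source path, and a default.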

# Use a custom manifest for one invocation
looker-fields extract --manifest-path ./my_manifest.yaml

# Or install it permanently to XDG config
cp my_manifest.yaml ~/.config/looker-fields/manifest.yaml

# Or set per-invocation via env
LOOKER_FIELDS_MANIFEST=./my_manifest.yaml looker-fields extract

# Regenerate the typed FieldRecord pydantic class to match your manifest
looker-fields regen-types

# Next invocation dynamic-imports your custom contract from
# ~/.cache/looker-fields/_fieldrecord/types.py
# (revert: rm that file)

The manifest is resolved through a four-step chain: CLI flag, then env var, then XDG config, then the bundled default. The first location that is set wins, so the result is predictable.
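
The precedence order can be sketched as a single function. The env var name and the XDG path come from the commands above; the function name `resolve_manifest`, its parameters, and the `"<bundled default>"` placeholder are assumptions made for illustration.

```python
import os
from pathlib import Path

ENV_VAR = "LOOKER_FIELDS_MANIFEST"
XDG_MANIFEST = Path.home() / ".config" / "looker-fields" / "manifest.yaml"


def resolve_manifest(cli_path=None, env=None, xdg_path=XDG_MANIFEST,
                     bundled="<bundled default>"):
    """Return (source, path) for the first manifest location that is set.

    Order: CLI flag > env var > XDG config file > bundled default.
    """
    env = os.environ if env is None else env
    if cli_path:
        return ("cli", str(cli_path))
    if env.get(ENV_VAR):
        return ("env", env[ENV_VAR])
    if Path(xdg_path).is_file():
        return ("xdg", str(xdg_path))
    return ("bundled", bundled)


# A CLI flag beats the env var even when both are present.
print(resolve_manifest(cli_path="./my_manifest.yaml",
                       env={ENV_VAR: "./other.yaml"}))
# ('cli', './my_manifest.yaml')
```

Passing `env` and `xdg_path` as parameters keeps the function easy to test without touching the real environment or home directory.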

Drift detection at both ends

When Looker upgrades and the API changes:

# Fetch fresh swagger, run TWO drift detectors:
#   v1 — does the swagger still carry every path the extractor depends on?
#   v2 — does every manifest api_source still resolve against the live swagger?
looker-fields refresh-schema

When you want to know if there are new API attributes you could add to your manifest:

# Surfaces additions: swagger attrs the manifest doesn't reference yet.
looker-fields refresh-manifest

Both commands surface signal. Neither auto-writes — you decide.
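
At its core, the manifest side of both checks is a pair of set differences. The sketch below assumes a simplified swagger fragment and made-up attribute names (`new_attr`, `removed_attr`); `flatten_properties` and `detect_drift` are hypothetical helpers, not the package's detectors.

```python
def flatten_properties(schema, prefix):
    """Turn one schema's property names into dotted names like 'field.times_used'."""
    return {f"{prefix}.{name}" for name in schema.get("properties", {})}


def detect_drift(api_sources, available, prefix="field."):
    """Compare manifest api_source values against attributes the API offers.

    Sources outside the prefix (for example context.*) come from the
    extraction loop rather than the API, so they are not checked.
    """
    checked = {s for s in api_sources if s.startswith(prefix)}
    return {
        "unresolved": sorted(checked - available),  # manifest points at something gone
        "additions": sorted(available - checked),   # API offers something unreferenced
    }


# Made-up swagger fragment and manifest sources for illustration only.
FAKE_FIELD_SCHEMA = {"properties": {"name": {}, "times_used": {}, "new_attr": {}}}
MANIFEST_SOURCES = ["context.model_name", "field.times_used", "field.removed_attr"]

report = detect_drift(MANIFEST_SOURCES, flatten_properties(FAKE_FIELD_SCHEMA, "field"))
print(report)
# {'unresolved': ['field.removed_attr'], 'additions': ['field.name', 'field.new_attr']}
```

The `unresolved` list corresponds to what refresh-schema would flag, and `additions` to what refresh-manifest would surface. Like the commands, the function only reports and changes nothing.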

Output formats

Format	Flag	Use case
JSONL	`--format jsonl` (default)	Streaming, DuckDB, jq
CSV	`--format csv`	Spreadsheet, diff, manual review
Parquet	`--format parquet`	Columnar analytics, large instances
BigQuery	`--format bq`	Production governance pipelines

Adding a new sink = one writer class subclassing output.Writer.
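
The shape of such a sink might look like the sketch below. The real interface of output.Writer is not documented here, so the `Writer` base class in this example is a hypothetical stand-in with a single `write(rows)` method, and `TsvWriter` is an invented sink; check the package source for the actual method names before subclassing.

```python
import csv
import io
from abc import ABC, abstractmethod


class Writer(ABC):
    """Hypothetical stand-in for the package's output.Writer base class."""

    @abstractmethod
    def write(self, rows):
        """Write an iterable of row dicts and return how many were written."""


class TsvWriter(Writer):
    """Example sink: tab-separated values written to any text stream."""

    def __init__(self, stream):
        self.stream = stream

    def write(self, rows):
        rows = list(rows)
        if not rows:
            return 0
        # Column order is taken from the first row in this sketch.
        writer = csv.DictWriter(self.stream, fieldnames=list(rows[0]),
                                delimiter="\t", lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return len(rows)


buffer = io.StringIO()
written = TsvWriter(buffer).write([{"field_name": "users.email", "times_used": 1234}])
print(written)            # 1
print(buffer.getvalue())
```

Keeping the sink behind one small method is what lets the extraction code stay unaware of where rows end up.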

Architecture

Three-layer codegen surface:

swagger.json (Looker owns) ---> _swagger/types.py (input parsers, extra="allow")
manifest/fields.yaml (you own) ---> _fieldrecord/types.py (output records, extra="forbid")
                              ---> projection.project_field (runtime mapper)

The three-extra-policy invariant:

Layer	Module	Pydantic policy	Why
Input	`_swagger/types.py`	`extra="allow"`	forward-compat with Looker API additions
Config	`manifest/schema.py`	`extra="allow"`	forward-compat with new manifest sections
Output	`_fieldrecord/types.py`	`extra="forbid"`	strict contract for downstream consumers

Client overrides flow through XDG cache + dynamic import: edit YAML, run regen-types, next program startup loads your contract instead of the bundled one. No site-packages write needed.
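
The cache-plus-dynamic-import flow can be approximated with the standard library's importlib. This is a sketch under assumptions: the cache path is taken from the comment in the manifest section, while `load_module_from_path`, `load_field_record`, and `BundledFieldRecord` are hypothetical names, and the package's own loader may differ.

```python
import importlib.util
import tempfile
from pathlib import Path

# Path taken from the regen-types comment earlier in this README.
CACHE_TYPES = Path.home() / ".cache" / "looker-fields" / "_fieldrecord" / "types.py"


def load_module_from_path(path, module_name="custom_fieldrecord_types"):
    """Import a Python file by path without it being on sys.path."""
    spec = importlib.util.spec_from_file_location(module_name, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def load_field_record(bundled_cls, cache_path=CACHE_TYPES):
    """Prefer the regenerated class in the cache; fall back to the bundled one."""
    path = Path(cache_path)
    if path.is_file():
        return load_module_from_path(path).FieldRecord
    return bundled_cls


class BundledFieldRecord:
    """Stand-in for the class shipped inside the package."""
    source = "bundled"


# Demonstrate both branches with a temporary file instead of the real cache.
with tempfile.TemporaryDirectory() as tmp:
    override = Path(tmp) / "types.py"
    before = load_field_record(BundledFieldRecord, cache_path=override)
    override.write_text('class FieldRecord:\n    source = "override"\n')
    after = load_field_record(BundledFieldRecord, cache_path=override)

print(before.source, after.source)  # bundled override
```

Because the override lives in a user-writable cache directory, nothing under site-packages has to change, and deleting the file restores the bundled class.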

Roadmap

This is Fields v1 of a multi-entity framework. The same manifest-native pattern will land for:

Models (v2) — model-level metadata + project lineage
Explores (v3) — explore graphs + join semantics
Looks / Dashboards (v4-v5) — saved-query metadata + dashboard composition

Contributing

# Run the full suite (43 tests)
pytest tests/ -v

# Regenerate the bundled manifest after editing docs/FIELD_SPEC.md
python scripts/parse_field_spec_to_manifest.py

# Regenerate the bundled FieldRecord after editing the manifest
python scripts/regen_fieldrecord.py

PRs welcome. The codebase is intentionally small (~2K LOC) and aggressively unit-tested. Adding a column = YAML edit + one regen + commit; adding a sink = one writer class.

License

Apache 2.0

Project details

Release history

This version

0.2.1

May 22, 2026

0.2.0

May 22, 2026

0.1.1

May 22, 2026

Download files

Source Distribution

looker_fields-0.2.1.tar.gz (282.8 kB view details)

Uploaded May 22, 2026 Source

Built Distribution

looker_fields-0.2.1-py3-none-any.whl (52.0 kB view details)

Uploaded May 22, 2026 Python 3

File details

Details for the file looker_fields-0.2.1.tar.gz.

File metadata

Download URL: looker_fields-0.2.1.tar.gz
Upload date: May 22, 2026
Size: 282.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for looker_fields-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`a6db1ac9a704b90008b82cdf98f634288eed156bcf5d6632ccd2a2fb9a3bb549`
MD5	`0cc350f9f6ee3044d906f723e1b2bafc`
BLAKE2b-256	`f45d90ad19e8546785aec8c4606a4e364228d6353916505942a10aedbf78e716`

Provenance

The following attestation bundles were made for looker_fields-0.2.1.tar.gz:

Publisher: release.yaml on luutuankiet/looker-fields-extraction

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: looker_fields-0.2.1.tar.gz
- Subject digest: a6db1ac9a704b90008b82cdf98f634288eed156bcf5d6632ccd2a2fb9a3bb549
- Sigstore transparency entry: 1608932618
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: luutuankiet/looker-fields-extraction@6c15f5f9188a5f13cd60a784df609c86b214eede
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/luutuankiet
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@6c15f5f9188a5f13cd60a784df609c86b214eede
- Trigger Event: push

File details

Details for the file looker_fields-0.2.1-py3-none-any.whl.

File metadata

Download URL: looker_fields-0.2.1-py3-none-any.whl
Upload date: May 22, 2026
Size: 52.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for looker_fields-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`045b5428f624b8321cbe94b6f0d3e5531560bfcbe71f1d5ba9d3719c081c9d6a`
MD5	`15d20ed74563f9e4b9b5b128b450094d`
BLAKE2b-256	`7a3220c487014dbf680d8446e702b8dc9e79f0206178a51fad40974f5919bc03`

Provenance

The following attestation bundles were made for looker_fields-0.2.1-py3-none-any.whl:

Publisher: release.yaml on luutuankiet/looker-fields-extraction

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: looker_fields-0.2.1-py3-none-any.whl
- Subject digest: 045b5428f624b8321cbe94b6f0d3e5531560bfcbe71f1d5ba9d3719c081c9d6a
- Sigstore transparency entry: 1608932708
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: luutuankiet/looker-fields-extraction@6c15f5f9188a5f13cd60a784df609c86b214eede
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/luutuankiet
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@6c15f5f9188a5f13cd60a784df609c86b214eede
- Trigger Event: push
