Keep a sprawling repo telling one story: deterministic codename-leak lint + semantic retrieval over a repo's prose.

Concord

Keep a sprawling repo telling one story.

Concord indexes the prose in a repository — docs, marketing copy, specs, READMEs — and lets you ask it three kinds of question:

  • Lint: "does any internal codename / retired term / banned phrase appear in a file that ships publicly?" Deterministic, exact-match, recall-complete on a known list. Runs in CI or a pre-commit hook.
  • Find: "where else do we say something like this?" Exact and semantic matches in one ranked result, so it catches paraphrases a grep would miss.
  • Read: "summarise everything we've said about X, and flag where it contradicts itself." Retrieval-first, so only the relevant passages are pulled into context instead of whole files.
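
To make the Lint question concrete, here is a minimal sketch of a deterministic exact-match scan. The `find_leaks` helper, the `Leak` record, and the sample codenames are hypothetical illustrations, not Concord's rules format or API. It assumes each banned term starts and ends with a word character, so that `\b` boundaries behave as expected.

```python
import re
from typing import Iterable, NamedTuple


class Leak(NamedTuple):
    path: str
    line: int
    term: str


def find_leaks(path: str, text: str, banned: Iterable[str]) -> list[Leak]:
    """Return every occurrence of a banned term, cited to file:line."""
    # Longest terms first so a longer phrase wins over a shorter prefix.
    terms = sorted(banned, key=len, reverse=True)
    if not terms:
        return []
    # re.escape makes each term match literally; \b avoids hits inside longer words.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    leaks = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            leaks.append(Leak(path, lineno, match.group(0)))
    return leaks


# Invented sample page and codenames, for illustration only.
page = "Welcome to Acme Pro.\nFormerly known as Project Bluebird internally.\n"
hits = find_leaks("docs/index.md", page, ["Project Bluebird", "Acme Classic"])
for hit in hits:
    print(f"{hit.path}:{hit.line}: banned term {hit.term!r}")
```

Because the scan is a plain regex over a known list, the same input always produces the same hits, which is what makes it suitable for CI or a pre-commit hook.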

Concord is computed, not generated. The lint is regex. The ranking is geometry (cosine + an elbow cutoff). The only place a language model enters is the optional final synthesis of retrieved passages — and even that step is handed only the passages Concord selected, which is where the token savings come from.
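
The "cosine + an elbow cutoff" ranking can be sketched as below. This is an assumption-laden illustration: the source does not say which elbow rule Concord uses, so the sketch uses one common heuristic (cut at the largest drop between consecutive scores). The function names and the three-dimensional toy vectors are invented for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def elbow_cutoff(scores: list[float]) -> int:
    """How many items to keep from a descending score list.

    Heuristic (an assumption, not Concord's documented rule): cut just
    before the largest drop between consecutive scores.
    """
    if len(scores) < 2:
        return len(scores)
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    return drops.index(max(drops)) + 1


def rank(query: list[float], passages: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Score every passage against the query, sort descending, keep up to the elbow."""
    scored = sorted(
        ((name, cosine(query, vec)) for name, vec in passages.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    keep = elbow_cutoff([score for _, score in scored])
    return scored[:keep]


# Toy embeddings keyed by file:line citation (invented values).
toy = {
    "pricing.md:12": [0.9, 0.1, 0.0],
    "faq.md:40": [0.8, 0.2, 0.1],
    "changelog.md:3": [0.1, 0.9, 0.2],
    "legal.md:7": [0.0, 0.2, 0.9],
}
results = rank([1.0, 0.0, 0.0], toy)
for name, score in results:
    print(f"{name}  {score:.3f}")
```

No model is called anywhere in this path: the scores are arithmetic on vectors and the cutoff is a comparison of gaps, so the result is reproducible for a fixed embedder.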

Why it exists

Two failure modes plague any repo where strategy, internal notes, and public-facing copy live side by side:

  1. Leaks — an internal codename or a retired product name slips into a published page.
  2. Drift — the same fact (a price, a policy, a product name) is stated three different ways across three files, and nobody notices.

A plain grep catches neither paraphrases nor contradictions. A vector search alone is fuzzy and misses exact strings. Concord runs both signals together.

Token efficiency

Concord earns its keep on the synthesis step: it hands a model only the passages that matter, not the whole repo. Measured on this project's own documentation (a 14,551-passage corpus), answering "find contradictory pricing information" (token counts are a chars/4 estimate):

| Approach | Tokens into context | Gives you the conflicting sentences? |
| --- | --- | --- |
| Read the whole directory | ~1,800,000 | Yes, but it won't fit most context windows, and you pay for all of it on every query. |
| graphify (concept graph) | ~1,600 | No: returns concept nodes + file pointers, zero verbatim prices. Tells you what relates to pricing, not where the numbers disagree; you still have to open the files. |
| Concord (passage retrieval) | ~190 | Yes: the actual price statements, cited to file:line. |

graphify and Concord are complementary, not competitors: graphify maps how concepts connect; Concord retrieves the verbatim prose where a claim lives and where it conflicts. For "show me the contradictory pricing," you need the passages — which is why graphify alone isn't enough.
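
The chars/4 estimate behind the token counts above is simple enough to sketch. Rounding up is an assumption (the source only says "chars/4"), and the sample passages are invented strings, not output from Concord.

```python
def estimate_tokens(text: str) -> int:
    """Rough token count: one token per four characters, rounded up (assumed)."""
    return -(-len(text) // 4)  # ceiling division without importing math


def context_cost(passages: list[str]) -> int:
    """Estimated tokens needed to place all passages into a model's context."""
    return sum(estimate_tokens(p) for p in passages)


# Invented stand-ins: one large file versus two short retrieved passages.
whole_file = "x" * 4000
selected = ["Pro plan is $49/month.", "The Pro tier costs $59 per month."]
print(context_cost([whole_file]), context_cost(selected))
```

The estimate is crude, but it is applied identically to every approach, so the ratios between rows are more meaningful than the absolute numbers.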

Honest caveat — completeness queries. These numbers are for targeted questions. For "find all X" sweeps (e.g. "every GDPR commitment"), a small top-k with an aggressive cutoff under-retrieves: it can return four near-identical clauses and miss the scattered rest. That's a recall-vs-tokens trade, and it's exactly where a topic/cluster index helps (see Roadmap). Concord prints what it retrieved so the gap is visible, never hidden.

Updating: only what changed

The index records the commit it was built at (.concord/meta.json) and a content-hash manifest (.concord/manifest.json). concord update re-embeds only the diff:

  • In a git repo: asks git what changed since the indexed commit (or just HEAD~1..HEAD with --last-commit, for a post-commit hook).
  • Outside git (--no-git, or a non-git folder): diffs the content-hash manifest, so a real edit re-embeds and a bare touch does not.

Either way, cost scales with the diff, not the corpus.
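
For the non-git path, a content-hash comparison can be sketched as follows. The manifest layout here (relative path mapped to a SHA-256 hex digest) and the helper names are assumptions for illustration; the actual structure of .concord/manifest.json is not described in this README.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(root: Path, pattern: str = "**/*.md") -> dict[str, str]:
    """Map each matching file (relative POSIX path) to the SHA-256 of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.glob(pattern))
        if path.is_file()
    }


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as added, changed, or removed between two manifests."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "changed": sorted(p for p in new.keys() & old.keys() if new[p] != old[p]),
        "removed": sorted(old.keys() - new.keys()),
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pricing.md").write_text("Pro is $49.\n")
    before = build_manifest(root)

    (root / "pricing.md").touch()  # modification time changes, bytes do not
    after_touch = diff_manifests(before, build_manifest(root))

    (root / "pricing.md").write_text("Pro is $59.\n")  # a real edit
    after_edit = diff_manifests(before, build_manifest(root))

print(after_touch)
print(after_edit)
```

Hashing content rather than comparing timestamps is what lets a bare touch pass through without triggering a re-embed.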

In CI — the leak guard + badge

Fail the build if a codename reaches a public file, and stamp a badge on your README:

# .github/workflows/leak-guard.yml
on: [push, pull_request]
jobs:
  leak-guard:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - uses: linnetlabs/concord@v1     # the reusable action
        with: { scope: public }

concord badge .    # -> ![Concord](https://img.shields.io/badge/concord-0%20leaks-brightgreen)
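
The badge line is a shields.io static badge, whose path takes the form label-message-color. A small sketch of how such a link could be assembled is below. The `badge_markdown` helper, the singular/plural wording, and the red color for a non-zero count are assumptions; only the zero-leak URL is taken from the example above.

```python
from urllib.parse import quote


def badge_markdown(leak_count: int) -> str:
    """Build a shields.io static-badge image link for a leak count."""
    # Assumption: singular "leak" for exactly one, plural otherwise.
    message = f"{leak_count} leak" if leak_count == 1 else f"{leak_count} leaks"
    # Assumption: any non-zero count turns the badge red.
    color = "brightgreen" if leak_count == 0 else "red"
    # Spaces in the message must be percent-encoded inside the URL path.
    url = f"https://img.shields.io/badge/concord-{quote(message)}-{color}"
    return f"![Concord]({url})"


clean = badge_markdown(0)
dirty = badge_markdown(3)
print(clean)
print(dirty)
```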

Find drift across history

concord radar .                 # value-conflict candidates (same topic, different number)
concord drift '$49'             # which commits changed this value (git pickaxe)
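
The "git pickaxe" mentioned here is git's `-S` option, which lists commits that change the number of occurrences of a string. A hedged sketch of calling it from Python follows. The helper names and the chosen output format are assumptions, and this is not necessarily how `concord drift` invokes git.

```python
import subprocess


def pickaxe_command(value: str) -> list[str]:
    """argv for `git log -S<value>`: commits that add or remove occurrences of value."""
    return ["git", "log", f"-S{value}", "--format=%h %ad %s", "--date=short"]


def commits_changing(value: str, repo: str = ".") -> list[str]:
    """Run the pickaxe search in `repo` and return one line per matching commit."""
    result = subprocess.run(
        pickaxe_command(value),
        cwd=repo,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. when repo is not a git repository
    )
    return result.stdout.splitlines()


argv = pickaxe_command("$49")
print(argv)
```

Passing the value as a list element, rather than through a shell string, means `$49` reaches git literally instead of being treated as a shell variable.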

The driver model

Concord's core is a set of deterministic primitives. Who drives the loop is pluggable:

| Driver | Surface | Relevance judge |
| --- | --- | --- |
| Human | concordai (Python CLI), live explorer (concord ui) | geometry, or your eyes |
| Agent | Claude skill / MCP server | the model |

Same engine underneath. A human sits in the seat an agent would otherwise occupy.
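
One way to picture the pluggable driver is as a relevance callback handed to a fixed loop. Everything in this sketch is hypothetical (the `Judge` type, `run_loop`, and both stand-in judges); it illustrates the shape of the idea rather than Concord's internals.

```python
from typing import Callable

Passage = tuple[str, str]  # (file:line citation, passage text)
Judge = Callable[[str, Passage], bool]


def run_loop(query: str, candidates: list[Passage], judge: Judge) -> list[Passage]:
    """The engine proposes candidates; whoever drives supplies the relevance call."""
    return [p for p in candidates if judge(query, p)]


def overlap_judge(query: str, passage: Passage) -> bool:
    """Deterministic stand-in for a 'geometry' judge: any shared word counts."""
    return bool(set(query.lower().split()) & set(passage[1].lower().split()))


def make_scripted_judge(accepted: set[str]) -> Judge:
    """Stand-in for a human or agent that approves passages by citation."""
    return lambda _query, passage: passage[0] in accepted


# Invented candidate passages.
candidates = [
    ("pricing.md:12", "Pro pricing is $49 per month"),
    ("team.md:3", "We are hiring engineers"),
]
by_geometry = run_loop("pro pricing", candidates, overlap_judge)
by_agent = run_loop("pro pricing", candidates, make_scripted_judge({"team.md:3"}))
print(by_geometry)
print(by_agent)
```

The loop itself never changes; only the judge does, which is the sense in which a human and an agent occupy the same seat.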

Install

pip install concord-ai              # lint + exact find (no ML dependencies)
pip install "concord-ai[embeddings]"  # + sentiment.ai embedder for semantic find / read

Embeddings come from sentiment.ai — its sibling package — so Concord inherits a local, auditable, provenance-tracked embedder (e5 on-device by default) rather than calling a hosted API. sentiment.ai is the only embedding backend: Concord never silently swaps in a different model, because that would make a result look the same while being incomparable.

Quickstart

concord init   .                           # scaffold rules.yaml + gitignore it and .concord/
concord lint   .                           # fail CI if a banned term reaches a public file
concord index  .                           # build the semantic index (self-ignored)
concord find   "founding-free pricing"     # exact + semantic hits, cited to file:line
concord read   "what have we said about pricing?"   # retrieve the relevant passages
concord radar  . --verify                  # find contradictions; --verify lets an LLM confirm + name the canonical value
concord resolve .                          # walk confirmed contradictions and apply the fix (interactive; --apply = auto)
concord report . --out report.html         # shareable consistency report (lint + radar)
concord drift  '$49'                       # which commits changed a value (git pickaxe)
concord topics .                           # annotated topic map (browse; --samples to name them)
concord ui     .                           # premium live explorer in your browser (search · topics · radar)

AI is optional — and it's your key

Everything core is free and deterministic: lint, find, index, topics, radar candidates, report. The optional LLM steps — radar --verify, resolve, and naming topics in the explorer — call your own API key (you pay for usage), and the tool is explicit about it everywhere (a status pill, cost tooltips, CLI notes).

  • Set any of ANTHROPIC_API_KEY (preferred — the better judge), OPENAI_API_KEY, DEEPSEEK_API_KEY, GROQ_API_KEY, MISTRAL_API_KEY, OPENROUTER_API_KEY, GEMINI_API_KEY. The explorer's ⚙ picks among the keys you actually have.
  • CONCORD_NO_LLM=1 turns AI off entirely; CONCORD_LLM=<provider> forces one.
  • No key? Everything except verify / resolve / AI-naming still works.
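
The selection rules in the list above can be sketched as a small function over the environment. Several details are assumptions and are marked as such in the comments: the lowercase provider names accepted by CONCORD_LLM, the fallback order after Anthropic, and what happens when a forced provider has no key.

```python
import os
from typing import Mapping, Optional

# ANTHROPIC_API_KEY is listed first because the text calls it preferred.
# Assumption: the remaining order simply follows the order in the list above.
KEY_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "groq": "GROQ_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def pick_provider(env: Mapping[str, str]) -> Optional[str]:
    """Return the provider to use, or None when AI features should stay off."""
    if env.get("CONCORD_NO_LLM") == "1":
        return None  # explicit opt-out wins over everything else
    available = [name for name, var in KEY_VARS.items() if env.get(var)]
    forced = env.get("CONCORD_LLM")
    if forced:
        # Assumption: forcing a provider whose key is missing yields no provider.
        return forced if forced in available else None
    return available[0] if available else None


print(pick_provider(os.environ))
```

Taking the environment as a parameter keeps the function easy to test with a plain dictionary.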

Your real ruleset stays private — enforced, not trusted. concord init copies rules.example.yaml to rules.yaml and adds rules.yaml, *.local.yaml, and .concord/ to your repo's .gitignore. The built index writes its own .concord/.gitignore too. A tool that prevents codename leaks must not leak the codenames — so it makes them uncommittable for you.
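
The ignore step described above amounts to an idempotent append to .gitignore. A minimal sketch, assuming a hypothetical `ensure_ignored` helper (not Concord's actual function) and exact-line matching of patterns:

```python
import tempfile
from pathlib import Path

PRIVATE_PATTERNS = ["rules.yaml", "*.local.yaml", ".concord/"]


def ensure_ignored(repo: Path, patterns: list[str] = PRIVATE_PATTERNS) -> list[str]:
    """Append any missing patterns to .gitignore; return the ones that were added."""
    gitignore = repo / ".gitignore"
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    present = {line.strip() for line in existing}
    missing = [p for p in patterns if p not in present]
    if missing:
        # Rewrite the file with the original lines followed by the new patterns.
        gitignore.write_text("\n".join(existing + missing) + "\n")
    return missing


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / ".gitignore").write_text("node_modules/\nrules.yaml\n")
    first = ensure_ignored(repo)   # adds only what is missing
    second = ensure_ignored(repo)  # running again changes nothing
    final_text = (repo / ".gitignore").read_text()

print(first, second)
```

Running it twice leaves the file unchanged the second time, so it is safe to call on every init.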

Status

Early scaffold. lint works today (no ML required). Semantic find / read and the benchmark harness are in progress. See eval/README.md for the benchmark design (seed-efficiency, stopping-strategy, token-efficiency).

MIT licensed. A Linnet Labs project.
