Verify answer citations refer to supplied source ids and that cited sources actually support the claims. Python port of @mukundakatta/citation-integrity-check.

citation-integrity-check

PyPI Python License: MIT

Verify answer citations refer to supplied source ids and that cited sources actually support the claims. Zero runtime dependencies.

Python port of @mukundakatta/citation-integrity-check. The JS sibling has the original API; this README sticks to the Python surface.

Install

pip install citation-integrity-check

Usage

from citation_integrity_check import verify

sources = [
    {"id": "1", "text": "Photosynthesis converts light into chemical energy in plants."},
    {"id": "abc123", "text": "Chlorophyll absorbs red and blue wavelengths of light."},
]
answer = (
    "Plants use photosynthesis to convert light into energy [1]. "
    "Chlorophyll absorbs red and blue light [id:abc123]."
)

result = verify(answer, sources)

result.ok            # True if no missing ids and no unsupported claims
result.missing       # list[str]    -- cited ids that don't exist in sources
result.unsupported   # list[Claim]  -- sentences with no valid supporting citation
result.coverage      # float in [0, 1] -- fraction of sentences with a valid citation

Citation forms

Two markers are recognized inside the answer:

Form        Resolves to
[1]         sources[0] (1-based index) and source.id == "1"
[id:abc]    source.id == "abc"

Anything else inside brackets (like [Note]) is ignored, so stylistic prose doesn't count as a citation.
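
As a rough illustration of the two forms, marker extraction could be sketched with a single regular expression. The pattern and the extract_markers helper are hypothetical and written for this README; they are not taken from the package source, which may parse markers differently.

```python
import re

# Hypothetical pattern (an assumption, not the package's own):
# matches [12] or [id:abc123] and nothing else in brackets.
CITATION_RE = re.compile(r"\[(?:id:([^\[\]\s]+)|(\d+))\]")

def extract_markers(text):
    """Return (kind, value) tuples for each recognized citation marker."""
    markers = []
    for match in CITATION_RE.finditer(text):
        named, numeric = match.groups()
        if named is not None:
            markers.append(("id", named))       # [id:abc] form
        else:
            markers.append(("index", numeric))  # [N] form
    return markers

# [Note] matches neither form, so it is skipped.
markers = extract_markers("Plants convert light [1]. See [Note] and [id:abc123].")
```

Here markers holds one entry for [1] and one for [id:abc123]; the bracketed word [Note] produces no entry.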

How "unsupported" is decided

A sentence is unsupported when any of these is true:

  • It has no citation marker at all (reason="no_citation").
  • All cited ids are missing from sources (reason="missing_source").
  • The cited source's text doesn't share enough non-stopword tokens with the sentence (reason="insufficient_overlap").
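
The three rules above can be read as an ordered check. The sketch below is one possible reading, not the package's implementation: the function name, its parameters, and the choice to compare the best overlap score across cited sources are all assumptions made for illustration.

```python
def unsupported_reason(cited_ids, known_ids, best_overlap, threshold):
    """Return a reason string, or None if the sentence counts as supported.

    All names here are hypothetical. `best_overlap` is assumed to be the
    highest overlap score among the cited sources that exist.
    """
    if not cited_ids:
        return "no_citation"
    if all(cid not in known_ids for cid in cited_ids):
        return "missing_source"
    if best_overlap < threshold:
        return "insufficient_overlap"
    return None

# 0.3 is an arbitrary threshold chosen for the example.
reason = unsupported_reason(["9"], {"1", "abc123"}, 0.0, 0.3)
```

In this example the only cited id, "9", is not among the known ids, so the sketch reports missing_source.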

Token overlap is computed as |claim_tokens & source_tokens| / |claim_tokens|, where both token sets exclude a small built-in stopword list. The threshold is tunable:

verify(answer, sources, support_threshold=0.5)  # stricter
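
To make the ratio concrete, here is a minimal sketch of the overlap score. The tokenizer and the stopword set are assumptions made for this example; the package's built-in stopword list and tokenization rules are not shown in this README and may differ.

```python
import re

# Illustrative stopword set (an assumption, not the package's list).
STOPWORDS = {"a", "an", "and", "in", "into", "of", "the", "to", "use"}

def tokens(text):
    """Lowercase alphanumeric tokens with stopwords removed (no stemming)."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def overlap(claim, source):
    """|claim_tokens & source_tokens| / |claim_tokens|, or 0.0 for an empty claim."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokens(source)) / len(claim_tokens)

score = overlap(
    "Plants use photosynthesis to convert light into energy",
    "Photosynthesis converts light into chemical energy in plants.",
)
```

With this tokenizer the claim has five non-stopword tokens, and four of them (plants, photosynthesis, light, energy) appear in the source, so the score is 0.8. The token "convert" does not match "converts" because the sketch does no stemming.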

API differences from the JS sibling

  • Returns a CitationResult dataclass that reports unsupported claims per sentence, rather than the list of unused ids returned by the JS version.
  • Adds the [id:foo] named-citation form alongside numeric [N].
  • Adds a token-overlap check, tuned by support_threshold, to verify that the cited source actually mentions the claim.

See the JS sibling's README for the full design notes.
