Verify that your sources still say what you think they say

apysource

AIs hallucinate citations. Link rot silently breaks the real ones. Silent edits change what your sources actually say.

apysource is an automated verifier: define what text you expect at which URL, and it fetches, caches, and checks that it still matches. Use it as a CI gate, a research notebook guard, or a self-correction layer for AI-generated content — the tool can verify its own output.

Install

pip install apysource

Requires Python 3.12+.

Quick start

1. Define your sources

Create sources.yaml:

sources:
  - label: "UN Charter"
    url: "https://www.un.org/en/about-us/un-charter/full-text"
    type: text/html
    fragments:
      - label: "Preamble"
        section: "Preamble"
        snippet: "to save succeeding generations from the scourge of war"
      - label: "Article 2 principles"
        section: "Article 2, paragraph 1"
        snippet: "The Organization and its Members, in pursuit of the Purposes stated in Article 1, shall act in accordance with the following Principles"

2. Check

apysource check sources.yaml

apysource fetches the page (caching it on disk), finds the section by name, and checks that your snippet appears in the result. Cached pages aren't re-fetched on subsequent runs.

======================================================================
  apysource Verification Report
======================================================================

  [PASS] Fragments: cache resolution.................. 2/2
  [PASS] Fragments: content extraction................ 2/2
  [PASS] Fragments: snippet verified.................. 2/2

  ======================================================================
  Summary: 3 PASS, 0 FAIL, 0 WARN
  EXIT CODE: 0 (all checks passed)
  ======================================================================
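The fetch, cache, and match steps described above can be pictured with a short standard-library sketch. It is an illustration under assumptions, not apysource's implementation: the names `cached_fetch` and `snippet_matches`, the SHA-256 cache file naming, and the whitespace normalization are all hypothetical, and the downloader is passed in as a plain callable so the example runs offline.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, cache_dir: Path, fetch: Callable[[str], str]) -> str:
    """Return the page text for url, calling fetch only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical cache layout: one file per URL, named by the URL's hash.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".txt"
    path = cache_dir / name
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch(url)
    path.write_text(text, encoding="utf-8")
    return text


def snippet_matches(text: str, snippet: str) -> bool:
    """Check that snippet occurs in text, ignoring whitespace differences."""
    def squash(s: str) -> str:
        return " ".join(s.split())

    return squash(snippet) in squash(text)
```

Calling `cached_fetch` twice with the same URL and cache directory invokes the downloader once; the second call reads the file written by the first.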

3. Discover

Use the locate command to see how apysource would target a snippet, then the add command to save it:

# Find where a snippet lives in a page
apysource locate "https://www.un.org/en/about-us/un-charter/full-text" \
  "to save succeeding generations from the scourge of war"

# Add it directly to your sources file
apysource add sources.yaml "https://www.un.org/en/about-us/un-charter/full-text" \
  "to save succeeding generations from the scourge of war" \
  --label "Preamble"

The locate command prints a YAML fragment that you can paste into your sources file; the add command writes it to the file for you. Use locate --ttl for Turtle output with full Web Annotation alignment.

Targeting content

apysource supports several ways to pinpoint where in a document your snippet lives:

Targetter	Key	Example	Best for
Section	`section`	`"Chapter I, Article 1"`	Structured documents (HTML, Markdown, Wikitext, RFC)
CSS selector	`selector`	`"div.content p"`	HTML pages
Line range	`lines`	`"40-41"`	Plain text, RFCs
Repo location	`location`	`"chapter:1"`	Repository modules (Gutenberg, Wikisource, etc.)

Section selectors are the most versatile — they work across HTML, Markdown, Wikitext, and RFC plain text. They support roman numeral equivalence (Chapter IV = Chapter 4), nested paths (Chapter I, Article 1, paragraph 2), and quoted titles ('The Fox and the Grapes').
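One way to picture roman numeral equivalence and nested paths is to normalize each section path before comparing. The sketch below is only an illustration: `roman_to_int` and `normalize_section` are hypothetical names, and the rules it applies (split on commas, lowercase words, convert uppercase roman numerals) are assumptions rather than apysource's actual matching logic. It does not handle quoted titles that contain commas.

```python
import re

# Matches well-formed uppercase roman numerals (I to MMMCMXCIX).
_ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")
_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(token: str) -> int:
    """Convert an uppercase roman numeral to an integer."""
    total = 0
    # Pair each symbol with its successor; a smaller symbol before a
    # larger one is subtracted (IV = 4), otherwise it is added.
    for ch, nxt in zip(token, token[1:] + " "):
        value = _VALUES[ch]
        total += -value if _VALUES.get(nxt, 0) > value else value
    return total


def normalize_section(path: str) -> list:
    """Split a section path on commas and normalize each component."""
    parts = []
    for part in path.split(","):
        words = []
        for word in part.split():
            if _ROMAN.fullmatch(word):
                words.append(str(roman_to_int(word)))
            else:
                words.append(word.lower())
        parts.append(" ".join(words))
    return parts
```

With this normalization, `normalize_section("Chapter IV")` and `normalize_section("Chapter 4")` produce the same list, so the two spellings compare equal.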

CSS selectors target HTML elements directly. Useful when section headings aren't available or you need a specific element.

Line ranges extract by line number (1-based, inclusive). Useful for plain text and RFCs.
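The 1-based, inclusive convention can be sketched in a few lines. The helper name `extract_lines` is hypothetical, and accepting a single number such as `"40"` as a one-line range is an assumption made for the example.

```python
def extract_lines(text: str, line_range: str) -> str:
    """Return the lines named by a range such as "40-41" (1-based, inclusive)."""
    start_s, _, end_s = line_range.partition("-")
    start = int(start_s)
    end = int(end_s) if end_s else start  # assumption: "40" means line 40 only
    if start < 1 or end < start:
        raise ValueError(f"invalid line range: {line_range!r}")
    lines = text.splitlines()
    # Shift the 1-based start to a 0-based index; the slice end is
    # exclusive, so using `end` unchanged keeps the last line included.
    return "\n".join(lines[start - 1:end])
```

For a four-line text `"a\nb\nc\nd"`, the range `"2-3"` selects the second and third lines.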

If no targetter is given, apysource checks the full page text for your snippet.

YAML schema

Each YAML file has a top-level sources list. Each source has nested fragments.

Source properties

Key	What it does
`label`	Name of the source (required)
`url`	URL to fetch (required)
`type`	IANA media type: `text/html`, `text/plain`, `text/markdown`, etc. Short names (`html`, `plain-text`) also accepted. Auto-detected if omitted.
`language`	Language code, RFC 5646 (metadata)
`title`	Document title (metadata)
`date`	Publication or access date (metadata)
`part_of`	Parent source label (for hierarchical sources)
`isbn`	International Standard Book Number
`doi`	Digital Object Identifier
`publisher`	Publisher name
`edition`	Edition or version
`license`	License URI

Fragment properties

Key	What it does
`label`	Name of the fragment (required)
`snippet`	The text you expect to find
`selector`	CSS selector to narrow extraction (HTML)
`lines`	Line range to extract, e.g. `30-35`
`section`	Human-readable section selector, e.g. `Chapter I, Article 1`
`location`	Repo-specific location hint (e.g. `chapter:1`)
`page_start`	Starting page number (for print sources)
`page_end`	Ending page number (for print sources)
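The required keys in the two tables above can be checked before running a verification. The following is a minimal sketch that works on an already parsed document (a plain dict shaped like sources.yaml); `find_schema_problems` is a hypothetical helper, not part of apysource, and it checks only the keys marked as required.

```python
REQUIRED_SOURCE_KEYS = ("label", "url")
REQUIRED_FRAGMENT_KEYS = ("label",)


def find_schema_problems(document: dict) -> list:
    """Return human-readable problems; an empty list means none were found."""
    sources = document.get("sources")
    if not isinstance(sources, list):
        return ["top-level 'sources' must be a list"]
    problems = []
    for i, source in enumerate(sources):
        for key in REQUIRED_SOURCE_KEYS:
            if not source.get(key):
                problems.append(f"sources[{i}]: missing required key '{key}'")
        for j, fragment in enumerate(source.get("fragments", [])):
            for key in REQUIRED_FRAGMENT_KEYS:
                if not fragment.get(key):
                    problems.append(
                        f"sources[{i}].fragments[{j}]: missing required key '{key}'"
                    )
    return problems


example = {
    "sources": [
        {
            "label": "UN Charter",
            "url": "https://www.un.org/en/about-us/un-charter/full-text",
            "fragments": [{"label": "Preamble", "section": "Preamble"}],
        }
    ]
}
print(find_schema_problems(example))
```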

CLI

apysource [-c config.toml] <command> [args...]

Command	What it does
`check [sources.yaml] [--provenance file.ttl]`	Fetch, extract, and verify all snippets
`locate <url> <snippet>`	Find a snippet in a page, show the targetter
`add <file> <url> <snippet>`	Locate a snippet and add it to a YAML file
`validate`	Check that `.ttl` files parse correctly (with optional SHACL)

Without -c, apysource uses built-in defaults (all built-in repos enabled). Pass -c config.toml to customize repos and HTTP settings (requires pip install "apysource[dev]"; the quotes keep shells such as zsh from expanding the brackets).

Pass --provenance file.ttl to the check command to write a PROV-O graph recording which fragments were verified, when, and by which activity.
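For scripted use, such as a CI gate, the command line above can be assembled and run from Python. This is a sketch under assumptions: `build_check_command` and `gate` are hypothetical helpers, and treating any non-zero exit status as a failed check is inferred from the report's "EXIT CODE: 0 (all checks passed)" line rather than from documented behavior.

```python
import subprocess


def build_check_command(sources="sources.yaml", config=None, provenance=None):
    """Assemble argv following: apysource [-c config.toml] check [file] [--provenance f.ttl]."""
    command = ["apysource"]
    if config:
        command += ["-c", config]
    command += ["check", sources]
    if provenance:
        command += ["--provenance", provenance]
    return command


def gate(command) -> bool:
    """Run the command and report whether it exited with status 0."""
    # Assumption: a non-zero exit status means at least one check failed.
    return subprocess.run(command, check=False).returncode == 0
```

A CI script could call `gate(build_check_command(provenance="prov.ttl"))` and stop the pipeline when it returns `False`.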

Python API

from pathlib import Path
from apysource.yaml_input import load_yaml
from apysource.verification import run_checks, print_report
from apysource.repos import RepoRegistry

# Load the YAML sources file.
g = load_yaml(Path("sources.yaml"))
# class_uri is left as a placeholder (...) here; supply the class URI to check.
results = run_checks(g, [{"name": "Fragments", "class_uri": ..., "mode": "chain"}],
                     RepoRegistry([]))
print_report(results)

Key modules:

from apysource.resolution import resolve_chain, get_text
from apysource.verification import run_checks, print_report
from apysource.repos import BaseRepo, RepoRegistry
from apysource.graph import load_triples
from apysource.http import CachedFetcher
from apysource.yaml_input import load_yaml
from apysource.formats import detect_format, extract_content, locate_snippet

Advanced: RDF/Turtle input

For projects that already use RDF, you can define sources in Turtle instead of YAML:

@prefix sv:      <https://alganet.github.io/apysource#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix oa:      <http://www.w3.org/ns/oa#> .
@prefix schema:  <https://schema.org/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/un#> .

ex:un_charter a sv:Source ;
    rdfs:label "UN Charter" ;
    schema:url "https://www.un.org/en/about-us/un-charter/full-text" ;
    dcterms:format "text/html" .

ex:preamble a sv:Fragment ;
    rdfs:label "Preamble" ;
    oa:motivatedBy oa:identifying ;
    oa:hasTarget [
        a oa:SpecificResource ;
        oa:hasSource ex:un_charter ;
        oa:hasSelector [
            a oa:TextQuoteSelector ;
            oa:exact "to save succeeding generations from the scourge of war"
        ] ;
        oa:hasSelector [
            a sv:SectionSelector ;
            rdf:value "Preamble"
        ]
    ] .

The sv: vocabulary is intentionally minimal — it only defines classes and properties with no standard equivalent. Everything else uses standard properties directly.

The RDF path requires a TOML config file (-c) to wire up the CLI context and repos. See defaults.toml for a full template.
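To see how a YAML source entry lines up with the Turtle form, the sketch below renders the three source keys that appear in the example (`label`, `url`, `type`) as an `sv:Source` block. It is not part of apysource: `source_to_turtle` and `turtle_string` are hypothetical helpers, the key-to-property pairing is read off the example above, and the output omits the prefix declarations.

```python
# Pairing of YAML keys with the properties used in the Turtle example.
SOURCE_PROPERTIES = (
    ("label", "rdfs:label"),
    ("url", "schema:url"),
    ("type", "dcterms:format"),
)


def turtle_string(value: str) -> str:
    """Quote a value as a Turtle string literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def source_to_turtle(node: str, source: dict) -> str:
    """Render one source dict as a Turtle block for the given node name."""
    statements = [f"{node} a sv:Source"]
    for key, prop in SOURCE_PROPERTIES:
        if key in source:
            statements.append(f"    {prop} {turtle_string(source[key])}")
    return " ;\n".join(statements) + " ."


print(source_to_turtle("ex:un_charter", {
    "label": "UN Charter",
    "url": "https://www.un.org/en/about-us/un-charter/full-text",
    "type": "text/html",
}))
```

The sketch passes `type` through unchanged, so a short name such as `html` would need to be expanded to a full media type first.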

RDF properties

Standard properties used on sources:

Property	What it does
`schema:url`	The URL to fetch
`dcterms:format`	IANA media type (`text/html`, `text/plain`)
`dcterms:title`	Document title
`dcterms:issued`	Publication or access date
`dcterms:language`	Language code (RFC 5646)
`dcterms:publisher`	Publisher name
`dcterms:license`	License URI
`dcterms:isPartOf`	Hierarchical sources (chapter of a book)
`bibo:isbn`	ISBN
`bibo:doi`	DOI
`bibo:pageStart` / `bibo:pageEnd`	Page numbers

Web Annotation (OA) properties used on fragments, plus the custom section selector:

Property	What it does
`oa:hasTarget`	Links to `oa:SpecificResource` with `oa:hasSource` → Source
`oa:TextQuoteSelector` / `oa:exact`	The snippet text to verify
`oa:CssSelector` / `rdf:value`	CSS selector for HTML extraction
`sv:SectionSelector` / `rdf:value`	Human-readable section path (custom)
`oa:motivatedBy oa:identifying`	Annotation purpose

Properties unique to the sv: namespace:

Property	What it does
`sv:sourceLocation`	Opaque repo-specific location (e.g. `chapter:1`)
`sv:sourceLines`	Line range (e.g. `10-20`)
`sv:edition`	Edition or version string
`sv:verificationStatus`	`verified`, `failed`, or `pending`

Vocabulary design

The sv: namespace defines only what has no standard equivalent — 5 classes and 4 properties. Everything else uses established vocabularies directly:

Web Annotation (OA): Fragments are oa:Annotation instances. Source links, selectors, and snippet text all use native OA properties — no wrapper aliases.
Dublin Core (dcterms): Source metadata (title, date, language, format, publisher, license) uses DC terms directly.
BIBO: Bibliographic identifiers (ISBN, DOI, page numbers) use BIBO properties directly.
PROV-O: Sources are prov:Entity. Verification activities use prov:wasGeneratedBy, prov:startedAtTime, prov:endedAtTime.
SHACL: vocab/shapes.ttl validates Sources, Fragments, and Terms.

Advanced: repository modules

The generic path (CSS selectors, line ranges, section selectors) works for most web pages. For sources that need special handling — multi-page works, API-based sites, structured text formats — repository modules handle the crawling and extraction.

Built-in repos

Repo	Handles	Location format
`ArchiveRepo`	archive.org	`lines:N-M`
`GutenbergRepo`	Project Gutenberg	`chapter:N`, title match
`WikisourceRepo`	Wikisource	`section:Name`, subpage match
`WiktionaryRepo`	Wiktionary	term name, `language/section`

All built-in repos are enabled by default. Most URLs work without a specialized repo — the generic fetcher + targetters (section selectors, CSS, line ranges) handle any web page. Repos are for sources that need multi-page crawling or domain-specific extraction. To customize URL patterns or add your own repos, use a TOML config file. See defaults.toml.
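Several of the location formats in the table share a `kind:value` shape (`chapter:N`, `lines:N-M`, `section:Name`), while others are bare names. As a purely illustrative sketch, a hint could be split like this; `split_location` is a hypothetical helper, and since each repo interprets its own location strings, the real parsing may differ.

```python
from typing import Optional


def split_location(location: str) -> tuple:
    """Split a "kind:value" hint; return (None, location) for bare names."""
    kind, sep, value = location.partition(":")
    # Only treat the prefix as a kind when it is a single alphabetic word,
    # so titles that happen to contain a colon are left untouched.
    if sep and kind.isalpha():
        return kind, value
    bare: Optional[str] = None
    return bare, location
```

Under these assumptions, `split_location("lines:10-20")` yields the kind `lines` with the value `10-20`, and a Wiktionary-style hint such as `English/Noun` comes back unchanged with no kind.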

Writing a custom repo

from apysource.repos import BaseRepo

class MyRepo(BaseRepo):
    NAME = "myrepo"

    def url_to_key(self, url):
        m = self.url_pattern.search(url)
        return m.group(1) if m else None

    def resolve_location(self, location, key):
        path = self.cache_dir / key / "content.txt"
        return path if path.exists() else None

BaseRepo requires url_pattern and base_url (from TOML config). cache_dir and http_client come from the registry. Override extract_content for custom extraction logic.
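The two methods can be tried in isolation to see what they return. The sketch below is a stand-alone approximation that does not subclass BaseRepo: the `URL_PATTERN` regex and the example.org URLs are hypothetical, and `cache_dir` is passed as an argument instead of coming from the registry.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical pattern: the first capture group becomes the cache key.
URL_PATTERN = re.compile(r"https://example\.org/works/(\d+)")


def url_to_key(url: str) -> Optional[str]:
    """Return the captured work id, or None when the URL is not handled."""
    match = URL_PATTERN.search(url)
    return match.group(1) if match else None


def resolve_location(cache_dir: Path, key: str) -> Optional[Path]:
    """Return the cached content file for key, or None if it is not cached."""
    path = cache_dir / key / "content.txt"
    return path if path.exists() else None
```

Returning `None` from `url_to_key` mirrors the class above: a URL that does not match the pattern yields no key.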

Development

git clone <repo-url> && cd apysource
pip install -e ".[dev]"

make test               # run unit tests
make lint               # type checking with mypy
make coverage           # run tests with coverage
make check              # full verification gate (lint + coverage)
make compile-defaults   # regenerate _defaults.py from defaults.toml

License

ISC
