
Citation pipeline for CDTM trend seminars


Spring 26 — Citation Pipeline

Extract, validate, deduplicate, and replace inline citations across .docx trend-phase files.

Pipeline

flowchart TD
    A["FILE_ORDER (*.docx present)"] --> B["Extract + clean"]
    B --> C["Validate (AMA 11th)"]
    C --> D["Incremental dedup (check registry first)"]
    D --> E["Save CSV / JSON / dedup_map"]
    E --> F{"Ready?"}
    F -->|yes / FORCE| G["Replace inline citations"]
    F -->|no| H["Fix issues → re-run"]
    G --> I["output/*.docx"]

Quick start

uv venv
uv sync
python -m tstools.main

Paths in tstools/__init__.py, flags in tstools/main.py:

DO_REPLACE   = True   # skip inline replacement when False
FORCE_OUTPUT = False  # override verification gate

File order

FILE_ORDER in tstools/__init__.py defines the canonical processing order. Files do not need to exist yet — only files that are present on disk are processed. Any file on disk not listed in FILE_ORDER is appended at the end alphabetically.

FILE_ORDER = [
    "T-Technology-1.docx",
    "E-Blah-2.docx",
    # …
]
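The ordering rule above can be sketched in a few lines. This is an illustrative reimplementation, not the pipeline's actual API; `resolve_order` is a hypothetical name.

```python
def resolve_order(file_order: list[str], on_disk: list[str]) -> list[str]:
    """Files present on disk, in FILE_ORDER order; unlisted files appended alphabetically."""
    present = set(on_disk)
    listed = [f for f in file_order if f in present]          # canonical order, skip missing
    extras = sorted(f for f in on_disk if f not in set(file_order))  # appended at the end
    return listed + extras
```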

Incremental deduplication

dedup_map.json is a persistent registry — unique IDs are permanent once assigned.

On each run:

  • Citations already in the registry are carried forward unchanged.
  • New citations are matched against the full registry (URL/DOI → exact text → title), then against each other.
  • Registry matches reuse the existing unique ID; genuinely new citations get the next available number.
  • Existing numbers are never reassigned.

Human review

Fuzzy matches (≥ 0.70 similarity) are flagged review_needed in the map. After reviewing:

Decision              Edit in dedup_map.json
Confirmed duplicate   Set duplicate_of and match_type: "manual"
Confirmed distinct    Set review_flag: "confirmed_distinct"
Manual ID assignment  Set unique_id to the desired number and match_type: "manual"

The pipeline never overwrites entries that are already in the registry. A warning is printed at load time if two canonical entries share the same unique_id.
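The load-time duplicate-ID warning could look like the following sketch, assuming registry entries carry `unique_id` and `duplicate_of` fields as described above; the function name is illustrative.

```python
from collections import Counter

def conflicting_ids(entries: list[dict]) -> list[int]:
    """Return unique_ids shared by more than one canonical (non-duplicate) entry."""
    canonical = [e for e in entries if not e.get("duplicate_of")]
    counts = Counter(e["unique_id"] for e in canonical)
    return sorted(uid for uid, n in counts.items() if n > 1)
```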

Outputs

File              Contents
citations.csv     All citations + validation issues + dedup metadata
bibliography.csv  UniqueID → Citation → SourceIDs
dedup_map.json    Persistent registry (append-only)
output/issues.md  Human work queue — validation issues by file
output/*.docx     Inline citations replaced, References section removed

Validation (AMA 11th ed.)

Validation runs on every file present on disk, on every run. Issues appear in output/issues.md until they are fixed in the source .docx.

#  Check                              Code
   Bare URL                           url_only
1  Author                             missing_author
2  Title                              missing_title
3  Source                             missing_source
4  Year                               missing_year
5  Locator (DOI / URL / vol-page)     missing_locator
6  Accessed date when URL, no DOI     missing_accessed
7  No accessed date when DOI present  unnecessary_accessed
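A hedged sketch of these checks, assuming a parsed citation is a dict with the field names below; the real validator in tstools/validation/validator.py is the source of truth.

```python
def validate(c: dict) -> list[str]:
    """Return AMA issue codes for one parsed citation."""
    issues = []
    has_doi, has_url = bool(c.get("doi")), bool(c.get("url"))
    if has_url and not any(c.get(k) for k in ("author", "title", "source", "year")):
        issues.append("url_only")                     # bare URL, nothing else parsed
    for field in ("author", "title", "source", "year"):
        if not c.get(field):
            issues.append(f"missing_{field}")
    if not (has_doi or has_url or c.get("pages")):    # DOI / URL / vol-page locator
        issues.append("missing_locator")
    if has_url and not has_doi and not c.get("accessed"):
        issues.append("missing_accessed")
    if has_doi and c.get("accessed"):
        issues.append("unnecessary_accessed")
    return issues
```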

Deduplication phases

  1. URL / DOI — same locator → definitive duplicate
  2. Exact text — normalised match → definitive duplicate
  3. Title — normalised title segment match → definitive duplicate
  4. Fuzzy ≥ 0.70 — flagged review_needed, not auto-merged
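The fuzzy phase can be illustrated with the standard library's difflib; the 0.70 cutoff matches the threshold above, while the normalisation and the rest of the plumbing are assumptions.

```python
from difflib import SequenceMatcher

def fuzzy_review_needed(a: str, b: str, threshold: float = 0.70) -> bool:
    """True when two citation texts are similar enough to flag for review, not auto-merge."""
    norm = lambda s: " ".join(s.lower().split())   # case- and whitespace-insensitive
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold
```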

Inline citation replacement

Input formats supported (from source .docx):

Format                Example input  Output
Brackets — single     [1]            [42]
Brackets — list       [1,2]          [18,27]
Brackets — range      [1-4]          [42-45]
Superscript — single  ⁴²             [42]
Superscript — list    ²˒³            [18,27]

All output is plain square brackets. Superscript formatting is removed. Punctuation position is preserved ([1]. → [42].).
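The plain-bracket case can be sketched with a single regex pass; superscript handling and the contiguity of remapped ranges are handled by the real inline_replacer.py, and this sketch assumes a range's endpoints stay contiguous after renumbering.

```python
import re

def replace_brackets(text: str, mapping: dict[int, int]) -> str:
    """Rewrite [n], [a,b], and [a-b] citations using an old→new number mapping."""
    def repl(m: re.Match) -> str:
        parts = []
        for tok in m.group(1).split(","):
            if "-" in tok:                       # range: remap both endpoints
                lo, hi = (mapping[int(x)] for x in tok.split("-"))
                parts.append(f"{lo}-{hi}")
            else:
                parts.append(str(mapping[int(tok)]))
        return "[" + ",".join(parts) + "]"
    return re.sub(r"\[([0-9,\- ]+)\]", repl, text)
```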

File naming

E-Human-AI Teams-2.docx  →  E-Human-AI Teams-2-1, E-Human-AI Teams-2-2, …
T-Society-1.docx         →  T-Society-1-1, T-Society-1-2, …
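The per-file source-ID scheme above amounts to the file stem plus a running index; `source_ids` is an illustrative name, not the pipeline's API.

```python
def source_ids(filename: str, count: int) -> list[str]:
    """Generate SourceIDs for the first `count` citations of one .docx file."""
    stem = filename.removesuffix(".docx")
    return [f"{stem}-{i}" for i in range(1, count + 1)]
```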

Structure

tstools/
├── __init__.py              paths + FILE_ORDER
├── main.py                  orchestrator
├── utils/
│   ├── utils.py             extract, clean, save, verify, registry load/save
│   ├── inline_replacer.py   bracket + superscript → bracket replacement
│   └── cite_from_url.py     URL → AMA citation (crawl4ai + OpenAI)
├── validation/
│   └── validator.py         AMA format checks
└── unique/
    └── deduplicator.py      four-phase incremental dedup
