
Citation pipeline for CDTM trend seminars


Spring 26 — Citation Pipeline

Extract, validate, deduplicate, and replace inline citations across .docx trend-phase files.

Pipeline

```mermaid
flowchart TD
    A["FILE_ORDER (*.docx present)"] --> B["Extract + clean"]
    B --> C["Validate (AMA 11th)"]
    C --> D["Incremental dedup (check registry first)"]
    D --> E["Save CSV / JSON / dedup_map"]
    E --> F{"Ready?"}
    F -->|yes / FORCE| G["Replace inline citations"]
    F -->|no| H["Fix issues → re-run"]
    G --> I["output/*.docx"]
```

Quick start

```shell
uv venv
uv sync
python -m tstools.main
```

Paths are defined in tstools/__init__.py; run flags live in tstools/main.py:

```python
DO_REPLACE   = True   # skip inline replacement when False
FORCE_OUTPUT = False  # override verification gate
```
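How the two flags interact with the verification gate can be sketched as follows (the function name is illustrative, not the actual API in tstools/main.py):

```python
def should_replace(do_replace: bool, force_output: bool, verified: bool) -> bool:
    """Inline replacement runs only when enabled AND either the
    verification gate passed or FORCE_OUTPUT overrides it."""
    return do_replace and (verified or force_output)
```

So `DO_REPLACE = False` always skips replacement, and `FORCE_OUTPUT = True` only matters when verification failed.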

File order

FILE_ORDER in tstools/__init__.py defines the canonical processing order. Listed files do not need to exist yet; only files present on disk are processed. Any on-disk file not listed in FILE_ORDER is appended at the end, sorted alphabetically.

```python
FILE_ORDER = [
    "T-Technology-1.docx",
    "E-Blah-2.docx",
    # …
]
```
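The ordering rule can be sketched as a small helper (illustrative only, not the actual implementation):

```python
def processing_order(file_order: list[str], present: list[str]) -> list[str]:
    """Files in FILE_ORDER keep their canonical position; listed files
    missing from disk are skipped; unlisted on-disk files are appended
    alphabetically."""
    present_set = set(present)
    ordered = [f for f in file_order if f in present_set]
    extras = sorted(present_set - set(file_order))
    return ordered + extras
```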

Incremental deduplication

dedup_map.json is a persistent registry — unique IDs are permanent once assigned.

On each run:

  • Citations already in the registry are carried forward unchanged.
  • New citations are matched against the full registry (URL/DOI → exact text → title), then against each other.
  • Registry matches reuse the existing unique ID; genuinely new citations get the next available number.
  • Existing numbers are never reassigned.
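A simplified sketch of the ID-assignment rule (field names are assumptions; the real matcher also normalises text and handles fuzzy cases):

```python
def assign_id(citation: dict, registry: dict) -> int:
    """Reuse an existing unique ID on a match; otherwise take the next
    free number. Phases mirror the pipeline: URL/DOI, exact text, title."""
    for key in ("url", "doi", "text", "title"):
        value = citation.get(key)
        if not value:
            continue
        for entry in registry.values():
            if entry.get(key) == value:
                return entry["unique_id"]
    used = {e["unique_id"] for e in registry.values()}
    return max(used, default=0) + 1
```

Because new IDs are always allocated above the current maximum, numbers already in the registry are never reassigned.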

Human review

Fuzzy matches (≥ 0.70 similarity) are flagged review_needed in the map. After reviewing:

| Decision | Edit in dedup_map.json |
| --- | --- |
| Confirmed duplicate | Set duplicate_of, match_type: "manual" |
| Confirmed distinct | Set review_flag: "confirmed_distinct" |
| Manual ID assignment | Set unique_id to desired number, match_type: "manual" |

The pipeline never overwrites entries that are already in the registry. A warning is printed at load time if two canonical entries share the same unique_id.
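The duplicate-ID warning can be sketched as follows (illustrative; it assumes entries without duplicate_of are canonical, which matches the map layout described above):

```python
from collections import Counter

def duplicate_unique_ids(registry: dict) -> list[int]:
    """Return unique IDs shared by more than one canonical entry,
    i.e. the cases the pipeline warns about at load time."""
    counts = Counter(
        e["unique_id"] for e in registry.values() if not e.get("duplicate_of")
    )
    return sorted(i for i, n in counts.items() if n > 1)
```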

Outputs

| File | Contents |
| --- | --- |
| citations.csv | All citations + validation issues + dedup metadata |
| bibliography.csv | UniqueID → Citation → SourceIDs |
| dedup_map.json | Persistent registry (append-only) |
| output/issues.md | Human work queue: validation issues by file |
| output/*.docx | Inline citations replaced, References section removed |

Validation (AMA 11th ed.)

Every file present on disk is validated on each run. Issues appear in output/issues.md until they are fixed in the source .docx.

| # | Check | Code |
| --- | --- | --- |
| – | Bare URL | url_only |
| 1 | Author | missing_author |
| 2 | Title | missing_title |
| 3 | Source | missing_source |
| 4 | Year | missing_year |
| 5 | Locator (DOI / URL / vol-page) | missing_locator |
| 6 | Accessed date when URL, no DOI | missing_accessed |
| 7 | No accessed date when DOI | unnecessary_accessed |
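Checks 6 and 7 are two sides of the same AMA rule; a sketch (field names are assumptions, not the validator's actual schema):

```python
def accessed_date_issues(citation: dict) -> list[str]:
    """AMA 11th: web citations need an accessed date unless a DOI is
    present, in which case an accessed date should be omitted."""
    issues = []
    has_url = bool(citation.get("url"))
    has_doi = bool(citation.get("doi"))
    if has_url and not has_doi and not citation.get("accessed"):
        issues.append("missing_accessed")
    if has_doi and citation.get("accessed"):
        issues.append("unnecessary_accessed")
    return issues
```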

Deduplication phases

  1. URL / DOI — same locator → definitive duplicate
  2. Exact text — normalised match → definitive duplicate
  3. Title — normalised title segment match → definitive duplicate
  4. Fuzzy ≥ 0.70 — flagged review_needed, not auto-merged
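Phase 4 can be sketched with a standard similarity ratio (difflib is an assumption here; the pipeline may use a different measure, but the flag-don't-merge behaviour is the point):

```python
from difflib import SequenceMatcher

def needs_review(a: str, b: str, threshold: float = 0.70) -> bool:
    """Phase 4: whitespace/case-normalised similarity at or above the
    threshold flags the pair review_needed; it is never auto-merged."""
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold
```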

Inline citation replacement

Input formats supported (from source .docx):

| Format | Example input | Output |
| --- | --- | --- |
| Brackets, single | [1] | [42] |
| Brackets, list | [1,2] | [18,27] |
| Brackets, range | [1-4] | [42-45] |
| Superscript, single | ⁴² | [42] |
| Superscript, list | ²˒³ | [18,27] |

All output is plain square brackets. Superscript formatting is removed. Punctuation position is preserved (e.g. [1]. → [42].).
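The bracket-form rewriting can be sketched as a regex substitution (illustrative; the real replacer works on .docx runs, and superscript handling is omitted here):

```python
import re

_CITE = re.compile(r"\[(\d+-\d+|\d+(?:,\d+)*)\]")

def replace_brackets(text: str, id_map: dict[int, int]) -> str:
    """Rewrite [1], [1,2], and [1-4] through an old → new ID mapping."""
    def repl(m: re.Match) -> str:
        body = m.group(1)
        if "-" in body:
            lo, hi = (int(x) for x in body.split("-"))
            nums = [id_map[i] for i in range(lo, hi + 1)]
            # Keep range form only if the new IDs are still consecutive.
            if nums == list(range(nums[0], nums[0] + len(nums))):
                return f"[{nums[0]}-{nums[-1]}]"
        else:
            nums = [id_map[int(x)] for x in body.split(",")]
        return "[" + ",".join(str(n) for n in nums) + "]"
    return _CITE.sub(repl, text)
```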

File naming

E-Human-AI Teams-2.docx  →  E-Human-AI Teams-2-1, E-Human-AI Teams-2-2, …
T-Society-1.docx         →  T-Society-1-1, T-Society-1-2, …
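The per-file source IDs shown above follow directly from the file stem (a hypothetical helper name, shown for clarity):

```python
from pathlib import Path

def citation_source_ids(docx_name: str, count: int) -> list[str]:
    """Per-file citation IDs: <stem>-1, <stem>-2, … for `count` citations."""
    stem = Path(docx_name).stem
    return [f"{stem}-{i}" for i in range(1, count + 1)]
```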

Structure

tstools/
├── __init__.py              paths + FILE_ORDER
├── main.py                  orchestrator
├── utils/
│   ├── utils.py             extract, clean, save, verify, registry load/save
│   ├── inline_replacer.py   bracket + superscript → bracket replacement
│   └── cite_from_url.py     URL → AMA citation (crawl4ai + OpenAI)
├── validation/
│   └── validator.py         AMA format checks
└── unique/
    └── deduplicator.py      four-phase incremental dedup
