
Citation pipeline for CDTM trend seminars


Spring 26 — Citation Pipeline

Extract, validate, deduplicate, and replace inline citations across .docx trend-phase files.

Pipeline

flowchart TD
    A["FILE_ORDER (*.docx present)"] --> B["Extract + clean"]
    B --> C["Validate (AMA 11th)"]
    C --> D["Incremental dedup (check registry first)"]
    D --> E["Save CSV / JSON / dedup_map"]
    E --> F{"Ready?"}
    F -->|yes / FORCE| G["Replace inline citations"]
    F -->|no| H["Fix issues → re-run"]
    G --> I["output/*.docx"]

Installation

pip install cdtm-tstools

Or for local development:

uv venv
uv sync

Usage

Run from a directory containing a data/ folder with your .docx files:

# Basic run (validate + deduplicate, no replacement)
tstools

# Point to a different data folder
tstools --data-dir data/fall26

# Enable inline citation replacement
tstools --replace

# Force replacement even with unresolved issues
tstools --replace --force

# Custom file processing order
tstools --file-order my_order.json

# All together
tstools --data-dir data/fall26 --replace --force --file-order my_order.json
Flag               Description
--data-dir PATH    Data directory (default: data/spring26)
--replace          Run inline citation replacement
--force            Proceed with replacement despite unresolved issues
--file-order PATH  JSON file with the file processing order

Equivalent module invocations: python -m tstools or python -m tstools.main.

File order

FILE_ORDER in tstools/__init__.py defines the default processing order. Listed files need not exist yet; only files that are present on disk are processed. Any file on disk that is not listed in FILE_ORDER is appended at the end alphabetically.

Override it from outside in two ways:

1. CLI flag — pass --file-order path/to/file_order.json

2. Auto-detected — place a file_order.json in your data directory:

[
    "E-Human-AI Teams-Intro.docx",
    "E-Human-AI Teams-2.docx",
    "T-Society-1.docx"
]

If neither is provided, the built-in list from tstools/__init__.py is used.
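
A minimal sketch of this resolution order in plain Python (function names such as `effective_file_order` and `resolve_order` are illustrative, not the package's actual API):

```python
import json
from pathlib import Path

# Stand-in for the built-in list in tstools/__init__.py.
BUILT_IN_ORDER = ["E-Human-AI Teams-Intro.docx", "E-Human-AI Teams-2.docx",
                  "T-Society-1.docx"]

def effective_file_order(present, order):
    """Listed files that exist on disk come first, in list order; any
    unlisted on-disk files are appended alphabetically."""
    listed = [f for f in order if f in present]
    extras = sorted(f for f in present if f not in order)
    return listed + extras

def resolve_order(data_dir, cli_order_path=None):
    """The CLI flag beats an auto-detected file_order.json in the data
    directory, which beats the built-in list."""
    data_dir = Path(data_dir)
    auto = data_dir / "file_order.json"
    if cli_order_path:
        order = json.loads(Path(cli_order_path).read_text())
    elif auto.exists():
        order = json.loads(auto.read_text())
    else:
        order = BUILT_IN_ORDER
    present = sorted(p.name for p in data_dir.glob("*.docx"))
    return effective_file_order(present, order)
```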

Incremental deduplication

dedup_map.json is a persistent registry — unique IDs are permanent once assigned.

On each run:

  • Citations already in the registry are carried forward unchanged.
  • New citations are matched against the full registry (URL/DOI → exact text → title), then against each other.
  • Registry matches reuse the existing unique ID; genuinely new citations get the next available number.
  • Existing numbers are never reassigned.
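
A simplified sketch of the matching cascade, assuming citations and registry entries are plain dicts (the real schema of dedup_map.json may differ):

```python
import re

def normalize(text):
    """Collapse whitespace and lowercase for comparison."""
    return re.sub(r"\s+", " ", text).strip().lower()

def assign_unique_id(citation, registry):
    """Match a new citation against the registry: URL/DOI first, then
    exact text, then title; otherwise hand out the next free number.
    Existing IDs are never touched."""
    for entry in registry:
        if citation.get("url") and citation["url"] == entry.get("url"):
            return entry["unique_id"]
    for entry in registry:
        if normalize(citation["text"]) == normalize(entry["text"]):
            return entry["unique_id"]
    for entry in registry:
        if citation.get("title") and entry.get("title") and \
                normalize(citation["title"]) == normalize(entry["title"]):
            return entry["unique_id"]
    return max((e["unique_id"] for e in registry), default=0) + 1
```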

Human review

Fuzzy matches (≥ 0.70 similarity) are flagged review_needed in the map. After reviewing:

Decision              Edit in dedup_map.json
Confirmed duplicate   Set duplicate_of and match_type: "manual"
Confirmed distinct    Set review_flag: "confirmed_distinct"
Manual ID assignment  Set unique_id to the desired number and match_type: "manual"

The pipeline never overwrites entries that are already in the registry. A warning is printed at load time if two canonical entries share the same unique_id.

Outputs

File              Contents
citations.csv     All citations + validation issues + dedup metadata
bibliography.csv  UniqueID → Citation → SourceIDs
dedup_map.json    Persistent registry (append-only)
output/issues.md  Human work queue — validation issues by file
output/*.docx     Inline citations replaced, References section removed

Validation (AMA 11th ed.)

Runs on every present file each run. Issues appear in output/issues.md until fixed in the source .docx.

#  Check                           Code
—  Bare URL                        url_only
1  Author                          missing_author
2  Title                           missing_title
3  Source                          missing_source
4  Year                            missing_year
5  Locator (DOI / URL / vol-page)  missing_locator
6  Accessed date when URL, no DOI  missing_accessed
7  No accessed date when DOI       unnecessary_accessed
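
Checks 6 and 7 can be sketched with a pair of regexes (hypothetical patterns, not the ones validator.py actually uses):

```python
import re

DOI_RE = re.compile(r"doi\.org/|(?:^|\s)doi:\s*10\.", re.I)
URL_RE = re.compile(r"https?://", re.I)
ACCESSED_RE = re.compile(r"\bAccessed\b", re.I)

def accessed_date_issues(citation_text):
    """A URL without a DOI needs an accessed date (check 6); a DOI makes
    an accessed date unnecessary (check 7)."""
    has_doi = bool(DOI_RE.search(citation_text))
    has_bare_url = bool(URL_RE.search(citation_text)) and not has_doi
    has_accessed = bool(ACCESSED_RE.search(citation_text))
    issues = []
    if has_bare_url and not has_accessed:
        issues.append("missing_accessed")
    if has_doi and has_accessed:
        issues.append("unnecessary_accessed")
    return issues
```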

Deduplication phases

  1. URL / DOI — same locator → definitive duplicate
  2. Exact text — normalised match → definitive duplicate
  3. Title — normalised title segment match → definitive duplicate
  4. Fuzzy ≥ 0.70 — flagged review_needed, not auto-merged
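
The fuzzy phase can be sketched with the standard library's SequenceMatcher (a plausible stand-in for whatever similarity measure deduplicator.py actually uses):

```python
from difflib import SequenceMatcher

def fuzzy_review_flag(a, b, threshold=0.70):
    """Phase 4: similarity at or above the threshold flags the pair for
    human review; it is never auto-merged."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return "review_needed" if ratio >= threshold else None
```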

Inline citation replacement

Input formats supported (from source .docx):

Format                Example input  Output
Brackets — single     [1]            [42]
Brackets — list       [1,2]          [18,27]
Brackets — range      [1-4]          [42-45]
Superscript — single                 [42]
Superscript — list    ²˒³            [18,27]

All output is plain square brackets. Superscript formatting is removed. Punctuation position is preserved ([1]. → [42].).
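
The bracket cases can be sketched as a single regex substitution (a simplified stand-in for inline_replacer.py; superscript handling is omitted, and ranges are assumed to map to contiguous unique IDs, as in [1-4] → [42-45]):

```python
import re

# Same shape as the pipeline's bracket pattern.
BRACKET_RE = re.compile(r"\[(\d[\d,\s\-]*)\]")

def replace_brackets(text, local_to_uid):
    """Rewrite local bracket citations using a local-number → unique-ID map."""
    def sub(match):
        parts = []
        for piece in match.group(1).split(","):
            piece = piece.strip()
            if "-" in piece:
                lo, hi = piece.split("-")
                parts.append(f"{local_to_uid[int(lo)]}-{local_to_uid[int(hi)]}")
            else:
                parts.append(str(local_to_uid[int(piece)]))
        return "[" + ",".join(parts) + "]"
    return BRACKET_RE.sub(sub, text)
```

Because the substitution only touches the bracketed span, surrounding punctuation is preserved automatically.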

File naming

E-Human-AI Teams-2.docx  →  E-Human-AI Teams-2-1, E-Human-AI Teams-2-2, …
T-Society-1.docx         →  T-Society-1-1, T-Society-1-2, …

Run tracking and freezing

Each pipeline run is logged in data/<semester>/logs/run_log.json. After a run, you must verify the outputs and set the verified flags to true in run_log.json. The next run then automatically marks the previous entry as frozen.

Flag                          What to check
verified.duplicates           Review output/duplicates.md
verified.not_used             Review output/not_used.md
verified.inline_substitution  Spot-check output/*.docx files

Frozen runs protect UIDs — citations from frozen runs keep their assigned numbers forever, and the pipeline warns if their bibliography entries are modified. Unfrozen UIDs are reassigned in body-text order on every run, so UIDs stay contiguous as long as nothing is frozen.

Use --skip-gate to bypass the verification gate during iterative development.

Known bugs fixed (v0.2)

1. UIDs assigned in reference-list order instead of body-text order

Symptom: The first citation in the body text (e.g., [23]) got a high UID like 75, while [1] (which appears late in the text) got UID 61.

Root cause: deduplicate() assigns temporary UIDs in reference-list order (the order entries appear in the References section). reorder_unique_ids() was then pre-seeding from dedup_results, which already had every UID filled in — making the body-text reorder a complete no-op.

Fix: reorder_unique_ids() now only pre-seeds UIDs from frozen runs (via frozen_uids parameter). All other UIDs are discarded and reassigned in the order citations first appear in body text across files in FILE_ORDER.
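
The corrected behaviour can be sketched as follows (simplified: the real reorder_unique_ids() works on citation records, not bare keys):

```python
def reorder_unique_ids(body_order, frozen_uids):
    """Assign contiguous UIDs in body-text order, keeping frozen UIDs sticky.

    body_order:  citation keys in order of first appearance in body text.
    frozen_uids: key -> UID mapping for citations from frozen runs.
    """
    assigned = dict(frozen_uids)          # only frozen UIDs are pre-seeded
    used = set(assigned.values())
    next_uid = 1
    for key in body_order:
        if key in assigned:
            continue                      # frozen: keep its number forever
        while next_uid in used:           # skip numbers frozen runs occupy
            next_uid += 1
        assigned[key] = next_uid
        used.add(next_uid)
    return assigned
```

With nothing frozen, every run yields contiguous UIDs in body-text order, which is also what closes the gap described in bug #2.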

2. UID gaps after merging duplicates

Symptom: After manually flagging a citation as a duplicate (e.g., merging UID 40 into UID 1), UIDs went 1–39, 41–84 with a gap at 40.

Root cause: Same as bug #1 — all previous UIDs were treated as sticky regardless of frozen status. UIDs 41–84 were preserved at their old values instead of shifting down.

Fix: Only UIDs from frozen runs are sticky. When nothing is frozen, every re-run produces contiguous UIDs (1, 2, 3, ...) with no gaps.

3. Split-run bracket citations not detected or replaced

Symptom: Some inline citations like [5,10] or [11,12] were silently skipped during both body-text scanning and replacement. The output .docx still contained unreplaced local numbers.

Root cause: Word internally splits text into runs (formatting spans). A single [5,10] can be stored as three separate runs: [5, / 10 / ]. The pipeline's regex (\[(\d[\d,\s\-]*)\]) scans one run at a time and needs the full bracket pattern in a single string. Split brackets never matched.

Fix: Added _collapse_split_brackets() — a state-machine pre-pass that walks paragraph runs, detects incomplete bracket patterns at run boundaries, and merges them into single runs before replacement. Handles patterns like:

  • [5, / 10 / ] → [5,10]
  • [11 / ,12]. → [11,12].
  • [ / 10 / ] → [10]
  • Chained splits where one merge exposes another (e.g., ]...text [4, / 9 / , / 15 / ].)
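
The pre-pass can be sketched on plain strings standing in for python-docx run texts (a simplified model; the real _collapse_split_brackets() must also merge the underlying run objects and their formatting):

```python
def collapse_split_brackets(runs):
    """Merge consecutive run texts so every bracket citation lives in one
    string. `runs` is a list of strings, one per formatting run."""
    merged, buffer = [], ""
    for text in runs:
        if buffer:                                   # inside an unclosed bracket
            buffer += text
            if "]" in buffer:                        # bracket finally closed
                merged.append(buffer)
                buffer = ""
            continue
        start = text.rfind("[")
        if start != -1 and "]" not in text[start:]:  # bracket opens but never closes
            merged.append(text[:start])
            buffer = text[start:]
        else:
            merged.append(text)
    if buffer:
        merged.append(buffer)                        # unclosed at paragraph end
    return [t for t in merged if t]
```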

4. DOIs not extracted when on a separate line

Symptom: Citations with a DOI on its own line (after a line break within the same reference entry) were flagged missing_locator even though the DOI was present.

Root cause: The extraction logic treated each line independently. When a citation's text ended on one line and the DOI started on the next, the DOI line was discarded as a non-citation line.

Fix: extract_citations_from_file() now merges continuation lines matching BARE_DOI_RE (standalone DOI patterns like doi:10.xxxx/...) back into the preceding citation entry.
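
A sketch of the continuation merge (the regex is a plausible stand-in for BARE_DOI_RE, whose exact pattern lives in patterns.py):

```python
import re

# Matches a line that is nothing but a DOI (bare doi: form or doi.org URL).
BARE_DOI_RE = re.compile(
    r"^\s*(?:doi:\s*10\.\S+|https?://doi\.org/\S+)\s*$", re.I)

def merge_doi_continuations(lines):
    """Fold a standalone-DOI line back into the preceding citation line."""
    merged = []
    for line in lines:
        if merged and BARE_DOI_RE.match(line):
            merged[-1] = merged[-1].rstrip() + " " + line.strip()
        else:
            merged.append(line)
    return merged
```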

5. DOIs stripped by accessed-date cleanup

Symptom: A citation had both an "Accessed" date and a DOI (e.g., ...Accessed March 6, 2026. https://doi.org/...). The auto-fix for "unnecessary accessed date when DOI present" stripped the accessed date and the DOI/URL that followed it.

Root cause: ACCESSED_TAIL_RE matched greedily from "Accessed" through the end of the string, removing everything — including the locator.

Fix: fix_citations() now extracts and re-appends any DOI or URL locator found in the stripped tail before discarding the accessed date portion.
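
A sketch of the corrected cleanup (hypothetical patterns; the real fix_citations() applies this only to citations flagged unnecessary_accessed):

```python
import re

# Everything from "Accessed" through its terminating period, then the rest.
ACCESSED_TAIL_RE = re.compile(r"\s*Accessed [^.]+\.\s*(?P<tail>.*)$", re.I)
LOCATOR_RE = re.compile(r"(?:https?://\S+|doi:\s*10\.\S+)", re.I)

def strip_unnecessary_accessed(citation):
    """Drop the accessed date but re-append any DOI/URL found after it."""
    m = ACCESSED_TAIL_RE.search(citation)
    if not m:
        return citation
    head = citation[:m.start()].rstrip()
    locator = LOCATOR_RE.search(m.group("tail"))
    return head + (" " + locator.group(0) if locator else "")
```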

6. Double periods after inline citations

Symptom: Some replaced citations produced [42].. (two periods) instead of [42].

Root cause: When Word stored the closing ] and . in separate runs, the bracket collapse step merged them in a way that preserved the original period while the run that followed also contributed one.

Fix: The collapse logic now correctly transfers trailing punctuation from consumed runs so that no character is duplicated.

7. Body-text scan misordered split-run citations

Symptom: UIDs for a file were not sequential in body-text order. E.g., in T_Legal_5 the first citation in the text ([3, 6]) got UIDs 224/225 while a later citation ([4]) got UID 221.

Root cause: _scan_body_order() had the same split-run problem as bug #3, but in the scanning phase. It processed intact per-run brackets first (finding [4], [8], etc.), then fell back to para.text for split brackets ([3, 6]). Since [3, 6] was split across runs ([3, / 6 / ]), it was added to the order after all intact brackets — even though it appears first in the text.

Fix: _scan_body_order() now calls _collapse_split_brackets() on each paragraph's runs before scanning, exactly like replace_inline_citations() does. This merges split brackets into single runs so the per-run scan finds them in their correct text position.

Structure

tstools/
├── __init__.py              paths + FILE_ORDER
├── main.py                  orchestrator
├── patterns.py              centralized regex patterns
├── runs.py                  run tracking, verification gates, freeze logic
├── utils/
│   ├── utils.py             extract, clean, save, verify, registry load/save
│   ├── inline_replacer.py   bracket + superscript → bracket replacement
│   └── cite_from_url.py     URL → AMA citation (crawl4ai + OpenAI)
├── validation/
│   └── validator.py         AMA format checks
└── unique/
    └── deduplicator.py      four-phase incremental dedup
