# Citation pipeline for CDTM trend seminars
**Spring 26.** Extract, validate, deduplicate, and replace inline citations across `.docx` trend-phase files.
## Pipeline

```mermaid
flowchart TD
    A["FILE_ORDER (*.docx present)"] --> B["Extract + clean"]
    B --> C["Validate (AMA 11th)"]
    C --> D["Incremental dedup (check registry first)"]
    D --> E["Save CSV / JSON / dedup_map"]
    E --> F{"Ready?"}
    F -->|yes / FORCE| G["Replace inline citations"]
    F -->|no| H["Fix issues → re-run"]
    G --> I["output/*.docx"]
```
## Installation

```shell
pip install cdtm-tstools
```

Or for local development:

```shell
uv venv
uv sync
```
## Usage

Run from a directory containing a `data/` folder with your `.docx` files:

```shell
# Basic run (validate + deduplicate, no replacement)
tstools

# Point to a different data folder
tstools --data-dir data/fall26

# Enable inline citation replacement
tstools --replace

# Force replacement even with unresolved issues
tstools --replace --force

# Custom file processing order
tstools --file-order my_order.json

# All together
tstools --data-dir data/fall26 --replace --force --file-order my_order.json
```
| Flag | Description |
|---|---|
| `--data-dir PATH` | Data directory (default: `data/spring26`) |
| `--replace` | Run inline citation replacement |
| `--force` | Proceed with replacement despite unresolved issues |
| `--file-order PATH` | JSON file with the file processing order |
Equivalent module invocations: `python -m tstools` or `python -m tstools.main`.
## File order

`FILE_ORDER` in `tstools/__init__.py` defines the default processing order. Files do not need to exist yet; only files that are present on disk are processed. Any file on disk not listed in `FILE_ORDER` is appended at the end alphabetically.

Override it from outside in two ways:

1. **CLI flag**: pass `--file-order path/to/file_order.json`
2. **Auto-detected**: place a `file_order.json` in your data directory:
```json
[
  "E-Human-AI Teams-Intro.docx",
  "E-Human-AI Teams-2.docx",
  "T-Society-1.docx"
]
```
If neither is provided, the built-in list from `tstools/__init__.py` is used.
## Incremental deduplication

`dedup_map.json` is a persistent registry: unique IDs are permanent once assigned.

On each run:

- Citations already in the registry are carried forward unchanged.
- New citations are matched against the full registry (URL/DOI → exact text → title), then against each other.
- Registry matches reuse the existing unique ID; genuinely new citations get the next available number.
- Existing numbers are never reassigned.
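The registry-first assignment above can be sketched roughly as follows. This is an illustration only: `key_of`, `assign_uids`, and the field names are assumptions, not the actual `deduplicator.py` API.

```python
def key_of(citation: dict) -> str:
    """Best available matching key: URL/DOI, else exact text, else title."""
    return (citation.get("url") or citation.get("doi")
            or citation.get("text") or citation["title"])

def assign_uids(new_citations: list[dict], registry: dict[str, int]) -> dict[str, int]:
    """Match new citations against the persistent registry first;
    never reassign a number that already exists."""
    next_uid = max(registry.values(), default=0) + 1
    for cit in new_citations:
        key = key_of(cit)
        if key in registry:            # registry match: reuse the existing UID
            cit["unique_id"] = registry[key]
        else:                          # genuinely new: next available number
            registry[key] = next_uid
            cit["unique_id"] = next_uid
            next_uid += 1
    return registry
```

Because `registry` is loaded from `dedup_map.json` before each run and saved afterwards, numbers persist across runs.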
## Human review

Fuzzy matches (≥ 0.70 similarity) are flagged `review_needed` in the map. After reviewing:

| Decision | Edit in `dedup_map.json` |
|---|---|
| Confirmed duplicate | Set `duplicate_of`, `match_type: "manual"` |
| Confirmed distinct | Set `review_flag: "confirmed_distinct"` |
| Manual ID assignment | Set `unique_id` to desired number, `match_type: "manual"` |

The pipeline never overwrites entries that are already in the registry. A warning is printed at load time if two canonical entries share the same `unique_id`.
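After review, an entry in `dedup_map.json` might look like this. This is an illustrative sketch: only the `unique_id`, `duplicate_of`, `match_type`, and `review_flag` fields come from the table above; the key format and remaining schema are assumptions.

```json
{
  "Smith J. AI Teams at Work. Nature. 2025.": {
    "unique_id": 12,
    "duplicate_of": null,
    "match_type": "manual",
    "review_flag": "confirmed_distinct"
  }
}
```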
## Outputs

| File | Contents |
|---|---|
| `citations.csv` | All citations + validation issues + dedup metadata |
| `bibliography.csv` | UniqueID → Citation → SourceIDs |
| `dedup_map.json` | Persistent registry (append-only) |
| `output/issues.md` | Human work queue: validation issues by file |
| `output/*.docx` | Inline citations replaced, References section removed |
## Validation (AMA 11th ed.)

Validation runs on every file present, on every run. Issues appear in `output/issues.md` until fixed in the source `.docx`.

| # | Check | Code |
|---|---|---|
| — | Bare URL | `url_only` |
| 1 | Author | `missing_author` |
| 2 | Title | `missing_title` |
| 3 | Source | `missing_source` |
| 4 | Year | `missing_year` |
| 5 | Locator (DOI / URL / vol-page) | `missing_locator` |
| 6 | Accessed date when URL, no DOI | `missing_accessed` |
| 7 | No accessed date when DOI | `unnecessary_accessed` |
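Checks 6 and 7 mirror each other: an accessed date is required for URL-only sources and redundant when a DOI is present. A minimal sketch of that pair (the function name is an assumption; the issue codes are the ones from the table):

```python
def check_accessed(citation: str) -> list[str]:
    """AMA 11th: web citations need an accessed date unless a DOI is present."""
    issues = []
    has_doi = "doi" in citation.lower()
    has_url = "http://" in citation or "https://" in citation
    has_accessed = "Accessed" in citation
    if has_url and not has_doi and not has_accessed:
        issues.append("missing_accessed")       # check 6
    if has_doi and has_accessed:
        issues.append("unnecessary_accessed")   # check 7
    return issues
```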
## Deduplication phases

- **URL / DOI** — same locator → definitive duplicate
- **Exact text** — normalised match → definitive duplicate
- **Title** — normalised title segment match → definitive duplicate
- **Fuzzy ≥ 0.70** — flagged `review_needed`, not auto-merged
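The fuzzy phase can be approximated with the standard library. This is a sketch under the assumption of a simple sequence-similarity ratio; the actual metric in `deduplicator.py` may differ.

```python
from difflib import SequenceMatcher

def fuzzy_flag(a: str, b: str, threshold: float = 0.70) -> bool:
    """True when two citations are similar enough to need human review."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold
```

Pairs at or above the threshold are only flagged, never merged automatically, which is why the human-review step above exists.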
## Inline citation replacement

Input formats supported (from source `.docx`):

| Format | Example input | Output |
|---|---|---|
| Brackets — single | `[1]` | `[42]` |
| Brackets — list | `[1,2]` | `[18,27]` |
| Brackets — range | `[1-4]` | `[42-45]` |
| Superscript — single | ⁵ | `[42]` |
| Superscript — list | ²˒³ | `[18,27]` |

All output is plain square brackets. Superscript formatting is removed. Punctuation position is preserved (`[1].` → `[42].`).
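A simplified sketch of the bracket remapping, using the per-run regex quoted in the bug notes below; `uid_map` (local number → unique ID) and the function name are hypothetical.

```python
import re

BRACKET_RE = re.compile(r"\[(\d[\d,\s\-]*)\]")

def remap_brackets(text: str, uid_map: dict[int, int]) -> str:
    """Rewrite local citation numbers inside [..] to their unique IDs,
    preserving list commas, range dashes, and surrounding punctuation."""
    def repl(m: re.Match) -> str:
        inner = re.sub(r"\d+", lambda d: str(uid_map[int(d.group())]), m.group(1))
        return f"[{inner}]"
    return BRACKET_RE.sub(repl, text)
```

Because only the digits are rewritten, commas, dashes, and the punctuation outside the brackets survive untouched.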
## File naming

- `E-Human-AI Teams-2.docx` → `E-Human-AI Teams-2-1`, `E-Human-AI Teams-2-2`, …
- `T-Society-1.docx` → `T-Society-1-1`, `T-Society-1-2`, …
## Run tracking and freezing

Each pipeline run is logged in `data/<semester>/logs/run_log.json`. After a run, you must verify the outputs and set the `verified` flags to `true` in `run_log.json`. The next run then automatically marks the previous entry as frozen.

| Flag | What to check |
|---|---|
| `verified.duplicates` | Review `output/duplicates.md` |
| `verified.not_used` | Review `output/not_used.md` |
| `verified.inline_substitution` | Spot-check `output/*.docx` files |

Frozen runs protect UIDs: citations from frozen runs keep their assigned numbers forever, and the pipeline warns if their bibliography entries are modified. Unfrozen UIDs are reassigned in body-text order on every run, so UIDs stay contiguous as long as nothing is frozen.

Use `--skip-gate` to bypass the verification gate during iterative development.
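A `run_log.json` entry might look like this. Illustrative only: the `verified.*` flag names and the frozen behaviour come from the text above; the remaining fields are assumptions about the schema.

```json
{
  "run": 3,
  "frozen": false,
  "verified": {
    "duplicates": true,
    "not_used": true,
    "inline_substitution": false
  }
}
```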
## Known bugs fixed (v0.2)
### 1. UIDs assigned in reference-list order instead of body-text order

**Symptom:** The first citation in the body text (e.g., `[23]`) got a high UID like 75, while `[1]` (which appears late in the text) got UID 61.

**Root cause:** `deduplicate()` assigns temporary UIDs in reference-list order (the order entries appear in the References section). `reorder_unique_ids()` was then pre-seeding from `dedup_results`, which already had every UID filled in, making the body-text reorder a complete no-op.

**Fix:** `reorder_unique_ids()` now only pre-seeds UIDs from frozen runs (via the `frozen_uids` parameter). All other UIDs are discarded and reassigned in the order citations first appear in body text across files in `FILE_ORDER`.
### 2. UID gaps after merging duplicates

**Symptom:** After manually flagging a citation as a duplicate (e.g., merging UID 40 into UID 1), UIDs went 1–39, 41–84 with a gap at 40.

**Root cause:** Same as bug #1: all previous UIDs were treated as sticky regardless of frozen status. UIDs 41–84 were preserved at their old values instead of shifting down.

**Fix:** Only UIDs from frozen runs are sticky. When nothing is frozen, every re-run produces contiguous UIDs (1, 2, 3, …) with no gaps.
### 3. Split-run bracket citations not detected or replaced

**Symptom:** Some inline citations like `[5,10]` or `[11,12]` were silently skipped during both body-text scanning and replacement. The output `.docx` still contained unreplaced local numbers.

**Root cause:** Word internally splits text into runs (formatting spans). A single `[5,10]` can be stored as three separate runs: `[5,` / `10` / `]`. The pipeline's regex (`\[(\d[\d,\s\-]*)\]`) scans one run at a time and needs the full bracket pattern in a single string. Split brackets never matched.

**Fix:** Added `_collapse_split_brackets()`, a state-machine pre-pass that walks paragraph runs, detects incomplete bracket patterns at run boundaries, and merges them into single runs before replacement. Handles patterns like:

- `[5,` / `10` / `]` → `[5,10]`
- `[11` / `,12].` → `[11,12].`
- `[` / `10` / `]` → `[10]`
- Chained splits where one merge exposes another (e.g., `]...text [4,` / `9` / `,` / `15` / `].`)
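A stripped-down sketch of the collapse idea, operating on plain strings instead of python-docx runs; the real `_collapse_split_brackets()` also has to preserve formatting and works in place on the document.

```python
def collapse_split_brackets(runs: list[str]) -> list[str]:
    """Merge runs so that every '[' has its matching ']' within a single run."""
    out: list[str] = []
    for run in runs:
        # If the accumulated previous run still has an unclosed '[',
        # absorb this run into it; chained splits merge naturally.
        if out and out[-1].count("[") > out[-1].count("]"):
            out[-1] += run
        else:
            out.append(run)
    return out
```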
### 4. DOIs not extracted when on a separate line

**Symptom:** Citations with a DOI on its own line (after a line break within the same reference entry) were flagged `missing_locator` even though the DOI was present.

**Root cause:** The extraction logic treated each line independently. When a citation's text ended on one line and the DOI started on the next, the DOI line was discarded as a non-citation line.

**Fix:** `extract_citations_from_file()` now merges continuation lines matching `BARE_DOI_RE` (standalone DOI patterns like `doi:10.xxxx/...`) back into the preceding citation entry.
### 5. DOIs stripped by accessed-date cleanup

**Symptom:** A citation had both an "Accessed" date and a DOI (e.g., `...Accessed March 6, 2026. https://doi.org/...`). The auto-fix for "unnecessary accessed date when DOI present" stripped the accessed date and the DOI/URL that followed it.

**Root cause:** `ACCESSED_TAIL_RE` matched greedily from "Accessed" through the end of the string, removing everything, including the locator.

**Fix:** `fix_citations()` now extracts and re-appends any DOI or URL locator found in the stripped tail before discarding the accessed date portion.
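The safer cleanup can be sketched as follows. The two regexes here are simplified stand-ins, not the actual patterns from `tstools/patterns.py`, and the function name is hypothetical.

```python
import re

# Simplified stand-ins for the real patterns in tstools/patterns.py
ACCESSED_TAIL_RE = re.compile(r"Accessed\s+\w+\s+\d{1,2},\s+\d{4}\.?.*$")
LOCATOR_RE = re.compile(r"(https?://\S+|doi:\S+)")

def strip_accessed_keep_locator(citation: str) -> str:
    """Drop the accessed-date tail but re-append any DOI/URL found inside it."""
    m = ACCESSED_TAIL_RE.search(citation)
    if not m:
        return citation
    head = citation[:m.start()].rstrip()
    loc = LOCATOR_RE.search(m.group(0))   # rescue the locator from the tail
    if loc:
        head = f"{head} {loc.group(1)}"
    return head
```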
### 6. Double periods after inline citations

**Symptom:** Some replaced citations produced `[42]..` (two periods) instead of `[42].`.

**Root cause:** When Word stored the closing `]` and `.` in separate runs, the bracket collapse step merged them in a way that preserved the original period while the run that followed also contributed one.

**Fix:** The collapse logic now correctly transfers trailing punctuation from consumed runs so that no character is duplicated.
### 7. Body-text scan misordered split-run citations

**Symptom:** UIDs for a file were not sequential in body-text order. E.g., in `T_Legal_5` the first citation in the text (`[3, 6]`) got UIDs 224/225 while a later citation (`[4]`) got UID 221.

**Root cause:** `_scan_body_order()` had the same split-run problem as bug #3, but in the scanning phase. It processed intact per-run brackets first (finding `[4]`, `[8]`, etc.), then fell back to `para.text` for split brackets (`[3, 6]`). Since `[3, 6]` was split across runs (`[3,` / `6` / `]`), it was added to the order after all intact brackets, even though it appears first in the text.

**Fix:** `_scan_body_order()` now calls `_collapse_split_brackets()` on each paragraph's runs before scanning, exactly like `replace_inline_citations()` does. This merges split brackets into single runs so the per-run scan finds them in their correct text position.
## Structure

```text
tstools/
├── __init__.py              paths + FILE_ORDER
├── main.py                  orchestrator
├── patterns.py              centralized regex patterns
├── runs.py                  run tracking, verification gates, freeze logic
├── utils/
│   ├── utils.py             extract, clean, save, verify, registry load/save
│   ├── inline_replacer.py   bracket + superscript → bracket replacement
│   └── cite_from_url.py     URL → AMA citation (crawl4ai + OpenAI)
├── validation/
│   └── validator.py         AMA format checks
└── unique/
    └── deduplicator.py      four-phase incremental dedup
```