Citation pipeline for CDTM trend seminars
Spring 26 — Citation Pipeline
Extract, validate, deduplicate, and replace inline citations across .docx trend-phase files.
Pipeline
flowchart TD
A["FILE_ORDER (*.docx present)"] --> B["Extract + clean"]
B --> C["Validate (AMA 11th)"]
C --> D["Incremental dedup (check registry first)"]
D --> E["Save CSV / JSON / dedup_map"]
E --> F{"Ready?"}
F -->|yes / FORCE| G["Replace inline citations"]
F -->|no| H["Fix issues → re-run"]
G --> I["output/*.docx"]
Quick start
uv venv
uv sync
python -m tstools.main
Paths are set in tstools/__init__.py and flags in tstools/main.py:
DO_REPLACE = True # skip inline replacement when False
FORCE_OUTPUT = False # override verification gate
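The way these two flags interact with the "Ready?" gate in the flowchart can be sketched as a single predicate. This is a minimal illustration, not the code in tstools/main.py: the function name `should_replace` and the `ready` argument are hypothetical, and the sketch assumes that FORCE_OUTPUT only bypasses the verification gate and never re-enables replacement when DO_REPLACE is False.

```python
def should_replace(ready, do_replace=True, force_output=False):
    """Decide whether inline replacement runs (hypothetical helper).

    ready        -- result of the verification gate ("Ready?" in the flowchart)
    do_replace   -- mirrors DO_REPLACE
    force_output -- mirrors FORCE_OUTPUT
    """
    # Assumption: DO_REPLACE = False always skips replacement,
    # and FORCE_OUTPUT only overrides a failed verification gate.
    return do_replace and (ready or force_output)
```

With the defaults shown above (DO_REPLACE = True, FORCE_OUTPUT = False), replacement would run only once verification passes.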
File order
FILE_ORDER in tstools/__init__.py defines the canonical processing order. Listed files do not need to exist yet; only files present on disk are processed. Any file on disk that is not listed in FILE_ORDER is appended at the end in alphabetical order.
FILE_ORDER = [
"T-Technology-1.docx",
"E-Blah-2.docx",
# …
]
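The ordering rule above can be sketched as follows. The helper name `processing_order` and its arguments are hypothetical, and plain `sorted()` is an assumption about what "alphabetically" means here (it is case-sensitive).

```python
def processing_order(file_order, present):
    """Return the files to process, in order (hypothetical helper).

    file_order -- the canonical list, e.g. FILE_ORDER
    present    -- names of the .docx files actually found on disk
    """
    present = set(present)
    # Listed files keep their canonical position, but only if they exist.
    listed = [name for name in file_order if name in present]
    # Files on disk that are not listed go last, alphabetically.
    unlisted = sorted(present - set(file_order))
    return listed + unlisted
```

In practice `present` could be collected with something like `[p.name for p in Path(input_dir).glob("*.docx")]`, where `input_dir` is whatever path the package configures.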
Incremental deduplication
dedup_map.json is a persistent registry — unique IDs are permanent once assigned.
On each run:
- Citations already in the registry are carried forward unchanged.
- New citations are matched against the full registry (URL/DOI → exact text → title), then against each other.
- Registry matches reuse the existing unique ID; genuinely new citations get the next available number.
- Existing numbers are never reassigned.
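As a rough illustration of these rules, the sketch below assigns IDs against a simplified registry. It assumes the registry can be reduced to a mapping from a match key (for example a normalised DOI) to a unique ID, and that "next available number" means one more than the highest ID in use; the real dedup_map.json layout and matching logic may differ.

```python
def assign_ids(registry, new_keys):
    """Assign unique IDs to citations (hypothetical, simplified registry).

    registry -- dict mapping match key -> unique ID; updated in place
    new_keys -- match keys of the citations seen in this run
    """
    # Assumption: the next free number is max(existing) + 1.
    next_id = max(registry.values(), default=0) + 1
    assigned = {}
    for key in new_keys:
        if key not in registry:
            # Genuinely new citation: append, never renumber.
            registry[key] = next_id
            next_id += 1
        # Registry hits (including ones added earlier in this loop)
        # reuse the existing ID.
        assigned[key] = registry[key]
    return assigned
```

Because existing keys are only read and new keys are only appended, re-running with the same input leaves every previously assigned number unchanged.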
Human review
Fuzzy matches (similarity ≥ 0.70) are flagged review_needed in dedup_map.json and are not merged automatically. After reviewing, record the decision:
| Decision | Edit in dedup_map.json |
|---|---|
| Confirmed duplicate | Set duplicate_of, match_type: "manual" |
| Confirmed distinct | Set review_flag: "confirmed_distinct" |
| Manual ID assignment | Set unique_id to desired number, match_type: "manual" |
The pipeline never overwrites entries that are already in the registry. A warning is printed at load time if two canonical entries share the same unique_id.
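A load-time check of this kind might look like the sketch below. The registry layout is an assumption: each entry is taken to be a dict keyed by source ID with a `unique_id` field, and an entry counts as canonical when it has no `duplicate_of` value. The function name is hypothetical.

```python
from collections import defaultdict

def find_id_collisions(registry):
    """Return {unique_id: [source IDs]} for IDs shared by several canonical entries."""
    by_id = defaultdict(list)
    for source_id, entry in registry.items():
        # Assumption: entries marked as duplicates legitimately share
        # the ID of their canonical entry, so they are skipped.
        if entry.get("duplicate_of") is None:
            by_id[entry["unique_id"]].append(source_id)
    return {uid: ids for uid, ids in by_id.items() if len(ids) > 1}
```

A caller could print one warning per returned key, which is useful after manual edits such as a hand-assigned unique_id.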
Outputs
| File | Contents |
|---|---|
| citations.csv | All citations + validation issues + dedup metadata |
| bibliography.csv | UniqueID → Citation → SourceIDs |
| dedup_map.json | Persistent registry (append-only) |
| output/issues.md | Human work queue: validation issues by file |
| output/*.docx | Inline citations replaced, References section removed |
Validation (AMA 11th ed.)
Validation runs on every file present on disk, on every run. Issues remain in output/issues.md until they are fixed in the source .docx.
| # | Check | Code |
|---|---|---|
| — | Bare URL | url_only |
| 1 | Author | missing_author |
| 2 | Title | missing_title |
| 3 | Source | missing_source |
| 4 | Year | missing_year |
| 5 | Locator (DOI / URL / vol-page) | missing_locator |
| 6 | Accessed date when URL, no DOI | missing_accessed |
| 7 | No accessed date when DOI | unnecessary_accessed |
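Checks 6 and 7 are the two that depend on each other, so they are worth a small sketch. The regular expressions below are simplifying assumptions (a DOI is anything starting with `10.` plus a registrant code and a slash, an accessed date is the word "Accessed"); the function name is hypothetical and validator.py may detect these fields differently.

```python
import re

# Simplified patterns, assumed for illustration only.
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")
URL_RE = re.compile(r"https?://\S+")
ACCESSED_RE = re.compile(r"\bAccessed\b", re.IGNORECASE)

def accessed_date_issues(citation):
    """Return issue codes for checks 6 and 7 on one citation string."""
    has_doi = bool(DOI_RE.search(citation))
    has_url = bool(URL_RE.search(citation))
    has_accessed = bool(ACCESSED_RE.search(citation))

    issues = []
    if has_doi and has_accessed:
        # Check 7: a DOI is a stable locator, so no accessed date.
        issues.append("unnecessary_accessed")
    elif has_url and not has_doi and not has_accessed:
        # Check 6: a plain URL needs an accessed date.
        issues.append("missing_accessed")
    return issues
```

Note that a `https://doi.org/...` link counts as a DOI in this sketch, so it would trigger check 7 rather than check 6.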
Deduplication phases
- URL / DOI — same locator → definitive duplicate
- Exact text — normalised match → definitive duplicate
- Title — normalised title segment match → definitive duplicate
- Fuzzy ≥ 0.70 → flagged review_needed, not auto-merged
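The four phases form a cascade in which the first match wins. The sketch below shows that shape for a single pair of citations. The record fields (`locator`, `text`, `title`), the normalisation, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions; the README only states the 0.70 threshold, not the metric.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.70  # threshold stated above

def normalise(text):
    """Lower-case and collapse whitespace (assumed normalisation)."""
    return " ".join((text or "").lower().split())

def classify_pair(a, b):
    """Return the first phase that matches, or None if the pair looks distinct."""
    # Phase 1: same URL / DOI.
    if a.get("locator") and a.get("locator") == b.get("locator"):
        return "locator"
    # Phase 2: identical text after normalisation.
    text_a, text_b = normalise(a.get("text")), normalise(b.get("text"))
    if text_a and text_a == text_b:
        return "exact"
    # Phase 3: identical title after normalisation.
    title_a, title_b = normalise(a.get("title")), normalise(b.get("title"))
    if title_a and title_a == title_b:
        return "title"
    # Phase 4: similar enough to ask a human, but never merged here.
    if text_a and text_b:
        if SequenceMatcher(None, text_a, text_b).ratio() >= FUZZY_THRESHOLD:
            return "review_needed"
    return None
```

Only the first three results would be treated as definitive duplicates; `"review_needed"` leaves both citations in place until someone edits the registry.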
Inline citation replacement
Input formats supported (from source .docx):
| Format | Example input | Output |
|---|---|---|
| Brackets (single) | [1] | [42] |
| Brackets (list) | [1,2] | [18,27] |
| Brackets (range) | [1-4] | [42-45] |
| Superscript (single) | ⁵ | [42] |
| Superscript (list) | ²˒³ | [18,27] |
All output is plain square brackets. Superscript formatting is removed. Punctuation position is preserved ([1]. → [42].).
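For the bracket formats, the replacement can be approximated with a regular expression over plain text. This is a sketch under several assumptions: `id_map` (old number to unique ID) is a hypothetical input, runs of three or more consecutive new numbers are collapsed into a range, brackets containing an unmapped number are left untouched, and superscripts are not handled because in a .docx they are usually run formatting rather than characters.

```python
import re

# Matches [1], [1,2], [1-4] and mixes such as [1,3-5].
TOKEN = r"\d+(?:-\d+)?"
BRACKET_RE = re.compile(rf"\[({TOKEN}(?:,\s*{TOKEN})*)\]")

def expand(spec):
    """'1,3-5' -> [1, 3, 4, 5]"""
    nums = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            nums.extend(range(lo, hi + 1))
        else:
            nums.append(int(part))
    return nums

def compress(nums):
    """[42, 43, 44, 45] -> '42-45'; [18, 27] -> '18,27'"""
    nums = sorted(set(nums))
    out, i = [], 0
    while i < len(nums):
        j = i
        while j + 1 < len(nums) and nums[j + 1] == nums[j] + 1:
            j += 1
        if j - i >= 2:  # assumption: only runs of 3+ become a range
            out.append(f"{nums[i]}-{nums[j]}")
        else:
            out.extend(str(n) for n in nums[i:j + 1])
        i = j + 1
    return ",".join(out)

def replace_brackets(text, id_map):
    """Rewrite bracketed citation numbers using id_map (old -> unique ID)."""
    def repl(match):
        old = expand(match.group(1))
        if any(n not in id_map for n in old):
            return match.group(0)  # leave unknown numbers untouched
        return "[" + compress(id_map[n] for n in old) + "]"
    return BRACKET_RE.sub(repl, text)
```

Because only the bracketed span is substituted, surrounding punctuation stays where it was.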
File naming
E-Human-AI Teams-2.docx → E-Human-AI Teams-2-1, E-Human-AI Teams-2-2, …
T-Society-1.docx → T-Society-1-1, T-Society-1-2, …
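Read as a rule, the examples above suggest that each citation gets an ID made of the file name without its extension plus a 1-based index. A minimal sketch, assuming that reading and that the index follows the order of citations within the file (the function name is hypothetical):

```python
from pathlib import Path

def source_ids(filename, count):
    """Build per-file source IDs: '<stem>-1', '<stem>-2', ..."""
    stem = Path(filename).stem  # drops only the final '.docx'
    return [f"{stem}-{i}" for i in range(1, count + 1)]
```

Spaces and hyphens inside the file name are kept as they are, so the ID can always be traced back to its source file.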
Structure
tstools/
├── __init__.py paths + FILE_ORDER
├── main.py orchestrator
├── utils/
│ ├── utils.py extract, clean, save, verify, registry load/save
│ ├── inline_replacer.py bracket + superscript → bracket replacement
│ └── cite_from_url.py URL → AMA citation (crawl4ai + OpenAI)
├── validation/
│ └── validator.py AMA format checks
└── unique/
└── deduplicator.py four-phase incremental dedup