Citation pipeline for CDTM trend seminars
Spring 26 — Citation Pipeline
Extract, validate, deduplicate, and replace inline citations across .docx trend-phase files.
Pipeline
```mermaid
flowchart TD
    A["FILE_ORDER (*.docx present)"] --> B["Extract + clean"]
    B --> C["Validate (AMA 11th)"]
    C --> D["Incremental dedup (check registry first)"]
    D --> E["Save CSV / JSON / dedup_map"]
    E --> F{"Ready?"}
    F -->|yes / FORCE| G["Replace inline citations"]
    F -->|no| H["Fix issues → re-run"]
    G --> I["output/*.docx"]
```
Installation
```bash
pip install cdtm-tstools
```
Or for local development:
```bash
uv venv
uv sync
```
Usage
Run from a directory containing a data/ folder with your .docx files:
```bash
# Basic run (validate + deduplicate, no replacement)
tstools

# Point to a different data folder
tstools --data-dir data/fall26

# Enable inline citation replacement
tstools --replace

# Force replacement even with unresolved issues
tstools --replace --force

# Custom file processing order
tstools --file-order my_order.json

# All together
tstools --data-dir data/fall26 --replace --force --file-order my_order.json
```
| Flag | Description |
|---|---|
| --data-dir PATH | Data directory (default: data/spring26) |
| --replace | Run inline citation replacement |
| --force | Proceed with replacement despite unresolved issues |
| --file-order PATH | JSON file with the file processing order |
Equivalent module invocations: python -m tstools or python -m tstools.main.
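As a rough illustration of how these flags fit together, the sketch below declares them with argparse and models the "Ready?" gate from the pipeline diagram. It is not the package's actual parser: `build_parser` and `should_replace` are hypothetical names, and treating "ready" as "zero unresolved issues" is an assumption.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the real one)."""
    parser = argparse.ArgumentParser(prog="tstools")
    parser.add_argument("--data-dir", default="data/spring26", metavar="PATH",
                        help="Data directory")
    parser.add_argument("--replace", action="store_true",
                        help="Run inline citation replacement")
    parser.add_argument("--force", action="store_true",
                        help="Proceed with replacement despite unresolved issues")
    parser.add_argument("--file-order", default=None, metavar="PATH",
                        help="JSON file with the file processing order")
    return parser

def should_replace(replace: bool, force: bool, unresolved_issues: int) -> bool:
    """Assumed gate: replace only when requested, and only if clean or forced."""
    return replace and (force or unresolved_issues == 0)

args = build_parser().parse_args(["--data-dir", "data/fall26", "--replace", "--force"])
print(args.data_dir, should_replace(args.replace, args.force, unresolved_issues=3))
```

Under this reading, --force has no effect unless --replace is also given.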
File order
FILE_ORDER in tstools/__init__.py defines the default processing order. Files do not need to exist yet — only files that are present on disk are processed. Any file on disk not listed in FILE_ORDER is appended at the end alphabetically.
Override it from outside in two ways:
1. CLI flag — pass --file-order path/to/file_order.json
2. Auto-detected — place a file_order.json in your data directory:
```json
[
  "E-Human-AI Teams-Intro.docx",
  "E-Human-AI Teams-2.docx",
  "T-Society-1.docx"
]
```
If neither is provided, the built-in list from tstools/__init__.py is used.
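The resolution rules above can be sketched as follows. This is an illustration under assumptions, not the package's code: `DEFAULT_ORDER`, `load_order`, and `resolve_files` are hypothetical names, and giving the CLI path priority over an auto-detected file_order.json is an assumed precedence that the text does not state.

```python
import json
from pathlib import Path
from typing import Optional

# Stand-in for the built-in list; the real FILE_ORDER lives in tstools/__init__.py.
DEFAULT_ORDER = [
    "E-Human-AI Teams-Intro.docx",
    "E-Human-AI Teams-2.docx",
    "T-Society-1.docx",
]

def load_order(data_dir: Path, cli_path: Optional[Path] = None) -> list:
    """Assumed precedence: CLI path, then data_dir/file_order.json, then built-in."""
    for candidate in (cli_path, data_dir / "file_order.json"):
        if candidate is not None and candidate.is_file():
            return json.loads(candidate.read_text(encoding="utf-8"))
    return list(DEFAULT_ORDER)

def resolve_files(data_dir: Path, order: list) -> list:
    """Keep listed files that exist, then append unlisted .docx files alphabetically."""
    present = {p.name for p in data_dir.glob("*.docx")}
    listed = [name for name in order if name in present]
    extras = sorted(present - set(listed))
    return listed + extras
```

Calling `resolve_files(data_dir, load_order(data_dir))` then yields the processing order for one run; files named in the order but missing on disk are simply skipped.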
Incremental deduplication
dedup_map.json is a persistent registry — unique IDs are permanent once assigned.
On each run:
- Citations already in the registry are carried forward unchanged.
- New citations are matched against the full registry (URL/DOI → exact text → title), then against each other.
- Registry matches reuse the existing unique ID; genuinely new citations get the next available number.
- Existing numbers are never reassigned.
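A minimal sketch of the numbering rule, assuming the registry can be viewed as a mapping from a normalised citation key to its permanent ID. The function name `assign_ids`, the key format, and reading "next available number" as "highest existing ID plus one" are all assumptions; the real matching is richer than the exact-key lookup used here.

```python
def assign_ids(registry: dict, citations: list) -> dict:
    """Return citation key -> unique ID, extending `registry` without renumbering."""
    next_id = max(registry.values(), default=0) + 1
    assigned = {}
    for key in citations:
        if key not in registry:        # genuinely new: take the next free number
            registry[key] = next_id
            next_id += 1
        assigned[key] = registry[key]  # known keys keep their existing ID
    return assigned

registry = {"doe 2020 trends": 1, "smith 2021 ai teams": 2}
print(assign_ids(registry, ["smith 2021 ai teams", "lee 2023 society"]))
# {'smith 2021 ai teams': 2, 'lee 2023 society': 3}
```

Because IDs are only ever appended, re-running with the same inputs leaves every existing number in place.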
Human review
Fuzzy matches (≥ 0.70 similarity) are flagged review_needed in the map. After reviewing:
| Decision | Edit in dedup_map.json |
|---|---|
| Confirmed duplicate | Set duplicate_of, match_type: "manual" |
| Confirmed distinct | Set review_flag: "confirmed_distinct" |
| Manual ID assignment | Set unique_id to desired number, match_type: "manual" |
The pipeline never overwrites entries that are already in the registry. A warning is printed at load time if two canonical entries share the same unique_id.
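The load-time check could look roughly like the sketch below. The layout of dedup_map.json is not documented here, so the shape used (source ID mapped to an entry dict) is an assumption, as is treating an entry without duplicate_of as canonical; `find_id_collisions` is a hypothetical name.

```python
import warnings
from collections import defaultdict

def find_id_collisions(registry: dict) -> dict:
    """Group canonical entries by unique_id and return IDs used more than once."""
    by_id = defaultdict(list)
    for source_id, entry in registry.items():
        if entry.get("duplicate_of") is None:  # canonical entry (assumed rule)
            by_id[entry["unique_id"]].append(source_id)
    return {uid: ids for uid, ids in by_id.items() if len(ids) > 1}

registry = {
    "T-Society-1-1": {"unique_id": 7},
    "T-Society-1-2": {"unique_id": 7},  # e.g. a manual edit that reused an ID
    "T-Society-1-3": {"unique_id": 7, "duplicate_of": "T-Society-1-1"},
}
for uid, ids in find_id_collisions(registry).items():
    warnings.warn(f"unique_id {uid} is shared by canonical entries: {', '.join(ids)}")
```

Entries marked as duplicates are expected to share their canonical entry's ID, so only canonical entries are compared.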
Outputs
| File | Contents |
|---|---|
| citations.csv | All citations + validation issues + dedup metadata |
| bibliography.csv | UniqueID → Citation → SourceIDs |
| dedup_map.json | Persistent registry (append-only) |
| output/issues.md | Human work queue: validation issues by file |
| output/*.docx | Inline citations replaced, References section removed |
Validation (AMA 11th ed.)
Validation runs on every file present on disk, on every run. Issues appear in output/issues.md until they are fixed in the source .docx.
| # | Check | Code |
|---|---|---|
| — | Bare URL | url_only |
| 1 | Author | missing_author |
| 2 | Title | missing_title |
| 3 | Source | missing_source |
| 4 | Year | missing_year |
| 5 | Locator (DOI / URL / vol-page) | missing_locator |
| 6 | Accessed date present when the citation has a URL but no DOI | missing_accessed |
| 7 | No accessed date when the citation has a DOI | unnecessary_accessed |
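To make the locator-related checks concrete, here is a small sketch covering url_only, missing_accessed, and unnecessary_accessed. The issue codes come from the table above; the regular expressions and the function name `locator_issues` are assumptions and are likely simpler than what validator.py actually does.

```python
import re

# Assumed patterns: a DOI as "doi:10.x/..." or a doi.org link, any http(s) URL,
# and an AMA-style "Accessed Month Day, Year" note.
DOI_RE = re.compile(r"\b(?:doi:\s*|https?://(?:dx\.)?doi\.org/)10\.\d{4,9}/\S+", re.I)
URL_RE = re.compile(r"https?://\S+", re.I)
ACCESSED_RE = re.compile(r"\bAccessed\s+\w+\s+\d{1,2},\s+\d{4}", re.I)

def locator_issues(citation: str) -> list:
    """Return the locator-related issue codes for one citation string."""
    text = citation.strip()
    if URL_RE.fullmatch(text):
        return ["url_only"]  # nothing but a URL
    has_doi = bool(DOI_RE.search(text))
    has_url = bool(URL_RE.search(text))
    has_accessed = bool(ACCESSED_RE.search(text))
    issues = []
    if has_url and not has_doi and not has_accessed:
        issues.append("missing_accessed")
    if has_doi and has_accessed:
        issues.append("unnecessary_accessed")
    return issues

print(locator_issues("Doe J. Trends in AI. Example Press; 2023. https://example.com/report"))
# ['missing_accessed']
```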
Deduplication phases
- URL / DOI: same locator → definitive duplicate
- Exact text: normalised match → definitive duplicate
- Title: normalised title segment match → definitive duplicate
- Fuzzy ≥ 0.70: flagged review_needed, not auto-merged
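The four phases can be pictured as a cascade that stops at the first match. The sketch below is illustrative only: the similarity metric behind the 0.70 threshold is not specified, so difflib's SequenceMatcher stands in for it, and the dict keys (`text`, `title`, `locator`) and the names `normalise` and `classify` are assumptions.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

FUZZY_THRESHOLD = 0.70  # threshold from the README; the metric is an assumption

def normalise(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def classify(new: dict, known: dict) -> Optional[str]:
    """Return the first matching phase, or None when the pair looks distinct."""
    if new.get("locator") and new["locator"].lower() == (known.get("locator") or "").lower():
        return "locator"
    if normalise(new["text"]) == normalise(known["text"]):
        return "exact_text"
    if new.get("title") and normalise(new["title"]) == normalise(known.get("title") or ""):
        return "title"
    ratio = SequenceMatcher(None, normalise(new["text"]), normalise(known["text"])).ratio()
    return "review_needed" if ratio >= FUZZY_THRESHOLD else None
```

Only the first three results would merge automatically; a `review_needed` result is left for a human to decide in dedup_map.json.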
Inline citation replacement
Input formats supported (from source .docx):
| Format | Example input | Output |
|---|---|---|
| Brackets (single) | [1] | [42] |
| Brackets (list) | [1,2] | [18,27] |
| Brackets (range) | [1-4] | [42-45] |
| Superscript (single) | ⁵ | [42] |
| Superscript (list) | ²˒³ | [18,27] |
All output is plain square brackets. Superscript formatting is removed. Punctuation position is preserved ([1]. → [42].).
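For the bracket formats, the rewrite can be sketched on plain text as below. This is a simplified illustration, not inline_replacer.py: it ignores superscripts (which in a .docx are typically run formatting rather than characters), it would also match unrelated bracketed numbers, and the rule of collapsing three or more consecutive IDs into a range is an assumption. `expand`, `compress`, and `replace_markers` are hypothetical names.

```python
import re

MARKER_RE = re.compile(r"\[(\d+(?:\s*[,\-\u2013]\s*\d+)*)\]")

def expand(body: str) -> list:
    """Turn '1,3-5' into [1, 3, 4, 5]."""
    numbers = []
    for part in body.split(","):
        bounds = [int(n) for n in re.split(r"[-\u2013]", part)]
        numbers.extend(range(bounds[0], bounds[-1] + 1))
    return numbers

def compress(ids: list) -> str:
    """Render sorted unique IDs, collapsing runs of 3+ consecutive numbers."""
    ids = sorted(set(ids))
    parts, i = [], 0
    while i < len(ids):
        j = i
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        if j - i >= 2:
            parts.append(f"{ids[i]}-{ids[j]}")
        else:
            parts.extend(str(n) for n in ids[i:j + 1])
        i = j + 1
    return ",".join(parts)

def replace_markers(text: str, local_to_unique: dict) -> str:
    """Rewrite each bracket marker using a per-file local -> unique ID mapping."""
    def swap(match):
        local = expand(match.group(1))
        if any(n not in local_to_unique for n in local):
            return match.group(0)  # leave markers with unknown numbers untouched
        return "[" + compress([local_to_unique[n] for n in local]) + "]"
    return MARKER_RE.sub(swap, text)

mapping = {1: 42, 2: 43, 3: 44, 4: 45, 5: 18, 6: 27}
print(replace_markers("Seen before [1]. Also [5,6] and [1-4].", mapping))
# Seen before [42]. Also [18,27] and [42-45].
```

Since only the bracketed marker is substituted, punctuation that follows it stays where it was.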
File naming
E-Human-AI Teams-2.docx → E-Human-AI Teams-2-1, E-Human-AI Teams-2-2, …
T-Society-1.docx → T-Society-1-1, T-Society-1-2, …
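Read this way, each citation gets an ID built from the file stem plus a 1-based position. The sketch below assumes exactly that; `source_ids` is a hypothetical helper, and whether the pipeline numbers citations in order of appearance is not stated here.

```python
from pathlib import Path

def source_ids(filename: str, citation_count: int) -> list:
    """Build '<file stem>-<n>' IDs for n = 1..citation_count."""
    stem = Path(filename).stem
    return [f"{stem}-{n}" for n in range(1, citation_count + 1)]

print(source_ids("E-Human-AI Teams-2.docx", 2))
# ['E-Human-AI Teams-2-1', 'E-Human-AI Teams-2-2']
```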
Structure
```
tstools/
├── __init__.py              paths + FILE_ORDER
├── main.py                  orchestrator
├── utils/
│   ├── utils.py             extract, clean, save, verify, registry load/save
│   ├── inline_replacer.py   bracket + superscript → bracket replacement
│   └── cite_from_url.py     URL → AMA citation (crawl4ai + OpenAI)
├── validation/
│   └── validator.py         AMA format checks
└── unique/
    └── deduplicator.py      four-phase incremental dedup
```