
Citation pipeline for CDTM trend seminars


Spring 26 — Citation Pipeline

Extract, validate, deduplicate, and replace inline citations across .docx trend-phase files.

Pipeline

flowchart TD
    A["FILE_ORDER (*.docx present)"] --> B["Extract + clean"]
    B --> C["Validate (AMA 11th)"]
    C --> D["Incremental dedup (check registry first)"]
    D --> E["Save CSV / JSON / dedup_map"]
    E --> F{"Ready?"}
    F -->|yes / FORCE| G["Replace inline citations"]
    F -->|no| H["Fix issues → re-run"]
    G --> I["output/*.docx"]

Installation

pip install cdtm-tstools

Or for local development:

uv venv
uv sync

Usage

Run from a directory containing a data/ folder with your .docx files:

# Basic run (validate + deduplicate, no replacement)
tstools

# Point to a different data folder
tstools --data-dir data/fall26

# Enable inline citation replacement
tstools --replace

# Force replacement even with unresolved issues
tstools --replace --force

# Custom file processing order
tstools --file-order my_order.json

# All together
tstools --data-dir data/fall26 --replace --force --file-order my_order.json
Flag               Description
--data-dir PATH    Data directory (default: data/spring26)
--replace          Run inline citation replacement
--force            Proceed with replacement despite unresolved issues
--file-order PATH  JSON file with the file processing order

Equivalent module invocations: python -m tstools or python -m tstools.main.

File order

FILE_ORDER in tstools/__init__.py defines the default processing order. Listed files need not exist yet; only files that are present on disk are processed. Any file on disk that is not listed in FILE_ORDER is appended at the end alphabetically.

Override it from outside in two ways:

1. CLI flag — pass --file-order path/to/file_order.json

2. Auto-detected — place a file_order.json in your data directory:

[
    "E-Human-AI Teams-Intro.docx",
    "E-Human-AI Teams-2.docx",
    "T-Society-1.docx"
]

If neither is provided, the built-in list from tstools/__init__.py is used.
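
A minimal sketch of this resolution order in plain Python (function names such as `effective_file_order` and `resolve_order` are illustrative, not the package's actual API):

```python
import json
from pathlib import Path

# Stand-in for the built-in list in tstools/__init__.py.
BUILT_IN_ORDER = ["E-Human-AI Teams-Intro.docx", "E-Human-AI Teams-2.docx",
                  "T-Society-1.docx"]

def effective_file_order(present, order):
    """Listed files that exist on disk come first, in list order; any
    unlisted on-disk files are appended alphabetically."""
    listed = [f for f in order if f in present]
    extras = sorted(f for f in present if f not in order)
    return listed + extras

def resolve_order(data_dir, cli_order_path=None):
    """The CLI flag beats an auto-detected file_order.json in the data
    directory, which beats the built-in list."""
    data_dir = Path(data_dir)
    auto = data_dir / "file_order.json"
    if cli_order_path:
        order = json.loads(Path(cli_order_path).read_text())
    elif auto.exists():
        order = json.loads(auto.read_text())
    else:
        order = BUILT_IN_ORDER
    present = sorted(p.name for p in data_dir.glob("*.docx"))
    return effective_file_order(present, order)
```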

Incremental deduplication

dedup_map.json is a persistent registry — unique IDs are permanent once assigned.

On each run:

  • Citations already in the registry are carried forward unchanged.
  • New citations are matched against the full registry (URL/DOI → exact text → title), then against each other.
  • Registry matches reuse the existing unique ID; genuinely new citations get the next available number.
  • Existing numbers are never reassigned.
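
A simplified sketch of the matching cascade, assuming citations and registry entries are plain dicts (the real schema of dedup_map.json may differ):

```python
import re

def normalize(text):
    """Collapse whitespace and lowercase for comparison."""
    return re.sub(r"\s+", " ", text).strip().lower()

def assign_unique_id(citation, registry):
    """Match a new citation against the registry: URL/DOI first, then
    exact text, then title; otherwise hand out the next free number.
    Existing IDs are never touched."""
    for entry in registry:
        if citation.get("url") and citation["url"] == entry.get("url"):
            return entry["unique_id"]
    for entry in registry:
        if normalize(citation["text"]) == normalize(entry["text"]):
            return entry["unique_id"]
    for entry in registry:
        if citation.get("title") and entry.get("title") and \
                normalize(citation["title"]) == normalize(entry["title"]):
            return entry["unique_id"]
    return max((e["unique_id"] for e in registry), default=0) + 1
```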

Human review

Fuzzy matches (≥ 0.70 similarity) are flagged review_needed in the map. After reviewing:

Decision              Edit in dedup_map.json
Confirmed duplicate   Set duplicate_of and match_type: "manual"
Confirmed distinct    Set review_flag: "confirmed_distinct"
Manual ID assignment  Set unique_id to the desired number and match_type: "manual"

The pipeline never overwrites entries that are already in the registry. A warning is printed at load time if two canonical entries share the same unique_id.

Outputs

File              Contents
citations.csv     All citations + validation issues + dedup metadata
bibliography.csv  UniqueID → Citation → SourceIDs
dedup_map.json    Persistent registry (append-only)
output/issues.md  Human work queue — validation issues by file
output/*.docx     Inline citations replaced, References section removed

Validation (AMA 11th ed.)

Runs on every present file each run. Issues appear in output/issues.md until fixed in the source .docx.

#  Check                           Code
—  Bare URL                        url_only
1  Author                          missing_author
2  Title                           missing_title
3  Source                          missing_source
4  Year                            missing_year
5  Locator (DOI / URL / vol-page)  missing_locator
6  Accessed date when URL, no DOI  missing_accessed
7  No accessed date when DOI       unnecessary_accessed
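
Checks 6 and 7 can be sketched with a pair of regexes (hypothetical patterns, not the ones validator.py actually uses):

```python
import re

DOI_RE = re.compile(r"doi\.org/|(?:^|\s)doi:\s*10\.", re.I)
URL_RE = re.compile(r"https?://", re.I)
ACCESSED_RE = re.compile(r"\bAccessed\b", re.I)

def accessed_date_issues(citation_text):
    """A URL without a DOI needs an accessed date (check 6); a DOI makes
    an accessed date unnecessary (check 7)."""
    has_doi = bool(DOI_RE.search(citation_text))
    has_bare_url = bool(URL_RE.search(citation_text)) and not has_doi
    has_accessed = bool(ACCESSED_RE.search(citation_text))
    issues = []
    if has_bare_url and not has_accessed:
        issues.append("missing_accessed")
    if has_doi and has_accessed:
        issues.append("unnecessary_accessed")
    return issues
```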

Deduplication phases

  1. URL / DOI — same locator → definitive duplicate
  2. Exact text — normalised match → definitive duplicate
  3. Title — normalised title segment match → definitive duplicate
  4. Fuzzy ≥ 0.70 — flagged review_needed, not auto-merged
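
The fuzzy phase can be sketched with the standard library's SequenceMatcher (a plausible stand-in for whatever similarity measure deduplicator.py actually uses):

```python
from difflib import SequenceMatcher

def fuzzy_review_flag(a, b, threshold=0.70):
    """Phase 4: similarity at or above the threshold flags the pair for
    human review; it is never auto-merged."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return "review_needed" if ratio >= threshold else None
```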

Inline citation replacement

Input formats supported (from source .docx):

Format                Example input  Output
Brackets — single     [1]            [42]
Brackets — list       [1,2]          [18,27]
Brackets — range      [1-4]          [42-45]
Superscript — single                 [42]
Superscript — list    ²˒³            [18,27]

All output is plain square brackets. Superscript formatting is removed. Punctuation position is preserved ([1]. → [42].).
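
The bracket cases can be sketched as a single regex substitution (a simplified stand-in for inline_replacer.py; superscript handling is omitted, and ranges are assumed to map to contiguous unique IDs, as in [1-4] → [42-45]):

```python
import re

# Same shape as the pipeline's bracket pattern.
BRACKET_RE = re.compile(r"\[(\d[\d,\s\-]*)\]")

def replace_brackets(text, local_to_uid):
    """Rewrite local bracket citations using a local-number → unique-ID map."""
    def sub(match):
        parts = []
        for piece in match.group(1).split(","):
            piece = piece.strip()
            if "-" in piece:
                lo, hi = piece.split("-")
                parts.append(f"{local_to_uid[int(lo)]}-{local_to_uid[int(hi)]}")
            else:
                parts.append(str(local_to_uid[int(piece)]))
        return "[" + ",".join(parts) + "]"
    return BRACKET_RE.sub(sub, text)
```

Because the substitution only touches the bracketed span, surrounding punctuation is preserved automatically.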

File naming

E-Human-AI Teams-2.docx  →  E-Human-AI Teams-2-1, E-Human-AI Teams-2-2, …
T-Society-1.docx         →  T-Society-1-1, T-Society-1-2, …

Run tracking and freezing

Each pipeline run is logged in data/<semester>/logs/run_log.json. After a run, you must verify the outputs and set the verified flags to true in run_log.json. The next run then automatically marks the previous entry as frozen.

Flag                          What to check
verified.duplicates           Review output/duplicates.md
verified.not_used             Review output/not_used.md
verified.inline_substitution  Spot-check output/*.docx files

Frozen runs protect UIDs — citations from frozen runs keep their assigned numbers forever, and the pipeline warns if their bibliography entries are modified. Unfrozen UIDs are reassigned in body-text order on every run, so UIDs stay contiguous as long as nothing is frozen.

Use --skip-gate to bypass the verification gate during iterative development.

Known bugs fixed (v0.2)

1. UIDs assigned in reference-list order instead of body-text order

Symptom: The first citation in the body text (e.g., [23]) got a high UID like 75, while [1] (which appears late in the text) got UID 61.

Root cause: deduplicate() assigns temporary UIDs in reference-list order (the order entries appear in the References section). reorder_unique_ids() was then pre-seeding from dedup_results, which already had every UID filled in — making the body-text reorder a complete no-op.

Fix: reorder_unique_ids() now only pre-seeds UIDs from frozen runs (via frozen_uids parameter). All other UIDs are discarded and reassigned in the order citations first appear in body text across files in FILE_ORDER.
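
The corrected behaviour can be sketched as follows (simplified: the real reorder_unique_ids() works on citation records, not bare keys):

```python
def reorder_unique_ids(body_order, frozen_uids):
    """Assign contiguous UIDs in body-text order, keeping frozen UIDs sticky.

    body_order:  citation keys in order of first appearance in body text.
    frozen_uids: key -> UID mapping for citations from frozen runs.
    """
    assigned = dict(frozen_uids)          # only frozen UIDs are pre-seeded
    used = set(assigned.values())
    next_uid = 1
    for key in body_order:
        if key in assigned:
            continue                      # frozen: keep its number forever
        while next_uid in used:           # skip numbers frozen runs occupy
            next_uid += 1
        assigned[key] = next_uid
        used.add(next_uid)
    return assigned
```

With nothing frozen, every run yields contiguous UIDs in body-text order, which is also what closes the gap described in bug #2.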

2. UID gaps after merging duplicates

Symptom: After manually flagging a citation as a duplicate (e.g., merging UID 40 into UID 1), UIDs went 1–39, 41–84 with a gap at 40.

Root cause: Same as bug #1 — all previous UIDs were treated as sticky regardless of frozen status. UIDs 41–84 were preserved at their old values instead of shifting down.

Fix: Only UIDs from frozen runs are sticky. When nothing is frozen, every re-run produces contiguous UIDs (1, 2, 3, ...) with no gaps.

3. Split-run bracket citations not detected or replaced

Symptom: Some inline citations like [5,10] or [11,12] were silently skipped during both body-text scanning and replacement. The output .docx still contained unreplaced local numbers.

Root cause: Word internally splits text into runs (formatting spans). A single [5,10] can be stored as three separate runs: [5, / 10 / ]. The pipeline's regex (\[(\d[\d,\s\-]*)\]) scans one run at a time and needs the full bracket pattern in a single string. Split brackets never matched.

Fix: Added _collapse_split_brackets() — a state-machine pre-pass that walks paragraph runs, detects incomplete bracket patterns at run boundaries, and merges them into single runs before replacement. Handles patterns like:

  • [5, / 10 / ] → [5,10]
  • [11 / ,12]. → [11,12].
  • [ / 10 / ] → [10]
  • Chained splits where one merge exposes another (e.g., ]...text [4, / 9 / , / 15 / ].)
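
The pre-pass can be sketched on plain strings standing in for python-docx run texts (a simplified model; the real _collapse_split_brackets() must also merge the underlying run objects and their formatting):

```python
def collapse_split_brackets(runs):
    """Merge consecutive run texts so every bracket citation lives in one
    string. `runs` is a list of strings, one per formatting run."""
    merged, buffer = [], ""
    for text in runs:
        if buffer:                                   # inside an unclosed bracket
            buffer += text
            if "]" in buffer:                        # bracket finally closed
                merged.append(buffer)
                buffer = ""
            continue
        start = text.rfind("[")
        if start != -1 and "]" not in text[start:]:  # bracket opens but never closes
            merged.append(text[:start])
            buffer = text[start:]
        else:
            merged.append(text)
    if buffer:
        merged.append(buffer)                        # unclosed at paragraph end
    return [t for t in merged if t]
```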

4. DOIs not extracted when on a separate line

Symptom: Citations with a DOI on its own line (after a line break within the same reference entry) were flagged missing_locator even though the DOI was present.

Root cause: The extraction logic treated each line independently. When a citation's text ended on one line and the DOI started on the next, the DOI line was discarded as a non-citation line.

Fix: extract_citations_from_file() now merges continuation lines matching BARE_DOI_RE (standalone DOI patterns like doi:10.xxxx/...) back into the preceding citation entry.
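
A sketch of the continuation merge (the regex is a plausible stand-in for BARE_DOI_RE, whose exact pattern lives in patterns.py):

```python
import re

# Matches a line that is nothing but a DOI (bare doi: form or doi.org URL).
BARE_DOI_RE = re.compile(
    r"^\s*(?:doi:\s*10\.\S+|https?://doi\.org/\S+)\s*$", re.I)

def merge_doi_continuations(lines):
    """Fold a standalone-DOI line back into the preceding citation line."""
    merged = []
    for line in lines:
        if merged and BARE_DOI_RE.match(line):
            merged[-1] = merged[-1].rstrip() + " " + line.strip()
        else:
            merged.append(line)
    return merged
```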

5. DOIs stripped by accessed-date cleanup

Symptom: A citation had both an "Accessed" date and a DOI (e.g., ...Accessed March 6, 2026. https://doi.org/...). The auto-fix for "unnecessary accessed date when DOI present" stripped the accessed date and the DOI/URL that followed it.

Root cause: ACCESSED_TAIL_RE matched greedily from "Accessed" through the end of the string, removing everything — including the locator.

Fix: fix_citations() now extracts and re-appends any DOI or URL locator found in the stripped tail before discarding the accessed date portion.
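
A sketch of the corrected cleanup (hypothetical patterns; the real fix_citations() applies this only to citations flagged unnecessary_accessed):

```python
import re

# Everything from "Accessed" through its terminating period, then the rest.
ACCESSED_TAIL_RE = re.compile(r"\s*Accessed [^.]+\.\s*(?P<tail>.*)$", re.I)
LOCATOR_RE = re.compile(r"(?:https?://\S+|doi:\s*10\.\S+)", re.I)

def strip_unnecessary_accessed(citation):
    """Drop the accessed date but re-append any DOI/URL found after it."""
    m = ACCESSED_TAIL_RE.search(citation)
    if not m:
        return citation
    head = citation[:m.start()].rstrip()
    locator = LOCATOR_RE.search(m.group("tail"))
    return head + (" " + locator.group(0) if locator else "")
```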

6. Double periods after inline citations

Symptom: Some replaced citations produced [42].. (two periods) instead of [42].

Root cause: When Word stored the closing ] and . in separate runs, the bracket collapse step merged them in a way that preserved the original period while the run that followed also contributed one.

Fix: The collapse logic now correctly transfers trailing punctuation from consumed runs so that no character is duplicated.

7. Body-text scan misordered split-run citations

Symptom: UIDs for a file were not sequential in body-text order. E.g., in T_Legal_5 the first citation in the text ([3, 6]) got UIDs 224/225 while a later citation ([4]) got UID 221.

Root cause: _scan_body_order() had the same split-run problem as bug #3, but in the scanning phase. It processed intact per-run brackets first (finding [4], [8], etc.), then fell back to para.text for split brackets ([3, 6]). Since [3, 6] was split across runs ([3, / 6 / ]), it was added to the order after all intact brackets — even though it appears first in the text.

Fix: _scan_body_order() now calls _collapse_split_brackets() on each paragraph's runs before scanning, exactly like replace_inline_citations() does. This merges split brackets into single runs so the per-run scan finds them in their correct text position.

Structure

tstools/
├── __init__.py              paths + FILE_ORDER
├── main.py                  orchestrator
├── patterns.py              centralized regex patterns
├── runs.py                  run tracking, verification gates, freeze logic
├── utils/
│   ├── utils.py             extract, clean, save, verify, registry load/save
│   ├── inline_replacer.py   bracket + superscript → bracket replacement
│   └── cite_from_url.py     URL → AMA citation (crawl4ai + OpenAI)
├── validation/
│   └── validator.py         AMA format checks
└── unique/
    └── deduplicator.py      four-phase incremental dedup
