
Citation pipeline for CDTM trend seminars


Spring 26 — Citation Pipeline

Extract, validate, deduplicate, and replace inline citations across .docx trend-phase files.

Pipeline

flowchart TD
    A["FILE_ORDER (*.docx present)"] --> B["Extract + clean"]
    B --> C["Validate (AMA 11th)"]
    C --> D["Incremental dedup (check registry first)"]
    D --> E["Save CSV / JSON / dedup_map"]
    E --> F{"Ready?"}
    F -->|yes / FORCE| G["Replace inline citations"]
    F -->|no| H["Fix issues → re-run"]
    G --> I["output/*.docx"]

Installation

pip install cdtm-tstools

Or for local development:

uv venv
uv sync

Usage

Run from a directory containing a data/ folder with your .docx files:

# Basic run (validate + deduplicate, no replacement)
tstools

# Point to a different data folder
tstools --data-dir data/fall26

# Enable inline citation replacement
tstools --replace

# Force replacement even with unresolved issues
tstools --replace --force

# Custom file processing order
tstools --file-order my_order.json

# All together
tstools --data-dir data/fall26 --replace --force --file-order my_order.json

Flag                Description
--data-dir PATH     Data directory (default: data/spring26)
--replace           Run inline citation replacement
--force             Proceed with replacement despite unresolved issues
--file-order PATH   JSON file with the file processing order

Equivalent module invocations: python -m tstools or python -m tstools.main.
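
The flags above could be modelled with argparse roughly as follows. This is a hedged sketch, not the package's actual parser: build_parser and should_replace are hypothetical names, and the gating rule (replace only when there are no unresolved issues, unless --force is given) is inferred from the "Ready?" step in the flowchart.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="tstools")
    parser.add_argument("--data-dir", type=Path, default=Path("data/spring26"),
                        help="Data directory")
    parser.add_argument("--replace", action="store_true",
                        help="Run inline citation replacement")
    parser.add_argument("--force", action="store_true",
                        help="Proceed with replacement despite unresolved issues")
    parser.add_argument("--file-order", type=Path, default=None,
                        help="JSON file with the file processing order")
    return parser


def should_replace(args: argparse.Namespace, unresolved_issues: int) -> bool:
    """Assumed gate: replacement runs only if requested and either clean or forced."""
    return args.replace and (args.force or unresolved_issues == 0)
```

With this sketch, `--replace` alone is blocked while issues remain, and `--replace --force` proceeds regardless.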

File order

FILE_ORDER in tstools/__init__.py defines the default processing order. Files do not need to exist yet — only files that are present on disk are processed. Any file on disk not listed in FILE_ORDER is appended at the end alphabetically.

The default order can be overridden in two ways:

1. CLI flag — pass --file-order path/to/file_order.json

2. Auto-detected — place a file_order.json in your data directory:

[
    "E-Human-AI Teams-Intro.docx",
    "E-Human-AI Teams-2.docx",
    "T-Society-1.docx"
]

If neither is provided, the built-in list from tstools/__init__.py is used.
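
The ordering rules in this section can be sketched as below. The function names are hypothetical, and two details are assumptions rather than documented behaviour: that the CLI flag takes precedence over an auto-detected file_order.json, and that "alphabetically" means Python's default string sort.

```python
import json
from pathlib import Path


def load_file_order(cli_path, data_dir: Path, default: list[str]) -> list[str]:
    """Pick the order list: CLI file, then data_dir/file_order.json, then default."""
    for path in (cli_path, data_dir / "file_order.json"):
        if path is not None and Path(path).is_file():
            return json.loads(Path(path).read_text(encoding="utf-8"))
    return list(default)


def order_files(file_order: list[str], present: set[str]) -> list[str]:
    """Listed files that exist keep their listed order; the rest follow, sorted."""
    listed = [name for name in file_order if name in present]
    unlisted = sorted(present - set(file_order))
    return listed + unlisted
```

For example, an order of `["b.docx", "a.docx", "missing.docx"]` with `a.docx`, `b.docx`, and `c.docx` on disk would be processed as `b.docx`, `a.docx`, `c.docx`.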

Incremental deduplication

dedup_map.json is a persistent registry — unique IDs are permanent once assigned.

On each run:

  • Citations already in the registry are carried forward unchanged.
  • New citations are matched against the full registry (URL/DOI → exact text → title), then against each other.
  • Registry matches reuse the existing unique ID; genuinely new citations get the next available number.
  • Existing numbers are never reassigned.
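
A minimal sketch of the ID assignment rule, under simplifying assumptions: the registry is reduced to a mapping from a normalised citation key to its unique ID, matching is exact-key only, and "next available number" is taken to mean one more than the highest ID in use. None of these names come from the package.

```python
def assign_ids(registry: dict[str, int], citation_keys: list[str]) -> dict[str, int]:
    """Extend the registry in place; IDs that already exist are never changed."""
    next_id = max(registry.values(), default=0) + 1
    for key in citation_keys:
        if key in registry:
            continue  # carried forward unchanged
        registry[key] = next_id  # genuinely new citation
        next_id += 1
    return registry
```

Because existing keys are skipped rather than rewritten, re-running the function over the same citations leaves the registry unchanged.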

Human review

Fuzzy matches (≥ 0.70 similarity) are flagged review_needed in the map. After reviewing:

Decision               Edit in dedup_map.json
Confirmed duplicate    Set duplicate_of, match_type: "manual"
Confirmed distinct     Set review_flag: "confirmed_distinct"
Manual ID assignment   Set unique_id to desired number, match_type: "manual"

The pipeline never overwrites entries that are already in the registry. A warning is printed at load time if two canonical entries share the same unique_id.
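
The load-time check for shared IDs might look like the sketch below. The layout of dedup_map.json is not documented here, so this assumes a mapping from source ID to an entry dict, where an entry is canonical when its duplicate_of field is absent or null. find_id_collisions is a hypothetical helper.

```python
from collections import defaultdict


def find_id_collisions(entries: dict[str, dict]) -> dict[int, list[str]]:
    """Return unique_id -> source IDs for IDs claimed by more than one canonical entry."""
    by_id: dict[int, list[str]] = defaultdict(list)
    for source_id, entry in entries.items():
        if entry.get("duplicate_of") is None:  # assumed marker of a canonical entry
            by_id[entry["unique_id"]].append(source_id)
    return {uid: ids for uid, ids in by_id.items() if len(ids) > 1}
```

A non-empty result is the situation in which a warning would be printed, for instance after a manual ID assignment that reuses a number.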

Outputs

File               Contents
citations.csv      All citations + validation issues + dedup metadata
bibliography.csv   UniqueID → Citation → SourceIDs
dedup_map.json     Persistent registry (append-only)
output/issues.md   Human work queue: validation issues by file
output/*.docx      Inline citations replaced, References section removed

Validation (AMA 11th ed.)

Validation runs on every file present on disk, on every run. Issues appear in output/issues.md until they are fixed in the source .docx.

#   Check                              Code
    Bare URL                           url_only
1   Author                             missing_author
2   Title                              missing_title
3   Source                             missing_source
4   Year                               missing_year
5   Locator (DOI / URL / vol-page)     missing_locator
6   Accessed date when URL, no DOI     missing_accessed
7   No accessed date when DOI          unnecessary_accessed
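
Checks 6 and 7 are two sides of one rule, which can be sketched as follows. The function and its boolean parameters are illustrative only; how the validator actually detects a URL, DOI, or accessed date in a citation string is not shown here.

```python
def accessed_date_issues(has_url: bool, has_doi: bool, has_accessed: bool) -> list[str]:
    """Return the issue codes for checks 6 and 7 given what a citation contains."""
    issues = []
    if has_url and not has_doi and not has_accessed:
        issues.append("missing_accessed")  # check 6: URL-only source needs a date
    if has_doi and has_accessed:
        issues.append("unnecessary_accessed")  # check 7: a DOI makes the date redundant
    return issues
```

The two conditions are mutually exclusive, so a citation receives at most one of these codes.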

Deduplication phases

  1. URL / DOI — same locator → definitive duplicate
  2. Exact text — normalised match → definitive duplicate
  3. Title — normalised title segment match → definitive duplicate
  4. Fuzzy ≥ 0.70 — flagged review_needed, not auto-merged
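
The four phases can be illustrated with a pairwise matcher like the one below. It is a sketch under stated assumptions: citations are dicts with hypothetical "locator", "text", and "title" fields, normalisation is just lowercasing plus whitespace collapsing, and similarity is measured with difflib.SequenceMatcher, which may differ from the measure the pipeline uses.

```python
import re
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalisation)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def match(new: dict, existing: dict, threshold: float = 0.70):
    """Return (match_type, review_needed), or None when the citations differ."""
    if new.get("locator") and new["locator"] == existing.get("locator"):
        return ("locator", False)  # phase 1: same URL / DOI
    a, b = normalise(new["text"]), normalise(existing["text"])
    if a == b:
        return ("exact_text", False)  # phase 2
    if new.get("title") and normalise(new["title"]) == normalise(existing.get("title", "")):
        return ("title", False)  # phase 3
    if SequenceMatcher(None, a, b).ratio() >= threshold:
        return ("fuzzy", True)  # phase 4: flagged for review, not merged
    return None
```

Ordering the checks from strictest to loosest means a fuzzy flag is only raised when none of the definitive phases matched.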

Inline citation replacement

Input formats supported (from source .docx):

Format                 Example input   Output
Brackets (single)      [1]             [42]
Brackets (list)        [1,2]           [18,27]
Brackets (range)       [1-4]           [42-45]
Superscript (single)                   [42]
Superscript (list)     ²˒³             [18,27]

All output is plain square brackets. Superscript formatting is removed. Punctuation position is preserved ([1]. becomes [42].).
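
The bracket cases can be sketched with a regular expression over plain text. This is an illustration, not the code in inline_replacer.py: it ignores .docx runs and superscripts entirely, id_map is a hypothetical mapping from local citation number to unique ID, and collapsing runs of three or more consecutive IDs into a range is an assumption based on the [1-4] example.

```python
import re

BRACKET = re.compile(r"\[(\d+(?:\s*[-,]\s*\d+)*)\]")


def expand(spec: str) -> list[int]:
    """Turn '1,3-5' into [1, 3, 4, 5]."""
    numbers: list[int] = []
    for part in spec.split(","):
        bounds = [int(n) for n in part.split("-")]
        numbers.extend(range(bounds[0], bounds[-1] + 1))
    return numbers


def compress(ids: list[int]) -> str:
    """Sort IDs and write runs of three or more consecutive values as a range."""
    ids = sorted(set(ids))
    parts, i = [], 0
    while i < len(ids):
        j = i
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        if j - i >= 2:
            parts.append(f"{ids[i]}-{ids[j]}")
        else:
            parts.extend(str(n) for n in ids[i:j + 1])
        i = j + 1
    return ",".join(parts)


def replace_brackets(text: str, id_map: dict[int, int]) -> str:
    """Rewrite every bracketed citation using id_map; raises KeyError on unknown numbers."""
    return BRACKET.sub(
        lambda m: "[" + compress([id_map[n] for n in expand(m.group(1))]) + "]", text
    )
```

Only the bracketed span is rewritten, so any punctuation after the closing bracket stays where it was.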

File naming

E-Human-AI Teams-2.docx  →  E-Human-AI Teams-2-1, E-Human-AI Teams-2-2, …
T-Society-1.docx          →  T-Society-1-1, T-Society-1-2, …
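
The naming scheme above amounts to the file stem followed by a 1-based counter. A small sketch, with source_ids as a hypothetical helper name:

```python
from pathlib import Path


def source_ids(filename: str, count: int) -> list[str]:
    """Build per-citation source IDs: '<file stem>-<n>' for n = 1..count."""
    stem = Path(filename).stem  # drops the .docx extension
    return [f"{stem}-{n}" for n in range(1, count + 1)]
```

Because the stem itself may end in a number (as in T-Society-1), the final hyphen-separated field is always the citation index.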

Structure

tstools/
├── __init__.py              paths + FILE_ORDER
├── main.py                  orchestrator
├── utils/
│   ├── utils.py             extract, clean, save, verify, registry load/save
│   ├── inline_replacer.py   bracket + superscript → bracket replacement
│   └── cite_from_url.py     URL → AMA citation (crawl4ai + OpenAI)
├── validation/
│   └── validator.py         AMA format checks
└── unique/
    └── deduplicator.py      four-phase incremental dedup
