Citation pipeline for CDTM trend seminars

Spring 26 — Citation Pipeline

Extract, validate, deduplicate, and replace inline citations across .docx trend-phase files.

Pipeline

flowchart TD
    A["FILE_ORDER (*.docx present)"] --> B["Extract + clean"]
    B --> C["Validate (AMA 11th)"]
    C --> D["Incremental dedup (check registry first)"]
    D --> E["Save CSV / JSON / dedup_map"]
    E --> F{"Ready?"}
    F -->|yes / FORCE| G["Replace inline citations"]
    F -->|no| H["Fix issues → re-run"]
    G --> I["output/*.docx"]

Installation

pip install cdtm-tstools

Or for local development:

uv venv
uv sync

Usage

Run from a directory containing a data/ folder with your .docx files:

# Basic run (validate + deduplicate, no replacement)
tstools

# Point to a different data folder
tstools --data-dir data/fall26

# Enable inline citation replacement
tstools --replace

# Force replacement even with unresolved issues
tstools --replace --force

# Custom file processing order
tstools --file-order my_order.json

# All together
tstools --data-dir data/fall26 --replace --force --file-order my_order.json

| Flag | Description |
| --- | --- |
| --data-dir PATH | Data directory (default: data/spring26) |
| --replace | Run inline citation replacement |
| --force | Proceed with replacement despite unresolved issues |
| --file-order PATH | JSON file with the file processing order |

Equivalent module invocations: python -m tstools or python -m tstools.main.
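
As a rough illustration of how the flags above could map onto an argument parser and the "Ready?" gate in the flowchart, here is a minimal sketch. It is not the package's actual code: `build_parser` and `should_replace` are hypothetical names, and treating "ready" as "zero unresolved issues" is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="tstools")
    parser.add_argument("--data-dir", default="data/spring26",
                        help="Data directory")
    parser.add_argument("--replace", action="store_true",
                        help="Run inline citation replacement")
    parser.add_argument("--force", action="store_true",
                        help="Proceed with replacement despite unresolved issues")
    parser.add_argument("--file-order", default=None,
                        help="JSON file with the file processing order")
    return parser


def should_replace(replace: bool, force: bool, unresolved_issues: int) -> bool:
    """Assumed gate: replace only on request, and only if clean or forced."""
    if not replace:
        return False
    return unresolved_issues == 0 or force


args = build_parser().parse_args(["--data-dir", "data/fall26", "--replace"])
print(args.data_dir, should_replace(args.replace, args.force, unresolved_issues=3))
```

With three unresolved issues and no --force, the sketch declines to replace, which matches the "Fix issues → re-run" branch of the flowchart.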

File order

FILE_ORDER in tstools/__init__.py defines the default processing order. Files do not need to exist yet — only files that are present on disk are processed. Any file on disk not listed in FILE_ORDER is appended at the end alphabetically.
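
The ordering rule described above can be sketched as follows. This is an illustration under assumptions, not the package's implementation: `resolve_order` is a hypothetical helper that takes plain file names rather than reading the disk.

```python
def resolve_order(file_order: list[str], present: list[str]) -> list[str]:
    """Order the files on disk: listed files first, unlisted ones alphabetically."""
    present_set = set(present)
    # Keep only listed files that actually exist, preserving the listed order.
    listed = [name for name in dict.fromkeys(file_order) if name in present_set]
    # Anything on disk that the list does not mention goes last, sorted.
    extras = sorted(present_set.difference(listed))
    return listed + extras


order = resolve_order(
    ["E-Human-AI Teams-Intro.docx", "E-Human-AI Teams-2.docx", "T-Society-1.docx"],
    ["T-Society-1.docx", "Z-Extra.docx", "E-Human-AI Teams-2.docx", "A-Extra.docx"],
)
print(order)
```

Here the missing Intro file is skipped silently, and the two unlisted files are appended in alphabetical order.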

Override it from outside in two ways:

1. CLI flag — pass --file-order path/to/file_order.json

2. Auto-detected — place a file_order.json in your data directory:

[
    "E-Human-AI Teams-Intro.docx",
    "E-Human-AI Teams-2.docx",
    "T-Society-1.docx"
]

If neither is provided, the built-in list from tstools/__init__.py is used.
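
A minimal sketch of this lookup is shown below. The names `read_order` and `load_file_order` are hypothetical, and the precedence (CLI flag first, then the auto-detected file, then the built-in list) is an assumption, since the text above does not say which override wins when both are present.

```python
import json
import tempfile
from pathlib import Path


def read_order(path: Path) -> list[str]:
    """Read a JSON array of file names and reject any other shape."""
    order = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(order, list) or not all(isinstance(n, str) for n in order):
        raise ValueError(f"{path} must contain a JSON array of file names")
    return order


def load_file_order(data_dir, cli_path=None, builtin=()):
    """Assumed precedence: CLI path, then data_dir/file_order.json, then built-in."""
    if cli_path is not None:
        return read_order(Path(cli_path))
    auto = Path(data_dir) / "file_order.json"
    if auto.is_file():
        return read_order(auto)
    return list(builtin)


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    before = load_file_order(data_dir, builtin=["T-Society-1.docx"])
    (data_dir / "file_order.json").write_text(
        json.dumps(["E-Human-AI Teams-2.docx"]), encoding="utf-8"
    )
    after = load_file_order(data_dir, builtin=["T-Society-1.docx"])

print(before, after)
```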

Incremental deduplication

dedup_map.json is a persistent registry — unique IDs are permanent once assigned.

On each run:

  • Citations already in the registry are carried forward unchanged.
  • New citations are matched against the full registry (URL/DOI → exact text → title), then against each other.
  • Registry matches reuse the existing unique ID; genuinely new citations get the next available number.
  • Existing numbers are never reassigned.
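
The ID rules above can be illustrated with a small sketch. It is deliberately simplified and rests on assumptions: the registry is modelled as a plain mapping from normalised citation text to unique ID, matching is reduced to exact normalised text (the real pipeline also matches on URL/DOI and title), and "next available number" is taken to mean one more than the current maximum. `normalise` and `assign_ids` are hypothetical names.

```python
def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def assign_ids(registry: dict[str, int], citations: list[str]) -> dict[str, int]:
    """Return an updated registry; existing IDs are never changed."""
    updated = dict(registry)
    next_id = max(updated.values(), default=0) + 1
    for citation in citations:
        key = normalise(citation)
        if key not in updated:  # genuinely new: take the next free number
            updated[key] = next_id
            next_id += 1
    return updated


registry = {"smith j. ai teams. 2024.": 1, "lee k. society. 2023.": 2}
updated = assign_ids(
    registry,
    ["Smith J.  AI Teams. 2024.", "New A. Thing. 2025.", "new a. thing. 2025."],
)
print(updated)
```

The first citation reuses ID 1, and the two spellings of the new citation share the single new ID 3.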

Human review

Fuzzy matches (≥ 0.70 similarity) are flagged review_needed in the map. After reviewing:

| Decision | Edit in dedup_map.json |
| --- | --- |
| Confirmed duplicate | Set duplicate_of, match_type: "manual" |
| Confirmed distinct | Set review_flag: "confirmed_distinct" |
| Manual ID assignment | Set unique_id to desired number, match_type: "manual" |

The pipeline never overwrites entries that are already in the registry. A warning is printed at load time if two canonical entries share the same unique_id.
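
A load-time check of this kind might look like the sketch below. The registry layout is an assumption: entries are keyed by source ID, each entry is a dict with a `unique_id`, and an entry counts as canonical when it has no `duplicate_of` value. `find_id_collisions` is a hypothetical name.

```python
from collections import defaultdict


def find_id_collisions(registry: dict[str, dict]) -> dict[int, list[str]]:
    """Map each unique_id shared by several canonical entries to their keys."""
    by_id: dict[int, list[str]] = defaultdict(list)
    for source_id, entry in registry.items():
        if entry.get("duplicate_of"):  # duplicates legitimately share an ID
            continue
        by_id[entry["unique_id"]].append(source_id)
    return {uid: ids for uid, ids in by_id.items() if len(ids) > 1}


registry = {
    "T-Society-1-1": {"unique_id": 7},
    "T-Society-1-2": {"unique_id": 7, "duplicate_of": "T-Society-1-1"},
    "E-Human-AI Teams-2-1": {"unique_id": 7, "match_type": "manual"},
    "E-Human-AI Teams-2-2": {"unique_id": 8},
}
collisions = find_id_collisions(registry)
for uid, ids in collisions.items():
    print(f"warning: unique_id {uid} is shared by canonical entries {ids}")
```

Such a collision can arise after a manual ID assignment, which is why the check runs when the registry is loaded rather than when it is written.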

Outputs

| File | Contents |
| --- | --- |
| citations.csv | All citations + validation issues + dedup metadata |
| bibliography.csv | UniqueID → Citation → SourceIDs |
| dedup_map.json | Persistent registry (append-only) |
| output/issues.md | Human work queue: validation issues by file |
| output/*.docx | Inline citations replaced, References section removed |

Validation (AMA 11th ed.)

Validation runs on every file present, on every run. Issues appear in output/issues.md until they are fixed in the source .docx.

| # | Check | Code |
| --- | --- | --- |
| | Bare URL | url_only |
| 1 | Author | missing_author |
| 2 | Title | missing_title |
| 3 | Source | missing_source |
| 4 | Year | missing_year |
| 5 | Locator (DOI / URL / vol-page) | missing_locator |
| 6 | Accessed date when URL, no DOI | missing_accessed |
| 7 | No accessed date when DOI | unnecessary_accessed |
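
To make the locator-related checks concrete (the bare-URL check and rules 5 to 7), here is a minimal sketch. It is not the validator's actual logic: the regular expressions are rough heuristics chosen for illustration, and `check_locator_rules` is a hypothetical name. Only the issue codes come from the table above.

```python
import re

URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
DOI_RE = re.compile(
    r"\b(?:doi:\s*|https?://(?:dx\.)?doi\.org/)10\.\d{4,9}/\S+", re.IGNORECASE
)
# Heuristic for AMA volume/page locators such as "2023;12(3):45-67".
VOL_PAGE_RE = re.compile(r";\s*\d+(?:\(\d+\))?:\s*[A-Za-z]?\d+")
ACCESSED_RE = re.compile(r"\baccessed\b", re.IGNORECASE)


def check_locator_rules(citation: str) -> list[str]:
    """Return issue codes for the bare-URL check and rules 5, 6 and 7."""
    text = citation.strip()
    if URL_RE.fullmatch(text):
        return ["url_only"]
    has_doi = bool(DOI_RE.search(text))
    has_url = bool(URL_RE.search(text))
    has_accessed = bool(ACCESSED_RE.search(text))
    issues = []
    if not (has_doi or has_url or VOL_PAGE_RE.search(text)):
        issues.append("missing_locator")
    if has_url and not has_doi and not has_accessed:
        issues.append("missing_accessed")
    if has_doi and has_accessed:
        issues.append("unnecessary_accessed")
    return issues


print(check_locator_rules("https://example.com/report"))
print(check_locator_rules("Smith J. AI teams. Example Org. 2024. https://example.com/report"))
```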

Deduplication phases

  1. URL / DOI — same locator → definitive duplicate
  2. Exact text — normalised match → definitive duplicate
  3. Title — normalised title segment match → definitive duplicate
  4. Fuzzy ≥ 0.70 — flagged review_needed, not auto-merged
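
The four phases can be sketched as a single pairwise classifier. This is an illustration under stated assumptions, not the deduplicator's code: similarity is computed with `difflib.SequenceMatcher` (the actual metric is not documented here), the title is assumed to be the second period-delimited segment of an AMA citation, and `locator`, `title_segment`, and `classify` are hypothetical names.

```python
import re
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case and reduce every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def locator(text: str):
    """Return the first DOI or URL in the citation, or None."""
    match = re.search(r"(?:doi:\s*|https?://)\S+", text, re.IGNORECASE)
    return match.group(0).rstrip(".,;").lower() if match else None


def title_segment(text: str) -> str:
    """Assumed layout: 'Authors. Title. Source ...'; take the second segment."""
    parts = text.split(". ")
    return normalise(parts[1]) if len(parts) > 1 else ""


def classify(a: str, b: str, threshold: float = 0.70):
    """Return 'locator', 'exact', 'title', 'review_needed', or None."""
    if locator(a) and locator(a) == locator(b):
        return "locator"
    if normalise(a) == normalise(b):
        return "exact"
    if title_segment(a) and title_segment(a) == title_segment(b):
        return "title"
    if SequenceMatcher(None, normalise(a), normalise(b)).ratio() >= threshold:
        return "review_needed"  # flagged for a human, never merged automatically
    return None


print(classify("Smith J. AI teams at work. Org. 2024.",
               "Smith J. AI teams at works. Org. 2024."))
```

The phases are tried in order, so a pair is only sent to human review when none of the three definitive checks fires.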

Inline citation replacement

Input formats supported (from source .docx):

| Format | Example input | Output |
| --- | --- | --- |
| Brackets (single) | [1] | [42] |
| Brackets (list) | [1,2] | [18,27] |
| Brackets (range) | [1-4] | [42-45] |
| Superscript (single) | ¹ | [42] |
| Superscript (list) | ²˒³ | [18,27] |

All output is plain square brackets. Superscript formatting is removed. Punctuation position is preserved ([1]. → [42].).
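
The replacement rules can be sketched on plain strings as follows. Several things here are assumptions made for illustration: superscripts are modelled as Unicode superscript digits (in a real .docx they are usually run formatting), runs of three or more consecutive mapped IDs are collapsed to a range, and citations containing an unmapped number are left untouched. `replace_inline` and its helpers are hypothetical names, not the API of inline_replacer.py.

```python
import re

SUP = "⁰¹²³⁴⁵⁶⁷⁸⁹"
SUP_TO_ASCII = str.maketrans(SUP, "0123456789")
# One combined pattern, so replaced output is never matched a second time.
CITATION_RE = re.compile(
    rf"\[(\d+(?:\s*[,\-–]\s*\d+)*)\]|([{SUP}]+(?:[˒,][{SUP}]+)*)"
)


def expand(spec: str) -> list[int]:
    """Turn '1,3-5' into [1, 3, 4, 5]."""
    numbers = []
    for part in re.split(r"\s*,\s*", spec.strip()):
        bounds = [int(x) for x in re.split(r"\s*[-–]\s*", part)]
        numbers.extend(range(bounds[0], bounds[-1] + 1))
    return numbers


def compress(ids) -> str:
    """Format sorted unique IDs, collapsing runs of 3+ consecutive numbers."""
    ids = sorted(set(ids))
    parts, i = [], 0
    while i < len(ids):
        j = i
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        if j - i >= 2:
            parts.append(f"{ids[i]}-{ids[j]}")
        else:
            parts.extend(str(n) for n in ids[i:j + 1])
        i = j + 1
    return "[" + ",".join(parts) + "]"


def replace_inline(text: str, id_map: dict[int, int]) -> str:
    """Rewrite bracket and superscript citations using local -> unique IDs."""
    def substitute(match: re.Match) -> str:
        spec = match.group(1)
        if spec is None:  # superscript form: convert to ASCII digits first
            spec = match.group(2).translate(SUP_TO_ASCII).replace("˒", ",")
        local = expand(spec)
        if any(n not in id_map for n in local):
            return match.group(0)  # unknown number: leave the text as it is
        return compress(id_map[n] for n in local)
    return CITATION_RE.sub(substitute, text)


id_map = {1: 42, 2: 43, 3: 44, 4: 45, 5: 18, 6: 27}
print(replace_inline("Teams adapt [1]. Others agree⁵˒⁶, see [1-4].", id_map))
```

Because only the citation token is substituted, any punctuation before or after it stays where it was.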

File naming

E-Human-AI Teams-2.docx  →  E-Human-AI Teams-2-1, E-Human-AI Teams-2-2, …
T-Society-1.docx          →  T-Society-1-1, T-Society-1-2, …
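
Read as a rule, the mapping above appears to derive per-citation source IDs from the file stem plus a running index. The sketch below encodes that reading. It is an interpretation rather than documented behaviour, and `source_ids` is a hypothetical helper.

```python
from pathlib import Path


def source_ids(filename: str, count: int) -> list[str]:
    """Build '<stem>-1' ... '<stem>-<count>' for a .docx file name."""
    stem = Path(filename).stem  # drops the .docx extension only
    return [f"{stem}-{index}" for index in range(1, count + 1)]


print(source_ids("E-Human-AI Teams-2.docx", 2))
```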

Structure

tstools/
├── __init__.py              paths + FILE_ORDER
├── main.py                  orchestrator
├── utils/
│   ├── utils.py             extract, clean, save, verify, registry load/save
│   ├── inline_replacer.py   bracket + superscript → bracket replacement
│   └── cite_from_url.py     URL → AMA citation (crawl4ai + OpenAI)
├── validation/
│   └── validator.py         AMA format checks
└── unique/
    └── deduplicator.py      four-phase incremental dedup
