Citation pipeline for CDTM trend seminars

Spring 26 — Citation Pipeline

Extract, validate, deduplicate, and replace inline citations across .docx trend-phase files.

Pipeline

flowchart TD
    A["FILE_ORDER (*.docx present)"] --> B["Extract + clean"]
    B --> C["Validate (AMA 11th)"]
    C --> D["Incremental dedup (check registry first)"]
    D --> E["Save CSV / JSON / dedup_map"]
    E --> F{"Ready?"}
    F -->|yes / FORCE| G["Replace inline citations"]
    F -->|no| H["Fix issues → re-run"]
    G --> I["output/*.docx"]

Quick start

uv venv
uv sync
python -m tstools.main

Paths are set in tstools/__init__.py; flags are set in tstools/main.py:

DO_REPLACE   = True   # skip inline replacement when False
FORCE_OUTPUT = False  # override verification gate
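
One way to read how these two flags interact with the "Ready?" gate in the flowchart is sketched below. This is an illustration, not the code in tstools/main.py: the function name `should_replace` and its `verified` argument are hypothetical, and it assumes that DO_REPLACE = False wins over FORCE_OUTPUT.

```python
DO_REPLACE = True     # skip inline replacement when False
FORCE_OUTPUT = False  # override verification gate


def should_replace(verified: bool,
                   do_replace: bool = DO_REPLACE,
                   force_output: bool = FORCE_OUTPUT) -> bool:
    """Return True when the inline replacement step should run.

    Assumption: replacement is skipped outright when do_replace is False;
    otherwise it runs if verification passed or the gate is forced.
    """
    if not do_replace:
        return False
    return verified or force_output
```

With the defaults shown, replacement runs only after verification passes; setting FORCE_OUTPUT = True would let it run with open issues.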

File order

FILE_ORDER in tstools/__init__.py defines the canonical processing order. Files do not need to exist yet — only files that are present on disk are processed. Any file on disk not listed in FILE_ORDER is appended at the end alphabetically.

FILE_ORDER = [
    "T-Technology-1.docx",
    "E-Blah-2.docx",
    # …
]
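
The ordering rule above can be sketched as follows. The helper names `resolve_order` and `present_docx` are hypothetical and only illustrate the described behaviour; the package may implement it differently.

```python
from pathlib import Path

FILE_ORDER = ["T-Technology-1.docx", "E-Blah-2.docx"]  # shortened example list


def resolve_order(present: list[str], file_order: list[str]) -> list[str]:
    """Listed files first (in FILE_ORDER order), then unlisted files A-Z."""
    on_disk = set(present)
    # Keep only listed files that actually exist, preserving canonical order.
    listed = [name for name in file_order if name in on_disk]
    # Files on disk that FILE_ORDER does not mention go last, alphabetically.
    extras = sorted(on_disk - set(file_order))
    return listed + extras


def present_docx(folder: Path) -> list[str]:
    """Names of the .docx files currently in `folder`."""
    return [p.name for p in folder.glob("*.docx")]
```

For example, `resolve_order(["Z-Extra-9.docx", "E-Blah-2.docx"], FILE_ORDER)` skips the missing T-Technology-1.docx and puts the unlisted file last.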

Incremental deduplication

dedup_map.json is a persistent registry — unique IDs are permanent once assigned.

On each run:

1. Citations already in the registry are carried forward unchanged.
2. New citations are matched against the full registry (URL/DOI → exact text → title), then against each other.
3. Registry matches reuse the existing unique ID; genuinely new citations get the next available number.
4. Existing numbers are never reassigned.
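
The numbering rule can be illustrated with a minimal sketch. It assumes the registry can be viewed as a mapping from a normalised match key to a unique ID; that shape, and the name `assign_ids`, are hypothetical and do not describe the actual layout of dedup_map.json.

```python
def assign_ids(registry: dict[str, int], new_keys: list[str]) -> dict[str, int]:
    """Return an updated copy of the registry; existing IDs are never changed."""
    updated = dict(registry)
    # "Next available number" is taken here as one past the highest ID in use.
    next_id = max(updated.values(), default=0) + 1
    for key in new_keys:
        if key in updated:
            continue  # registry match (or repeat within this run): reuse the ID
        updated[key] = next_id
        next_id += 1
    return updated
```

Because the function only ever adds keys, re-running it with the same input leaves every previously assigned number in place.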

Human review

Fuzzy matches (similarity ≥ 0.70) are flagged review_needed in dedup_map.json. After reviewing a flagged entry, record the decision:

Decision	Edit in `dedup_map.json`
Confirmed duplicate	Set `duplicate_of`, `match_type: "manual"`
Confirmed distinct	Set `review_flag: "confirmed_distinct"`
Manual ID assignment	Set `unique_id` to desired number, `match_type: "manual"`

The pipeline never overwrites entries that are already in the registry. A warning is printed at load time if two canonical entries share the same unique_id.
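
A check like the load-time warning could look roughly like the sketch below. The entry layout is an assumption made for illustration: each entry is a dict with `unique_id`, a hypothetical `source_id` field, and an optional `duplicate_of`, and an entry counts as canonical when `duplicate_of` is empty. The real registry format may differ.

```python
from collections import defaultdict


def find_id_conflicts(entries: list[dict]) -> dict[int, list[str]]:
    """Map each unique_id shared by 2+ canonical entries to their source IDs."""
    by_id = defaultdict(list)
    for entry in entries:
        if entry.get("duplicate_of"):
            continue  # not canonical: this entry points at another one
        by_id[entry["unique_id"]].append(entry["source_id"])
    # Only IDs claimed by more than one canonical entry are conflicts.
    return {uid: ids for uid, ids in by_id.items() if len(ids) > 1}
```

Running such a check after a manual edit is a cheap way to catch a mistyped `unique_id` before the next pipeline run.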

Outputs

File	Contents
`citations.csv`	All citations + validation issues + dedup metadata
`bibliography.csv`	UniqueID → Citation → SourceIDs
`dedup_map.json`	Persistent registry (append-only)
`output/issues.md`	Human work queue — validation issues by file
`output/*.docx`	Inline citations replaced, References section removed

Validation (AMA 11th ed.)

Validation runs on every file present on disk, on every run. Issues appear in output/issues.md until they are fixed in the source .docx.

#	Check	Code
—	Bare URL	`url_only`
1	Author	`missing_author`
2	Title	`missing_title`
3	Source	`missing_source`
4	Year	`missing_year`
5	Locator (DOI / URL / vol-page)	`missing_locator`
6	Accessed date present when the locator is a URL without a DOI	`missing_accessed`
7	No accessed date when a DOI is present	`unnecessary_accessed`
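
To make the bare-URL check and checks 6 and 7 concrete, here is a minimal sketch. The regular expressions and the function name `locator_issues` are assumptions for illustration, not the rules in validator.py; the sketch also assumes a bare URL reports only `url_only`.

```python
import re

DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")          # rough DOI shape
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
ACCESSED_RE = re.compile(r"\bAccessed\b", re.IGNORECASE)


def locator_issues(citation: str) -> list[str]:
    """Return issue codes for the bare-URL and accessed-date checks only."""
    text = citation.strip()
    if URL_RE.fullmatch(text):
        return ["url_only"]  # nothing but a URL: no other check is meaningful
    has_doi = bool(DOI_RE.search(text))
    has_url = bool(URL_RE.search(text))
    has_accessed = bool(ACCESSED_RE.search(text))
    issues = []
    if has_url and not has_doi and not has_accessed:
        issues.append("missing_accessed")      # check 6
    if has_doi and has_accessed:
        issues.append("unnecessary_accessed")  # check 7
    return issues
```

A citation with a URL and an accessed date, or with a DOI and no accessed date, returns an empty list under these two checks.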

Deduplication phases

1. URL / DOI: same locator → definitive duplicate
2. Exact text: normalised match → definitive duplicate
3. Title: normalised title segment match → definitive duplicate
4. Fuzzy ≥ 0.70: flagged review_needed, not auto-merged
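
The phase order can be sketched as a single comparison function. Several things here are assumptions: the dict fields `locator`, `text`, and `title`, the normalisation rule, and the use of `difflib.SequenceMatcher` as the similarity measure. The source does not say which metric deduplicator.py uses.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

FUZZY_THRESHOLD = 0.70


def normalise(text: str) -> str:
    """Lower-case and collapse every non-alphanumeric run to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def classify(new: dict, known: dict) -> Optional[str]:
    """Return the first phase in which `new` matches `known`, else None."""
    # Phase 1: same URL / DOI.
    if new.get("locator") and new["locator"].lower() == (known.get("locator") or "").lower():
        return "locator"
    # Phase 2: identical text after normalisation.
    if normalise(new["text"]) == normalise(known["text"]):
        return "exact"
    # Phase 3: identical title after normalisation.
    if new.get("title") and normalise(new["title"]) == normalise(known.get("title") or ""):
        return "title"
    # Phase 4: similar enough to flag for a human, never merged automatically.
    ratio = SequenceMatcher(None, normalise(new["text"]), normalise(known["text"])).ratio()
    return "review_needed" if ratio >= FUZZY_THRESHOLD else None
```

Checking the cheap, definitive phases first means the fuzzy comparison only runs on pairs that nothing else could decide.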

Inline citation replacement

Input formats supported (from source .docx):

Format	Example input	Output
Brackets — single	`[1]`	`[42]`
Brackets — list	`[1,2]`	`[18,27]`
Brackets — range	`[1-4]`	`[42-45]`
Superscript — single	⁵	`[42]`
Superscript — list	²˒³	`[18,27]`

All output is plain square brackets. Superscript formatting is removed. Punctuation position is preserved ([1]. → [42].).
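
The table can be reproduced on plain strings with the sketch below. It is an illustration only: it assumes superscripts appear as Unicode superscript characters, whereas a real .docx may mark them with run formatting that inline_replacer.py has to handle separately. The rule that runs of three or more consecutive IDs collapse to a range is also an assumption, and `id_map` is a hypothetical old-number to unique-ID mapping.

```python
import re

SUPERSCRIPTS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹˒", "0123456789,")
# One pattern for both styles, so output brackets are never re-processed.
CITATION_RE = re.compile(
    r"\[(\d+(?:\s*[-,]\s*\d+)*)\]"
    r"|([⁰¹²³⁴⁵⁶⁷⁸⁹]+(?:˒[⁰¹²³⁴⁵⁶⁷⁸⁹]+)*)"
)


def expand(spec: str) -> list[int]:
    """Turn '1,3-5' into [1, 3, 4, 5]."""
    numbers = []
    for part in spec.split(","):
        if "-" in part:
            start, end = (int(n) for n in part.split("-"))
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(part))
    return numbers


def render(ids: list[int]) -> str:
    """Sorted unique IDs in brackets; runs of 3+ become 'a-b' (assumed rule)."""
    ids = sorted(set(ids))
    parts, i = [], 0
    while i < len(ids):
        j = i
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        if j - i >= 2:
            parts.append(f"{ids[i]}-{ids[j]}")
        else:
            parts.extend(str(n) for n in ids[i:j + 1])
        i = j + 1
    return "[" + ",".join(parts) + "]"


def replace_inline(text: str, id_map: dict[int, int]) -> str:
    """Rewrite every bracket or superscript citation using id_map."""
    def swap(match: re.Match) -> str:
        spec = match.group(1) or match.group(2).translate(SUPERSCRIPTS)
        return render([id_map[n] for n in expand(spec)])
    return CITATION_RE.sub(swap, text)
```

Only the citation token itself is replaced, so surrounding punctuation keeps its position. Note that a plain-text pattern like this would also match unit superscripts such as m², which is one reason a production replacer needs more context than this sketch uses.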

File naming

Each extracted citation gets a source ID made of the file stem plus a per-file sequence number:

E-Human-AI Teams-2.docx  →  E-Human-AI Teams-2-1, E-Human-AI Teams-2-2, …
T-Society-1.docx          →  T-Society-1-1, T-Society-1-2, …
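
The naming pattern above amounts to appending a running number to the file stem. A minimal sketch, with the hypothetical helper name `source_ids`:

```python
from pathlib import Path


def source_ids(filename: str, count: int) -> list[str]:
    """Build '<stem>-1' … '<stem>-count' for one source file."""
    stem = Path(filename).stem  # drops only the final '.docx' suffix
    return [f"{stem}-{n}" for n in range(1, count + 1)]
```

Because only the extension is stripped, spaces and hyphens inside the file name carry over into the IDs unchanged.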

Structure

tstools/
├── __init__.py              paths + FILE_ORDER
├── main.py                  orchestrator
├── utils/
│   ├── utils.py             extract, clean, save, verify, registry load/save
│   ├── inline_replacer.py   bracket + superscript → bracket replacement
│   └── cite_from_url.py     URL → AMA citation (crawl4ai + OpenAI)
├── validation/
│   └── validator.py         AMA format checks
└── unique/
    └── deduplicator.py      four-phase incremental dedup

Release history

Version	Released
0.1.5	Apr 1, 2026
0.1.4	Mar 31, 2026
0.1.3	Mar 31, 2026
0.1.2	Mar 29, 2026
0.1.1 (this version)	Mar 29, 2026
0.1.0	Mar 29, 2026

Download files

Source Distribution

cdtm_tstools-0.1.1.tar.gz (17.2 MB)

Uploaded Mar 29, 2026 Source

Built Distribution

cdtm_tstools-0.1.1-py3-none-any.whl (20.6 kB)

Uploaded Mar 29, 2026 Python 3

File details

Details for the file cdtm_tstools-0.1.1.tar.gz.

File metadata

Download URL: cdtm_tstools-0.1.1.tar.gz
Upload date: Mar 29, 2026
Size: 17.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.19

File hashes

Hashes for cdtm_tstools-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`7ce490fa6743ed468beeb0beb34250fb2456940e3662ad8a0ff01df19273ebc3`
MD5	`29bb5c57515a56e591580fb8ec957b06`
BLAKE2b-256	`14c58c6243d380f4bd0cfa46d331a7efee468dd3a12ed306f67ee695ffcc23f4`

File details

Details for the file cdtm_tstools-0.1.1-py3-none-any.whl.

File metadata

Download URL: cdtm_tstools-0.1.1-py3-none-any.whl
Upload date: Mar 29, 2026
Size: 20.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.19

File hashes

Hashes for cdtm_tstools-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`10b3920d590f46171f59b2f78d35229de90ee2347a0d54e6af41627f8e356a24`
MD5	`691c1b5966dabc3a0e86713e617331af`
BLAKE2b-256	`77639a91b6c455235716cbce2944922c907d90694800356be21d632ce9e6b252`
