Ingest sources with proper citation — PDF, URL, media, Office, DJVU

CiteIndex

v0.12.0 — Ingest sources with proper citation. PDF, URL, media, Office, DJVU.

Deterministic citation extraction, Merkle-verified integrity, CJK-first OCR. Every claim is traced, verified, and cited — no hallucinations.

Install

# Using rye (recommended)
rye sync

# Or pip
pip install -e .

CLI

# Ingest a PDF
citeindex paper.pdf

# Ingest a URL
citeindex https://example.com/article

# Crawl and ingest all articles from a site
citeindex https://example.com/articles --all-url-article --crawl-depth 2

# Crawl and re-ingest only changed pages
citeindex https://example.com/articles --update-url-article

# Options
citeindex paper.pdf --llm ollama/qwen3 --type thesis --is-primary
citeindex paper.pdf --text-direction vertical --vertical-lang ch
citeindex scanned.pdf --lang auto --page-range "1-10"
citeindex paper.pdf --no-layout  # disable column/footnote detection

Python API

from citeindex import ingest, IngestionConfig

# Simple
result = ingest("paper.pdf")
print(result["status"])  # "ok"

# With config
config = IngestionConfig(
    llm_model="ollama/qwen3",
    text_direction="vertical",
    is_primary=True,
)
result = ingest("paper.pdf", corpus_root="my_corpus", config=config)

Ingestion Pipelines

CiteIndex automatically detects the input type and routes to the correct pipeline:

Digital PDF

PDF → GROBID (metadata) → MinerU (layout) → DSPy reconciliation
    → document structure (pages/columns/paragraphs/lines)
    → Merkle tree → store to corpus/

GROBID extracts metadata and references deterministically
MinerU performs layout analysis (columns, footnotes, tables)
DSPy reconciles the GROBID output, with pattern-based extraction as a fallback
The pipeline builds a section-hierarchical document structure that keeps the actual page numbers

Scanned PDF

PDF → OCRmyPDF (normalize) → PaddleOCR (vertical detect) → MinerU (layout)
    → Tesseract (text) → GROBID (citations) → document structure
    → Merkle tree → store to corpus/

OCRmyPDF normalizes scanned pages and adds a text layer
PaddleOCR detects CJK vertical text layouts
Tesseract provides OCR with auto-detected language
Supports --text-direction vertical for traditional Chinese/Japanese

URL Article

URL → Playwright/requests (fetch) → trafilatura (content)
    → Zotero (metadata) → CSL JSON → deterministic chunking
    → hashes → Merkle tree → store to corpus/

Playwright renders JavaScript-heavy pages (falls back to requests)
trafilatura extracts clean text with heading structure
Zotero extracts citation metadata (title, authors, date, DOI)
Discovers in-page citation guidance (若要引用 ["To cite"] / Cite this / etc.)
Supports batch crawling with --all-url-article and --update-url-article
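
The "deterministic chunking → hashes" step above can be pictured with a minimal sketch. The README does not document CiteIndex's actual chunking rules, so the blank-line paragraph split, the whitespace normalization, and the names `chunk_text` and `hash_chunks` are assumptions made for illustration only.

```python
import hashlib
import re


def chunk_text(text: str) -> list[str]:
    """Split text into paragraph chunks on blank lines (assumed rule)."""
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse internal whitespace so the same content always yields the same chunk.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


def hash_chunks(chunks: list[str]) -> list[str]:
    """Return the SHA-256 hex digest of each chunk, encoded as UTF-8."""
    return [hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks]


article = "First paragraph.\n\nSecond   paragraph,\nwrapped.\n\n"
chunks = chunk_text(article)
digests = hash_chunks(chunks)
```

Because neither function depends on time, randomness, or iteration order, re-running it on unchanged text gives identical digests, which is the property a changed-page check such as `--update-url-article` would need.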

Media

URL/File → yt-dlp (download) → ffmpeg (audio) → WhisperX (transcription)
        → pyannote (diarization, optional) → CSL JSON
        → chunking → hashes → Merkle tree → store to corpus/

yt-dlp downloads from YouTube, Vimeo, podcasts, etc.
WhisperX transcribes with word-level timestamps
pyannote performs speaker diarization (optional)
Supports audio (.mp3, .wav, .m4a) and video (.mp4, .mkv, .webm)

Office & DJVU

Office documents (.docx, .doc, .rtf, .odt, .pptx, .ppt, .odp) and DJVU (.djvu) are converted to PDF via LibreOffice/ddjvu, then routed to the digital or scanned PDF pipeline.
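
As a rough illustration of the conversion step, the sketch below builds the command lines for LibreOffice's headless converter and DjVuLibre's `ddjvu`. The exact invocations CiteIndex uses are not documented here, so treat the function name `pdf_conversion_command` and the chosen arguments as assumptions rather than the project's implementation.

```python
from pathlib import Path

# Office extensions listed in this README.
OFFICE_EXTS = {".docx", ".doc", ".rtf", ".odt", ".pptx", ".ppt", ".odp"}


def pdf_conversion_command(source: str, out_dir: str) -> list[str]:
    """Build (but do not run) a command that converts `source` to PDF."""
    src = Path(source)
    ext = src.suffix.lower()
    if ext in OFFICE_EXTS:
        # LibreOffice headless conversion writes <stem>.pdf into out_dir.
        return ["soffice", "--headless", "--convert-to", "pdf",
                "--outdir", out_dir, src.as_posix()]
    if ext == ".djvu":
        # ddjvu takes the output format, the input file, and the output path.
        target = Path(out_dir) / (src.stem + ".pdf")
        return ["ddjvu", "-format=pdf", src.as_posix(), target.as_posix()]
    raise ValueError(f"not an Office or DJVU file: {source}")
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; the resulting PDF would then go through the digital or scanned PDF pipeline as described above.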

Configuration Reference

Option	CLI Flag	Default	Description
`llm_model`	`--llm`	`ollama/qwen3`	LLM model for citation extraction
`text_direction`	`--text-direction`, `-td`	`horizontal`	`horizontal`, `auto`, or `vertical`
`vertical_lang`	`--vertical-lang`	`ch`	CJK language: `ch` (Chinese) or `japan` (Japanese)
`lang`	`--lang`, `-l`	`auto`	OCR language (auto-detect or Tesseract code)
`page_range`	`--page-range`, `-p`	`1-5, -3`	Pages to extract (e.g. `"1-10"`, `"1-5, -3"`)
`doc_type_override`	`--type`, `-t`	auto	`book`, `thesis`, `journal`, or `bookchapter`
`use_layout_analysis`	`--no-layout`	`True`	Column/footnote detection; pass `--no-layout` to disable it
`is_primary`	`--is-primary`	`False`	Line-level granularity (vs paragraph-level)
`use_pageindex`	`--use-pageindex`	`False`	LLM-driven section hierarchy (requires Ollama)
`pageindex_model`	`--pageindex-model`	`ollama/qwen3.5:cloud`	LLM for PageIndex tree building
`citation_style`	(API only)	`chicago-author-date`	CSL citation style for output
`corpus_root`	`--corpus-root`	`corpus`	Output directory for ingested artifacts
`schema_version`	`--schema-version`	`1.0.0`	Output schema version tag
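
The `page_range` syntax is not spelled out in this README. The sketch below shows one plausible reading, assuming that `"1-10"` is a 1-based inclusive range and that a leading minus such as `"-3"` means "the last three pages". Both rules, and the helper name `resolve_page_range`, are assumptions; check the CLI help before relying on them.

```python
def resolve_page_range(spec: str, total_pages: int) -> list[int]:
    """Expand a spec such as "1-5, -3" into sorted 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if part.startswith("-"):
            # Assumed meaning: "-N" selects the last N pages.
            count = int(part[1:])
            start, end = total_pages - count + 1, total_pages
        elif "-" in part:
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
        else:
            start = end = int(part)
        # Clamp to the document so out-of-range specs do not fail.
        pages.update(range(max(start, 1), min(end, total_pages) + 1))
    return sorted(pages)
```

Under these assumptions, the default `1-5, -3` on a 20-page document would select pages 1 to 5 and 18 to 20.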

Output

Each ingestion produces a corpus folder (e.g., corpus/Author_2024_Title/) containing:

File	Description
`csl.json`	Citation metadata (CSL-JSON with `ci_*` extensions: `content_hash`, `merkle_root`, `source_type`, `ingestion_timestamp`)
`document.json`	Structured document tree (PageIndex) — sections, pages, paragraphs, lines
`merkle.json`	SHA-256 Merkle tree for integrity verification
`ingestion_output.json`	Full ingestion result with all pipeline outputs
`library.md`	Human-readable citation with extracted text and footnotes
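
To make the role of `merkle.json` concrete, here is a minimal sketch of a SHA-256 Merkle root over a list of text units. How CiteIndex orders its leaves, pairs nodes, and handles odd-sized levels is not documented in this README; the duplicate-last-node convention and the function name `merkle_root` are assumptions.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(leaves: list[bytes]) -> str:
    """Compute a SHA-256 Merkle root over the given leaf payloads."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [sha256_hex(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumed convention: duplicate the last node on odd-sized levels.
            level.append(level[-1])
        # Hash each adjacent pair of hex digests to form the next level up.
        level = [
            sha256_hex((level[i] + level[i + 1]).encode("ascii"))
            for i in range(0, len(level), 2)
        ]
    return level[0]


paragraphs = [b"First paragraph.", b"Second paragraph.", b"Third paragraph."]
root = merkle_root(paragraphs)
tampered = merkle_root([b"First paragraph.", b"Second paragraph!", b"Third paragraph."])
```

Changing a single byte in any leaf changes the root, so comparing a recomputed root against a stored one is enough to detect that the extracted text was altered.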

Return Value

The ingest() function returns a dict:

{
    "status": "ok",                    # "ok" or "blocked"
    "document_path": "corpus/Author_2024_Title",
    "standardized_csl_json": { ... },  # Full CSL-JSON with ci_ extensions
    "sub_pipeline_outputs": { ... },   # Raw pipeline results
    "ingestion_log_entry": { ... },     # Log entry with merkle_root
}

# On failure:
{
    "status": "blocked",
    "stage": "detect_resource_type",
    "error_code": "unsupported_input",
    "error_message": "Unsupported input: ...",
    "next_action": "Provide PDF, URL, or media file",
}
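
A caller will usually branch on `status`. The helper below is a small sketch that works on plain dicts shaped like the two examples above; `describe_result` is a hypothetical name and is not part of the CiteIndex API.

```python
def describe_result(result: dict) -> str:
    """Turn an ingest()-style result dict into a one-line summary."""
    if result.get("status") == "ok":
        return f"ingested -> {result['document_path']}"
    # Blocked results carry the failing stage, an error code, and a hint.
    return (
        f"blocked at {result.get('stage')} "
        f"[{result.get('error_code')}]: {result.get('error_message')} "
        f"(next: {result.get('next_action')})"
    )


ok = {"status": "ok", "document_path": "corpus/Author_2024_Title"}
blocked = {
    "status": "blocked",
    "stage": "detect_resource_type",
    "error_code": "unsupported_input",
    "error_message": "Unsupported input: ...",
    "next_action": "Provide PDF, URL, or media file",
}
print(describe_result(ok))
print(describe_result(blocked))
```

Checking `status` before reading `document_path` avoids a `KeyError`, since the blocked form shown above does not include that key.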

Supported Formats

Format	Extension / Protocol
Digital PDF	`.pdf` (with embedded text)
Scanned PDF	`.pdf` (image-based, OCR applied)
URL Article	`http://` / `https://`
Media	`.mp3`, `.wav`, `.m4a`, `.mp4`, `.mkv`, `.webm`
Office	`.docx`, `.doc`, `.rtf`, `.odt`, `.pptx`, `.ppt`, `.odp`
DJVU	`.djvu`
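
The table above suggests a first-pass routing by protocol and extension, sketched below. This is an illustration, not CiteIndex's detector: the name `guess_pipeline` and the returned labels are assumptions. The extension alone cannot tell a digital PDF from a scanned one, and it cannot tell an article URL from a media URL that would go to yt-dlp.

```python
from pathlib import Path
from typing import Optional

MEDIA_EXTS = {".mp3", ".wav", ".m4a", ".mp4", ".mkv", ".webm"}
OFFICE_EXTS = {".docx", ".doc", ".rtf", ".odt", ".pptx", ".ppt", ".odp"}


def guess_pipeline(source: str) -> Optional[str]:
    """Return a coarse pipeline label for `source`, or None if unsupported."""
    lowered = source.lower()
    if lowered.startswith(("http://", "https://")):
        # Could still be a media URL; that needs a closer look than the scheme.
        return "url"
    ext = Path(lowered).suffix
    if ext == ".pdf":
        # Digital vs scanned has to be decided by inspecting the file contents.
        return "pdf"
    if ext in MEDIA_EXTS:
        return "media"
    if ext in OFFICE_EXTS:
        return "office"
    if ext == ".djvu":
        return "djvu"
    return None
```

A `None` result corresponds to the `unsupported_input` case shown in the Return Value section.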

License

MIT

Release history

0.12.5

May 4, 2026

0.12.4

May 4, 2026

0.12.2

May 4, 2026

0.12.1

May 2, 2026

This version

0.12.0

Apr 27, 2026

Download files

Source Distribution

citeindex-0.12.0.tar.gz (486.7 kB)

Uploaded Apr 27, 2026 Source

Built Distribution

citeindex-0.12.0-py3-none-any.whl (133.0 kB)

Uploaded Apr 27, 2026 Python 3

File details

Details for the file citeindex-0.12.0.tar.gz.

File metadata

Download URL: citeindex-0.12.0.tar.gz
Upload date: Apr 27, 2026
Size: 486.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.13.1

File hashes

Hashes for citeindex-0.12.0.tar.gz
Algorithm	Hash digest
SHA256	`af5532e04b3aa4f1bb8a6f8ef788c669778bd8d993ea7e0a6c0f213d794f4045`
MD5	`d5484b8c9c2897297dfecc23e428e85e`
BLAKE2b-256	`646550a015d10d2d2e550a78cfada94ee5a1c83e1f9c9c4e1077034dd367a1ef`

File details

Details for the file citeindex-0.12.0-py3-none-any.whl.

File metadata

Download URL: citeindex-0.12.0-py3-none-any.whl
Upload date: Apr 27, 2026
Size: 133.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.13.1

File hashes

Hashes for citeindex-0.12.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`658916ba06e784c8e3085b044240b098657f6563d72ada3bb6755ec99c37f3db`
MD5	`ca06f4a6748b2d0fcaa7ea737aba3e29`
BLAKE2b-256	`7982769814247214f2404b365d792e1366c26b153f946062984a6f0a87a27553`
