# CiteIndex

v0.12.0 — Ingest sources with proper citation: PDF, URL, media, Office, DJVU.
Deterministic citation extraction, Merkle-verified integrity, CJK-first OCR. Every claim is traced, verified, and cited — no hallucinations.
## Install

```bash
# Using uv (recommended)
uv pip install citeindex

# Or pip
pip install citeindex
```
## CLI

```bash
# Ingest a PDF
citeindex paper.pdf

# Ingest a URL
citeindex https://example.com/article

# Crawl and ingest all articles from a site
citeindex https://example.com/articles --all-url-article --crawl-depth 2

# Crawl and re-ingest only changed pages
citeindex https://example.com/articles --update-url-article

# Options
citeindex paper.pdf --llm ollama/qwen3 --type thesis --is-primary
citeindex paper.pdf --text-direction vertical --vertical-lang ch
citeindex scanned.pdf --lang auto --page-range "1-10"
citeindex paper.pdf --no-layout   # disable column/footnote detection
citeindex -v paper.pdf            # verbose/debug logging
```
## Python API

```python
from citeindex import ingest, IngestionConfig, IngestionFailure, PipelineResult

# Simple
result = ingest("paper.pdf")
print(result["status"])  # "ok"

# With config
config = IngestionConfig(
    llm_model="ollama/qwen3",
    text_direction="vertical",
    is_primary=True,
)
result = ingest("paper.pdf", corpus_root="my_corpus", config=config)
```
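On success the returned dict carries the output paths and metadata listed under Return Value below; on failure it comes back with `status == "blocked"`. A minimal handling sketch built only on that documented shape:

```python
result = ingest("paper.pdf")
if result["status"] == "ok":
    print("stored at:", result["document_path"])
    print("library file:", result["library_md_path"])
else:  # "blocked"; fields per the Return Value section below
    print(f"{result['stage']}: {result['error_message']}")
    print("next action:", result["next_action"])
```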
## Ingestion Pipelines

CiteIndex automatically detects the input type and routes it to the correct pipeline; the individual pipelines are described below.
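As an illustration of the digital-vs-scanned decision (a sketch of the idea, not CiteIndex's internal code), a PDF can be classified by checking for an embedded text layer with PyMuPDF:

```python
import fitz  # PyMuPDF

def classify_pdf(path: str, sample_pages: int = 5) -> str:
    """Heuristic: a digital PDF has extractable text on most sampled pages."""
    doc = fitz.open(path)
    pages = min(sample_pages, doc.page_count)
    with_text = sum(1 for i in range(pages) if doc[i].get_text().strip())
    doc.close()
    return "digital" if with_text >= pages / 2 else "scanned"
```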
### Digital PDF

```
PDF → PyMuPDF (text + images) → GROBID / DSPy citation enrichment
    → page-paragraph document structure
    → PageIndex tree (optional, LLM-driven)
    → section_tree + heading injection for document.json / library markdown
    → Merkle tree → store to corpus/
```
- GROBID extracts metadata and references deterministically
- PyMuPDF extracts page text directly from digital PDFs and pulls embedded images
- DSPy reconciles GROBID output with pattern extraction as fallback
- Builds page-based document structure and augments it with PageIndex section headings
- PageIndex builds LLM-driven section hierarchy, persists it to corpus, and feeds library markdown headings
### Scanned PDF

```
PDF → OCRmyPDF (normalize) → PaddleOCR (vertical detect) → MinerU (layout)
    → Tesseract (text) → GROBID (citations) → document structure
    → Merkle tree → store to corpus/
```
- OCRmyPDF normalizes and adds text layer to scanned pages
- PaddleOCR detects CJK vertical text layouts
- Tesseract provides OCR with auto-detected language
- Supports `--text-direction vertical` for traditional Chinese/Japanese
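The same vertical-text options are available from the Python API; a minimal sketch using only fields listed in the Configuration Reference below:

```python
from citeindex import ingest, IngestionConfig

# Vertical traditional-Chinese scan, OCR pages 1-10 only
config = IngestionConfig(
    text_direction="vertical",
    vertical_lang="ch",   # or "japan"
    page_range="1-10",
)
result = ingest("scanned.pdf", config=config)
```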
### URL Article

```
URL → Playwright/requests (fetch) → trafilatura/readability (content)
    → Zotero (metadata) → in-page citation guidance (regex → DSPy fallback)
    → section-hierarchical paragraphs → PageIndex tree (optional)
    → hashes → Merkle tree → store to corpus/
```
- Playwright renders JavaScript-heavy pages (fallback to requests)
- trafilatura extracts clean markdown with heading structure (fallback to readability-lxml)
- Zotero extracts citation metadata via translation-server (title, authors, date, DOI)
- Discovers in-page citation guidance: 若要引用 / 引用格式 / Cite this / Zitierweise / Pour citer
- Parses citation strings with regex first, DSPy fallback for unparseable formats
- Citation guidance overrides Zotero/trafilatura metadata (more authoritative)
- Supports batch crawling with `--all-url-article` and `--update-url-article`
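To illustrate the regex-first step (the actual patterns CiteIndex uses are not documented here, so this is a sketch of the idea), page text can be scanned for the guidance markers listed above before falling back to DSPy:

```python
import re

# Markers from the pipeline description; the pattern itself is illustrative
CITE_MARKERS = re.compile(
    r"(若要引用|引用格式|Cite this|Zitierweise|Pour citer)", re.IGNORECASE
)

def find_citation_guidance(page_text: str) -> str | None:
    """Return the paragraph containing a citation-guidance marker, if any."""
    for para in page_text.split("\n\n"):
        if CITE_MARKERS.search(para):
            return para.strip()
    return None  # caller falls back to DSPy parsing / Zotero metadata
```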
### Media

```
URL/File → yt-dlp (download) → ffmpeg (audio) → WhisperX (transcription)
    → pyannote (diarization, optional) → CSL JSON
    → chunking → hashes → Merkle tree → store to corpus/
```
- yt-dlp downloads from YouTube, Vimeo, podcasts, etc.
- WhisperX transcribes with word-level timestamps
- pyannote speaker diarization (optional)
- Supports audio (`.mp3`, `.wav`, `.m4a`) and video (`.mp4`, `.mkv`, `.webm`)
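Media inputs go through the same `ingest()` entry point as other sources, per the routing description above. A sketch that ingests a video URL and reads the documented `transcript.json` artifact; the segment field names inside that file are an assumption:

```python
import json
from pathlib import Path

from citeindex import ingest

result = ingest("https://www.youtube.com/watch?v=...")
if result["status"] == "ok":
    transcript_path = Path(result["document_path"]) / "transcript.json"
    transcript = json.loads(transcript_path.read_text())
    # "segments" is illustrative; inspect the file for the real schema
    for seg in transcript.get("segments", [])[:3]:
        print(seg)
```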
### Office & DJVU

Office documents (`.docx`, `.doc`, `.rtf`, `.odt`, `.pptx`, `.ppt`, `.odp`) are converted to PDF via LibreOffice, and DJVU files (`.djvu`) via `ddjvu`, then routed to the digital or scanned PDF pipeline.
## Citation Enrichment Cascade
For PDF inputs, CiteIndex enriches metadata through a priority cascade:
1. GROBID — deterministic metadata + references (primary)
2. LLM extraction — DSPy-based citation parsing (fallback)
3. PDF metadata — basic file metadata only (last resort)
For web pages with ambiguous metadata, a local Perplexica search API can fill missing citation fields (title, author, publisher).
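A sketch of the cascade idea, filling each citation field from the highest-priority source that provides it (the function and field names are illustrative, not CiteIndex's internal API):

```python
def merge_citation_fields(grobid: dict, llm: dict, pdf_meta: dict) -> dict:
    """Take each field from the first source, in priority order, that has it."""
    merged = {}
    for field in ("title", "author", "issued", "DOI"):
        for source in (grobid, llm, pdf_meta):  # priority order
            if source.get(field):
                merged[field] = source[field]
                break
    return merged
```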
## Configuration Reference

| Option | CLI Flag | Default | Description |
|---|---|---|---|
| `llm_model` | `--llm` | `ollama/deepseek-v4-flash:cloud` | LLM model (`ollama/name` or `gemini/name`) |
| `text_direction` | `--text-direction`, `-td` | `horizontal` | `horizontal`, `auto`, or `vertical` |
| `vertical_lang` | `--vertical-lang` | `ch` | CJK language: `ch` (Chinese) or `japan` |
| `lang` | `--lang`, `-l` | `auto` | OCR language (auto-detect or Tesseract code) |
| `page_range` | `--page-range`, `-p` | `1-5, -3` | Pages to extract (e.g. `"1-10"`, `"1-5, -3"`) |
| `doc_type_override` | `--type`, `-t` | `auto` | `book`, `thesis`, `journal`, or `bookchapter` |
| `use_layout_analysis` | `--no-layout` | `True` | Disable column/footnote detection |
| `is_primary` | `--is-primary` | `False` | Line-level granularity (vs paragraph-level) |
| `use_pageindex` | `--no-pageindex` | `True` | PageIndex hierarchy is enabled by default; pass `--no-pageindex` to disable it |
| `pageindex_model` | `--pageindex-model` | `ollama/deepseek-v4-flash:cloud` | LLM for PageIndex tree building |
| `citation_style` | (API only) | `chicago-author-date` | CSL citation style for output |
| `corpus_root` | `--corpus-root` | `corpus` | Output directory for ingested artifacts |
| `schema_version` | `--schema-version` | `1.0.0` | Output schema version tag |
| (CLI only) | `--crawl-depth` | `2` | Max BFS crawl depth for `--all-url-article` |
| (CLI only) | `--crawl-max-pages` | `100` | Max pages for `--all-url-article` |
| (CLI only) | `--verbose`, `-v` | off | Enable verbose/debug logging |
## Output

Each ingestion produces a corpus folder (e.g., `corpus/Author_2024_Title/`) and a companion library markdown file:
### Corpus artifacts (`corpus/Author_2024_Title/`)

| File | Description |
|---|---|
| `csl.json` | Citation metadata (CSL-JSON with `ci_*` extensions: `content_hash`, `merkle_root`, `source_type`, `ingestion_timestamp`) |
| `document.json` | Structured document tree — pages, paragraphs, and `section_tree` for URL articles and PageIndex-augmented PDFs |
| `pageindex_tree.json` | Persisted CiteIndex/PageIndex hierarchy with page ranges and summaries, written when PageIndex runs |
| `merkle.json` | SHA-256 Merkle tree for integrity verification |
| `transcript.json` | Timestamped transcript with speaker segments (media only) |
| `media_metadata.json` | Source media metadata (media only) |
| `ingestion_output.json` | Full ingestion result with all pipeline outputs |
### Library markdown (`library/Author_2024_Title.md`)
Human-readable markdown with YAML front-matter, inline citation, page/section/timestamp headers with CSL-level detail, full extracted text, and footnotes. When PageIndex is available, digital PDFs emit section headings into the markdown instead of only flat page labels. Written to library/ (sibling of corpus/).
### Ingestion log (`corpus/ingestion_log.jsonl`)

Appended on every ingestion with `input_ref`, `resource_type`, `csl_id`, `merkle_root`, and `ingestion_timestamp`.
### URL content hashes (`corpus/_url_content_hashes.json`)

Persisted URL → content-hash mapping used by `--update-url-article` for change detection.
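As an integrity-check example, an artifact's SHA-256 can be recomputed and compared against `merkle.json`; the internal layout of that file is an assumption here, so treat this as a sketch:

```python
import hashlib
import json
from pathlib import Path

doc_dir = Path("corpus/Author_2024_Title")
merkle = json.loads((doc_dir / "merkle.json").read_text())

# Recompute one artifact's content hash and compare it with the recorded
# leaf hash. The "leaves" key is illustrative; check the file for the
# actual schema.
digest = hashlib.sha256((doc_dir / "document.json").read_bytes()).hexdigest()
recorded = merkle.get("leaves", {}).get("document.json")
print("intact" if digest == recorded else "mismatch or unknown layout")
```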
## Return Value

The `ingest()` function returns a dict:

```python
{
    "schema_version": "1.0.0",
    "status": "ok",                    # "ok" or "blocked"
    "document_path": "corpus/Author_2024_Title",
    "standardized_csl_json": { ... },  # Full CSL-JSON with ci_* extensions
    "sub_pipeline_outputs": { ... },   # Raw pipeline results
    "ingestion_log_entry": { ... },    # Log entry with merkle_root
    "library_md_path": "library/Author_2024_Title.md",
}

# On failure:
{
    "status": "blocked",
    "source_id": "unknown",
    "stage": "detect_resource_type",
    "error_code": "unsupported_input",
    "error_message": "Unsupported input: ...",
    "next_action": "Provide PDF, URL, or media file",
}
```
## Batch URL Ingestion Return

The `ingest_all_urls()` method (triggered by `--all-url-article` / `--update-url-article`) returns:

```python
{
    "status": "ok",
    "root_url": "https://example.com/articles",
    "discovered": 25,  # total article URLs found
    "ingested": 20,    # newly ingested
    "updated": 2,      # re-ingested (content changed)
    "skipped": 3,      # unchanged (--update-url-article only)
    "failed": 0,       # errors
    "results": [       # per-URL status list
        {"url": "...", "status": "ok"},
        {"url": "...", "status": "unchanged"},
        ...
    ],
}
```
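A short consumer of this documented return shape (the README does not show how `ingest_all_urls()` is obtained from the public API, so only the result handling is sketched):

```python
def summarize_batch(batch: dict) -> None:
    """Print a one-line summary plus any per-URL failures."""
    print(
        f"{batch['ingested']} ingested, {batch['updated']} updated, "
        f"{batch['skipped']} skipped, {batch['failed']} failed "
        f"of {batch['discovered']} discovered"
    )
    for entry in batch["results"]:
        if entry["status"] not in ("ok", "unchanged"):
            print("failed:", entry["url"], entry["status"])
```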
## Supported Formats
| Format | Extension / Protocol |
|---|---|
| Digital PDF | .pdf (with embedded text) |
| Scanned PDF | .pdf (image-based, OCR applied) |
| URL Article | http:// / https:// |
| Media | .mp3, .wav, .m4a, .mp4, .mkv, .webm |
| Office | .docx, .doc, .rtf, .odt, .pptx, .ppt, .odp |
| DJVU | .djvu |
## Citation
If you use CiteIndex in your work, please cite it:
APA:
ajia. (2025). CiteIndex: Ingest sources with proper citation (Version 0.12.0). MIT. https://github.com/ajia/citeindex
BibTeX:

```bibtex
@software{citeindex2025,
  author  = {ajia},
  title   = {CiteIndex: Ingest sources with proper citation},
  version = {0.12.0},
  year    = {2025},
  license = {MIT},
  url     = {https://github.com/ajia/citeindex},
}
```
## License
MIT