Detect hallucinated references in academic papers
Python Bindings
Python bindings for the Rust hallucinator engine, powered by PyO3 and Maturin. Extract references from academic PDFs and validate them against 10 academic databases — all from Python, with Rust-native performance.
from hallucinator import PdfExtractor, Validator, ValidatorConfig
# Extract references from a PDF
ext = PdfExtractor()
result = ext.extract("paper.pdf")
print(f"Found {len(result)} references")
# Validate them against academic databases
config = ValidatorConfig()
validator = Validator(config)
results = validator.check(result.references)
for r in results:
    print(f"[{r.status}] {r.title}")
Installation
From PyPI (recommended)
Pre-compiled wheels for Python 3.12 — no Rust toolchain needed:
pip install hallucinator
Available platforms: Linux (x86_64), macOS (x86_64 + Apple Silicon), Windows (x86_64).
From source
Requires Python 3.9+ and a Rust toolchain (rustup.rs).
cd hallucinator-rs
# Using uv (recommended)
uv venv && source .venv/bin/activate # or .venv\Scripts\activate on Windows
uv pip install maturin
maturin develop --release
# Or with pip
pip install maturin
maturin develop --release
After installation, the hallucinator package is importable:
>>> import hallucinator
>>> hallucinator.PdfExtractor()
PdfExtractor(...)
PDF Extraction
Quick start
from hallucinator import PdfExtractor
ext = PdfExtractor()
result = ext.extract("paper.pdf")
for ref in result.references:
    print(ref.title)
    print(f" Authors: {', '.join(ref.authors)}")
    if ref.doi:
        print(f" DOI: {ref.doi}")
PdfExtractor
The main entry point for extraction. Wraps the Rust engine and adds support for custom Python segmentation strategies.
ext = PdfExtractor()
# Full pipeline: PDF file → ExtractionResult
result = ext.extract("paper.pdf")
# Full pipeline on already-extracted text
result = ext.extract_from_text(text)
# Individual pipeline stages
text = ext.extract_text("paper.pdf") # Step 1: PDF → raw text
section = ext.find_section(text) # Step 2: locate references section
segments = ext.segment(section) # Step 3: split into individual refs
ref = ext.parse_reference(segments[0]) # Step 4: parse a single reference
Configuration
Override regex patterns and thresholds to handle non-standard paper formats.
| Property | Default | Description |
|---|---|---|
| section_header_regex | Matches "References", "Bibliography", etc. | Regex to find the start of the references section |
| section_end_regex | Matches "Appendix", "Acknowledgments", etc. | Regex to find the end of the references section |
| fallback_fraction | 0.7 | Fraction of document to skip when no header found (0.7 = use last 30%) |
| ieee_segment_regex | Matches [1], [2], etc. | Regex for IEEE-style reference numbering |
| numbered_segment_regex | Matches 1., 2., etc. | Regex for numbered-list references |
| fallback_segment_regex | Double newline | Fallback segmentation when no numbering detected |
| min_title_words | 4 | Minimum words in a title (shorter titles are skipped) |
| max_authors | 15 | Cap on extracted author count per reference |
ext = PdfExtractor()
# Handle Spanish papers
ext.section_header_regex = r"(?i)\n\s*(?:Bibliografía|Referencias)\s*\n"
# Accept shorter titles
ext.min_title_words = 3
# Custom venue cutoff (don't include journal name in title)
ext.add_venue_cutoff_pattern(r"(?i)\.\s*Nature\b.*$")
# Preserve compound words across line breaks
ext.add_compound_suffix("powered") # "AI- powered" → "AI-powered"
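Before assigning a custom pattern, it can help to try it on a small text sample. The sketch below uses Python's re module as a stand-in for the Rust regex engine. That is an assumption: the two engines accept similar but not identical syntax, so a pattern that works here may still need adjusting. The sample text is made up.

```python
import re

# Same pattern as assigned to ext.section_header_regex above.
SPANISH_HEADER = r"(?i)\n\s*(?:Bibliografía|Referencias)\s*\n"

# Made-up sample standing in for the raw text of a paper.
sample = (
    "Conclusiones\nTexto final.\n\n"
    "REFERENCIAS\n"
    "[1] Autor, A. Un título de ejemplo.\n"
)

match = re.search(SPANISH_HEADER, sample)
# Everything after the header is what would be handed on to segmentation.
section = sample[match.end():] if match else None
print(section)
```

If the pattern does not match the sample, section is None, which is a quick signal that the header regex needs another look before running a full extraction.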
Custom segmentation strategies
For reference formats that no regex can handle, register a Python callable:
import re
def paren_segmenter(text: str) -> list[str] | None:
    """Split references numbered as (1), (2), (3)..."""
    parts = re.split(r'\n\s*\(\d+\)\s+', text)
    parts = [p.strip() for p in parts if p.strip()]
    return parts if len(parts) >= 3 else None
ext = PdfExtractor()
ext.add_segmentation_strategy(paren_segmenter)
result = ext.extract("unusual_paper.pdf")
Strategies are tried in registration order. Return None (or fewer than 3 items) to fall through to the next strategy, then to Rust built-ins.
ext.add_segmentation_strategy(try_format_a)
ext.add_segmentation_strategy(try_format_b)
# Falls through: try_format_a → try_format_b → Rust built-ins
ext.clear_segmentation_strategies() # Remove all custom strategies
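The fall-through rule described above can be pictured with a small pure-Python model. This is only a sketch of the documented behaviour, not the engine's code: first_accepted and blank_line_segmenter are hypothetical names, and the blank-line splitter merely stands in for the Rust built-ins.

```python
import re
from typing import Callable, Optional

Segmenter = Callable[[str], Optional[list[str]]]


def paren_segmenter(text: str) -> Optional[list[str]]:
    """Split references numbered as (1), (2), (3)..."""
    parts = re.split(r"\n\s*\(\d+\)\s+", text)
    parts = [p.strip() for p in parts if p.strip()]
    return parts if len(parts) >= 3 else None


def blank_line_segmenter(text: str) -> Optional[list[str]]:
    """Hypothetical stand-in for a built-in fallback: split on blank lines."""
    parts = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return parts or None


def first_accepted(text: str, strategies: list[Segmenter]) -> Optional[list[str]]:
    """Return the output of the first strategy that yields at least 3 segments."""
    for strategy in strategies:
        parts = strategy(text)
        if parts is not None and len(parts) >= 3:
            return parts
    return None


text = "\n(1) Smith. First paper.\n(2) Jones. Second paper.\n(3) Lee. Third paper."
print(first_accepted(text, [paren_segmenter, blank_line_segmenter]))
```

Running a strategy against a few sample reference sections like this, before registering it, makes it easier to see which inputs it claims and which it passes on.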
Archive extraction
Extract references from ZIP or tar.gz archives containing PDFs, BBL, or BIB files:
ext = PdfExtractor()
for entry in ext.extract_archive("papers.zip"):
    print(f"{entry.filename} ({entry.file_type})")
    if entry.result:  # PDF: full extraction
        for ref in entry.result.references:
            print(f" {ref.title}")
    elif entry.content:  # BBL/BIB: raw text
        print(f" {len(entry.content)} chars")
Use max_size_bytes to cap total extracted size (0 = unlimited):
for entry in ext.extract_archive("papers.tar.gz", max_size_bytes=100_000_000):
    ...
Check the iterator's warnings for size-limit messages after iteration.
The is_archive_path() helper detects supported formats:
from hallucinator import is_archive_path
is_archive_path("papers.zip") # True
is_archive_path("papers.tar.gz") # True
is_archive_path("paper.pdf") # False
ArchiveEntry
Each item yielded from archive iteration:
entry.filename # str — original filename within the archive
entry.file_type # str — "pdf", "bbl", or "bib"
entry.result # ExtractionResult | None — populated for PDFs
entry.content # str | None — populated for BBL/BIB files
ExtractionResult
Returned by extract() and extract_from_text().
result = ext.extract("paper.pdf")
result.references # list[Reference]
len(result) # number of parsed references
# Skip statistics
result.skip_stats.total_raw # total raw segments before filtering
result.skip_stats.url_only # skipped: non-academic URLs only
result.skip_stats.short_title # skipped: title too short
result.skip_stats.no_title # references with no parseable title
result.skip_stats.no_authors # references with no parseable authors
Reference
A parsed reference with structured fields.
ref.raw_citation # str — the cleaned-up citation text
ref.title # str | None — extracted title
ref.authors # list[str] — author names
ref.doi # str | None — DOI if found
ref.arxiv_id # str | None — arXiv ID if found
ref.original_number # int — 1-based position in the PDF (0 for manually created refs)
ref.skip_reason # str | None — why this ref was skipped ("url_only", "short_title"), or None
Creating references manually
You can create Reference objects directly — without PDF extraction — for batch validation of structured data (e.g. from a CSV, BibTeX parser, or API response):
from hallucinator import Reference
ref = Reference("Attention Is All You Need", authors=["Vaswani", "Shazeer"])
ref = Reference("BERT", doi="10.18653/v1/N19-1423")
ref = Reference("My Paper", authors=["Smith"], arxiv_id="2301.00001")
Constructor signature:
Reference(
    title: str,
    authors: list[str] = [],
    doi: str | None = None,
    arxiv_id: str | None = None,
    raw_citation: str | None = None,  # defaults to title if omitted
)
Batch validation without PDF extraction
For use cases where you already have structured reference data (titles, authors, DOIs) and want to validate without going through PDF extraction:
from hallucinator import Reference, Validator, ValidatorConfig
# Build references from structured data
refs = [
    Reference("Attention Is All You Need", authors=["Vaswani", "Shazeer"]),
    Reference("BERT: Pre-training of Deep Bidirectional Transformers",
              authors=["Devlin", "Chang"], doi="10.18653/v1/N19-1423"),
    Reference("A Completely Made Up Paper Title That Does Not Exist"),
]
# Validate
config = ValidatorConfig()
validator = Validator(config)
results = validator.check(refs)
for r in results:
    print(f"[{r.status}] {r.title}")
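When the structured data lives in a CSV file, the rows first need to be turned into constructor arguments. The sketch below does only that step with the standard library, so it runs without the package installed. The column names (title, authors, doi) and the semicolon separator for authors are assumptions about the input file, and rows_to_reference_args is a hypothetical helper name.

```python
import csv
import io


def rows_to_reference_args(csv_text: str) -> list[dict]:
    """Turn CSV rows into arguments matching the Reference constructor."""
    args_list = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        title = (row.get("title") or "").strip()
        if not title:
            continue  # Reference requires a title, so skip rows without one
        authors = [a.strip() for a in (row.get("authors") or "").split(";") if a.strip()]
        doi = (row.get("doi") or "").strip() or None
        args_list.append({"title": title, "authors": authors, "doi": doi})
    return args_list


csv_text = (
    "title,authors,doi\n"
    "Attention Is All You Need,Vaswani; Shazeer,\n"
    "BERT,Devlin; Chang,10.18653/v1/N19-1423\n"
)
for args in rows_to_reference_args(csv_text):
    # With hallucinator installed, each dict maps onto:
    # Reference(args["title"], authors=args["authors"], doi=args["doi"])
    print(args)
```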
Reference Validation
After extracting references, validate them against academic databases. The validator queries up to 10 databases concurrently per reference, with early exit on first match.
Quick start
from hallucinator import PdfExtractor, Validator, ValidatorConfig
ext = PdfExtractor()
result = ext.extract("paper.pdf")
config = ValidatorConfig()
validator = Validator(config)
results = validator.check(result.references)
for r in results:
    if r.status == "verified":
        print(f" OK: {r.title} (via {r.source})")
    elif r.status == "not_found":
        print(f" ?? {r.title}")
    elif r.status == "author_mismatch":
        print(f" ~~ {r.title} (authors don't match)")
ValidatorConfig
All configuration for database queries. Create one, tweak what you need, pass it to Validator().
config = ValidatorConfig()
API keys
config.s2_api_key = "your-semantic-scholar-key"
config.openalex_key = "your-openalex-key"
config.crossref_mailto = "you@university.edu" # CrossRef polite pool
Concurrency and timeouts
config.num_workers = 4 # references checked in parallel (default: 4)
config.db_timeout_secs = 10 # per-database timeout (default: 10)
config.db_timeout_short_secs = 5 # short timeout for fast DBs (default: 5)
config.max_rate_limit_retries = 3 # max 429 retries per DB query (default: 3)
Persistent cache
config.cache_path = "/path/to/cache.db" # SQLite cache for cross-run persistence
# Cache TTL tuning (optional)
config.cache_positive_ttl_secs = 604800 # verified results (default: 7 days)
config.cache_negative_ttl_secs = 86400 # not-found results (default: 24 hours)
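The TTL values are plain seconds, so custom durations are easier to read when derived from a timedelta. A minimal sketch; ttl_secs is a hypothetical helper, not part of the package.

```python
from datetime import timedelta


def ttl_secs(**duration) -> int:
    """Convert a duration such as days=7 or hours=24 to whole seconds."""
    return int(timedelta(**duration).total_seconds())


print(ttl_secs(days=7))    # 604800, the documented default for verified results
print(ttl_secs(hours=24))  # 86400, the documented default for not-found results
```

The returned integers can be assigned directly to config.cache_positive_ttl_secs and config.cache_negative_ttl_secs.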
SearxNG web search fallback
config.searxng_url = "http://localhost:8888" # optional SearxNG instance URL
Disable databases
config.disabled_dbs = ["openalex", "pubmed"]
Database names: crossref, arxiv, dblp, semantic_scholar, acl, neurips, ssrn, europe_pmc, pubmed, openalex.
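Whether the library rejects a misspelled database name is not stated here, so a typo in disabled_dbs might go unnoticed. The following sketch checks names against the list above before they are assigned; KNOWN_DBS and unknown_db_names are hypothetical names introduced for illustration.

```python
# Database names as listed in this README.
KNOWN_DBS = {
    "crossref", "arxiv", "dblp", "semantic_scholar", "acl",
    "neurips", "ssrn", "europe_pmc", "pubmed", "openalex",
}


def unknown_db_names(names: list[str]) -> list[str]:
    """Return the names that do not appear in the documented list, sorted."""
    return sorted(set(names) - KNOWN_DBS)


print(unknown_db_names(["openalex", "pubmed"]))
print(unknown_db_names(["openalex", "pub_med"]))
```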
Offline databases
Point to local databases for DBLP and ACL Anthology (SQLite files built with the CLI's update-dblp / update-acl commands), or to a local OpenAlex index. Offline lookups are dramatically faster than online queries.
config.dblp_offline_path = "/path/to/dblp.db"
config.acl_offline_path = "/path/to/acl.db"
config.openalex_offline_path = "/path/to/openalex.idx"
If the path doesn't exist or the file isn't a valid database, Validator(config) raises RuntimeError.
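Since a bad path surfaces only when the Validator is constructed, a quick existence check beforehand can give a friendlier error message. The sketch below covers existence only; it cannot tell whether a file is a valid database, so Validator(config) may still raise RuntimeError. missing_offline_paths is a hypothetical helper, and the temporary files are placeholders.

```python
import tempfile
from pathlib import Path


def missing_offline_paths(paths: dict[str, str]) -> dict[str, str]:
    """Return the entries whose path does not point to an existing file."""
    return {name: p for name, p in paths.items() if not Path(p).is_file()}


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "dblp.db"
    existing.write_bytes(b"")  # placeholder file, not a real database
    candidates = {
        "dblp_offline_path": str(existing),
        "acl_offline_path": str(Path(tmp) / "acl.db"),  # never created
    }
    missing = missing_offline_paths(candidates)

print(sorted(missing))
```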
Author checking
config.check_openalex_authors = True # verify authors for OpenAlex matches (default: False)
Validator
The main validation engine. Create it once, call check() as many times as needed.
validator = Validator(config)
check()
Validates a list of Reference objects against all enabled databases. Blocks until complete but releases the Python GIL, so other threads can run.
results = validator.check(references)
# or with a progress callback:
results = validator.check(references, progress=on_progress)
Returns list[ValidationResult].
Progress callbacks
Pass a callable to check() to receive real-time progress events:
def on_progress(event):
    if event.event_type == "checking":
        print(f"[{event.index + 1}/{event.total}] Checking: {event.title}")
    elif event.event_type == "result":
        r = event.result
        print(f"[{event.index + 1}/{event.total}] {r.status}: {r.title}")
    elif event.event_type == "warning":
        print(f"Warning: {event.title} - {event.message}")
    elif event.event_type == "retrying":
        print(f"Retrying: {event.title} (failed: {', '.join(event.failed_dbs)})")
    elif event.event_type == "retry_pass":
        print(f"Retrying {event.count} unresolved references...")
    elif event.event_type == "db_query_complete":
        print(f" {event.db_name}: {event.db_status} ({event.elapsed_ms:.0f}ms)")
    elif event.event_type == "rate_limit_wait":
        print(f" Rate limited on {event.db_name}, waiting {event.wait_ms:.0f}ms...")
    elif event.event_type == "rate_limit_retry":
        print(f" Retrying {event.db_name} (attempt {event.attempt}, backoff {event.backoff_ms:.0f}ms)")
results = validator.check(refs, progress=on_progress)
ProgressEvent properties
All properties return None when not applicable to the event type.
| Property | Type | Event types |
|---|---|---|
| event_type | str | all |
| index | int | checking, result, warning, retrying |
| total | int | checking, result, warning, retrying |
| title | str | checking, warning, retrying |
| result | ValidationResult | result |
| failed_dbs | list[str] | warning, retrying |
| message | str | warning |
| count | int | retry_pass |
| paper_index | int | db_query_complete |
| ref_index | int | db_query_complete, rate_limit_retry |
| db_name | str | db_query_complete, rate_limit_wait, rate_limit_retry |
| db_status | str | db_query_complete |
| elapsed_ms | float | db_query_complete |
| attempt | int | rate_limit_retry |
| wait_ms | float | rate_limit_wait |
| backoff_ms | float | rate_limit_retry |
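A callback does not have to print; it can also accumulate data. The sketch below collects per-database timings from db_query_complete events using a closure. It is exercised with made-up SimpleNamespace objects that carry only the properties listed in the table, so it runs without the package; make_timing_collector is a hypothetical name.

```python
from collections import defaultdict
from types import SimpleNamespace


def make_timing_collector():
    """Return a progress callback plus the dict it fills with elapsed times."""
    timings = defaultdict(list)

    def on_progress(event):
        # Properties are None when they do not apply to the event type.
        if event.event_type == "db_query_complete" and event.elapsed_ms is not None:
            timings[event.db_name].append(event.elapsed_ms)

    return on_progress, timings


on_progress, timings = make_timing_collector()

# Made-up events standing in for real ProgressEvent objects.
fake_events = [
    SimpleNamespace(event_type="db_query_complete", db_name="crossref", elapsed_ms=120.0),
    SimpleNamespace(event_type="db_query_complete", db_name="crossref", elapsed_ms=80.0),
    SimpleNamespace(event_type="checking", db_name=None, elapsed_ms=None),
]
for event in fake_events:
    on_progress(event)

averages = {db: sum(ms) / len(ms) for db, ms in timings.items()}
print(averages)
```

In real use the returned on_progress would be passed as progress=on_progress to check(), and timings inspected once the call returns.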
Cancellation
Cancel a running check from another thread:
import threading
import time
validator = Validator(config)
outcome = {}
def run_check():
    # Store the results so they stay reachable after the thread ends
    outcome["results"] = validator.check(refs)
t = threading.Thread(target=run_check)
t.start()
# Cancel after 30 seconds
time.sleep(30)
validator.cancel()
t.join()
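Sleeping for a fixed time always waits the full 30 seconds, even when the check finishes early. One alternative is a threading.Timer that calls cancel() only if the deadline is reached. The sketch below shows the pattern with FakeValidator, a hypothetical stand-in that mimics only the check() and cancel() methods; what the real check() returns after a cancellation is not stated here, and check_with_deadline is an illustrative name.

```python
import threading


class FakeValidator:
    """Hypothetical stand-in exposing check() and cancel() like Validator."""

    def __init__(self):
        self._cancelled = threading.Event()

    def check(self, refs):
        results = []
        for ref in refs:
            # Pretend each reference takes 50 ms; stop early once cancelled.
            if self._cancelled.wait(timeout=0.05):
                break
            results.append(ref)
        return results

    def cancel(self):
        self._cancelled.set()


def check_with_deadline(validator, refs, seconds):
    """Run check() and call cancel() if it is still running after `seconds`."""
    timer = threading.Timer(seconds, validator.cancel)
    timer.start()
    try:
        return validator.check(refs)
    finally:
        timer.cancel()  # harmless if the timer has already fired


partial = check_with_deadline(FakeValidator(), list(range(100)), seconds=0.2)
print(f"{len(partial)} of 100 checked before the deadline")
```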
Stats
Compute summary statistics from results:
stats = Validator.stats(results)
print(f"Total: {stats.total}")
print(f"Verified: {stats.verified}")
print(f"Not found: {stats.not_found}")
print(f"Author mismatch: {stats.author_mismatch}")
print(f"Retracted: {stats.retracted}")
print(f"Skipped: {stats.skipped}")
ValidationResult
The result of checking a single reference.
r = results[0]
r.title # str — reference title
r.raw_citation # str — original citation text
r.status # "verified" | "not_found" | "author_mismatch"
r.source # str | None — database that verified it (e.g. "crossref")
r.ref_authors # list[str] — authors from the parsed reference
r.found_authors # list[str] — authors from the matching DB record
r.paper_url # str | None — URL in the matching database
r.failed_dbs # list[str] — databases that timed out or errored
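A not_found result is weaker evidence when some databases timed out or errored, because the reference may exist in a database that was never successfully queried. That reading is an interpretation of the fields above rather than documented behaviour. The sketch separates the two cases using made-up stand-in objects; split_not_found is a hypothetical helper.

```python
from types import SimpleNamespace


def split_not_found(results):
    """Separate clean misses from misses where some databases failed."""
    clean, inconclusive = [], []
    for r in results:
        if r.status != "not_found":
            continue
        (inconclusive if r.failed_dbs else clean).append(r)
    return clean, inconclusive


# Made-up stand-ins carrying only the fields used above.
results = [
    SimpleNamespace(title="Paper A", status="verified", failed_dbs=[]),
    SimpleNamespace(title="Paper B", status="not_found", failed_dbs=[]),
    SimpleNamespace(title="Paper C", status="not_found", failed_dbs=["dblp"]),
]
clean, inconclusive = split_not_found(results)
print([r.title for r in clean], [r.title for r in inconclusive])
```

References in the inconclusive group are candidates for a second run once the failing databases are reachable again.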
Per-database results
Every database query is recorded, even if it didn't match:
for db in r.db_results:
    print(f" {db.db_name}: {db.status}", end="")
    if db.elapsed_ms is not None:
        print(f" ({db.elapsed_ms:.0f}ms)", end="")
    if db.paper_url:
        print(f" → {db.paper_url}", end="")
    print()
DbResult.status values: "match", "no_match", "author_mismatch", "timeout", "rate_limited", "error", "skipped".
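The per-database records make it possible to spot a database that fails often across a whole run. The sketch below counts failure statuses per database. Treating "timeout", "rate_limited", and "error" as failures is a choice made for this example, and the input objects are made-up stand-ins with only the db_results, db_name, and status fields.

```python
from collections import Counter
from types import SimpleNamespace

# Statuses counted as failures in this example (an assumption, see above).
FAILURE_STATUSES = {"timeout", "rate_limited", "error"}


def failure_counts(results) -> Counter:
    """Count failed queries per database across all validation results."""
    counts = Counter()
    for r in results:
        for db in r.db_results:
            if db.status in FAILURE_STATUSES:
                counts[db.db_name] += 1
    return counts


def fake_db(name, status):
    """Build a made-up stand-in for a DbResult."""
    return SimpleNamespace(db_name=name, status=status)


results = [
    SimpleNamespace(db_results=[fake_db("crossref", "match"), fake_db("dblp", "timeout")]),
    SimpleNamespace(db_results=[fake_db("crossref", "no_match"), fake_db("dblp", "error")]),
    SimpleNamespace(db_results=[fake_db("arxiv", "rate_limited")]),
]
print(failure_counts(results).most_common())
```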
DOI and arXiv info
if r.doi_info:
    print(f"DOI: {r.doi_info.doi} (valid={r.doi_info.valid})")
    if r.doi_info.title:
        print(f" Resolved title: {r.doi_info.title}")
if r.arxiv_info:
    print(f"arXiv: {r.arxiv_info.arxiv_id} (valid={r.arxiv_info.valid})")
Retraction info
if r.retraction_info and r.retraction_info.is_retracted:
    print("RETRACTED!")
    if r.retraction_info.retraction_doi:
        print(f" Retraction DOI: {r.retraction_info.retraction_doi}")
    if r.retraction_info.retraction_source:
        print(f" Source: {r.retraction_info.retraction_source}")
Complete example
Extract, validate, and report — the full pipeline:
from hallucinator import PdfExtractor, Validator, ValidatorConfig
# Extract
ext = PdfExtractor()
result = ext.extract("paper.pdf")
refs = result.references
print(f"Extracted {len(refs)} references")
# Configure
config = ValidatorConfig()
config.s2_api_key = "your-key" # optional but improves results
config.dblp_offline_path = "dblp.db" # optional, faster than online
config.disabled_dbs = ["openalex"] # skip DBs you don't need
# Validate with progress
def on_progress(event):
    if event.event_type == "checking":
        print(f" [{event.index + 1}/{event.total}] {event.title}")
    elif event.event_type == "result":
        r = event.result
        icon = {"verified": "+", "not_found": "?", "author_mismatch": "~"}[r.status]
        src = f" ({r.source})" if r.source else ""
        print(f" [{icon}] {r.title}{src}")
validator = Validator(config)
results = validator.check(refs, progress=on_progress)
# Summary
stats = Validator.stats(results)
print(f"\nVerified: {stats.verified}/{stats.total}")
if stats.not_found:
    print(f"Potentially hallucinated: {stats.not_found}")
if stats.retracted:
    print(f"Retracted: {stats.retracted}")
# Flag suspicious references
for r in results:
    if r.status == "not_found":
        print(f"\n NOT FOUND: {r.title}")
        print(f" Citation: {r.raw_citation[:120]}...")
    if r.retraction_info and r.retraction_info.is_retracted:
        print(f"\n RETRACTED: {r.title}")
API Reference
Extraction types
| Class | Description |
|---|---|
| PdfExtractor | Configurable PDF extraction pipeline with custom strategy support |
| ExtractionResult | Container for parsed references and skip statistics |
| Reference | A parsed reference (title, authors, DOI, arXiv ID); also constructible manually |
| SkipStats | Counts of skipped references by reason |
| ArchiveEntry | A single entry yielded from archive extraction |
| ArchiveIterator | Iterator over archive entries |
| is_archive_path() | Returns True if a path looks like a supported archive |
Validation types
| Class | Description |
|---|---|
| ValidatorConfig | Configuration: API keys, timeouts, concurrency, offline DBs, cache, SearxNG |
| Validator | Validation engine: call .check(refs) to validate |
| ValidationResult | Per-reference result: status, source, authors, per-DB details |
| DbResult | Single database query result: status, elapsed time, found authors |
| DoiInfo | DOI resolution result |
| ArxivInfo | arXiv resolution result |
| RetractionInfo | Retraction check result |
| ProgressEvent | Real-time progress callback event |
| CheckStats | Summary statistics (verified, not_found, author_mismatch, retracted) |
Status values
ValidationResult.status: "verified" | "not_found" | "author_mismatch"
DbResult.status: "match" | "no_match" | "author_mismatch" | "timeout" | "rate_limited" | "error" | "skipped"
ProgressEvent.event_type: "checking" | "result" | "warning" | "retrying" | "retry_pass" | "db_query_complete" | "rate_limit_wait" | "rate_limit_retry"
Examples
See python/examples/ for runnable scripts:
| Example | Description |
|---|---|
| basic_usage.py | Extract references from a PDF |
| step_by_step.py | Run each pipeline stage individually |
| custom_regexes.py | Override patterns for non-standard formats |
| validate_references.py | Full pipeline: extract + validate + report |
| batch_validate.py | Validate references without PDF extraction (#178) |
Threading and performance
- GIL release: Validator.check() releases the Python GIL during the Rust async runtime call. Other Python threads can execute freely while validation runs.
- Concurrency: References are checked in parallel (default 4 at a time). All 10 databases are queried concurrently per reference. First match triggers early exit.
- Progress callbacks: The GIL is briefly re-acquired to call Python progress callbacks. Since events fire once per reference (not per HTTP request), overhead is negligible.
- Tokio runtime: Each Validator instance owns a tokio multi-threaded runtime. Creating many validators is wasteful; reuse a single instance for multiple check() calls.
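Because check() blocks without holding the GIL, it can run on a worker thread while the main thread stays responsive. The sketch below shows the pattern with one shared instance and a single-worker pool. SlowValidator is a hypothetical stand-in whose check() just sleeps, and max_workers=1 is a cautious choice: whether overlapping check() calls on one Validator are supported is not stated here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class SlowValidator:
    """Hypothetical stand-in: check() blocks but lets other threads run."""

    def check(self, refs):
        time.sleep(0.1)  # sleeping releases the GIL, like the documented check()
        return [f"checked:{r}" for r in refs]


validator = SlowValidator()  # one instance, reused for every batch
batches = [["a", "b"], ["c"]]

with ThreadPoolExecutor(max_workers=1) as pool:
    futures = [pool.submit(validator.check, batch) for batch in batches]
    ticks = 0
    while not all(f.done() for f in futures):
        ticks += 1  # the main thread is free to do other work here
        time.sleep(0.01)
    results = [f.result() for f in futures]

print(results)
```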