FiCi: a lightweight detector for fake/hallucinated citations in scientific papers.
FiCi
FiCi (Fictitious Citations) is a lightweight Python package for detecting fabricated or hallucinated citations in scientific PDFs. It's tuned for standard single-/double-column conference layouts (NeurIPS, ICLR, ACM acmart / SIG conf) and avoids LLMs or heavy ML models.
Install
From PyPI:
pip install fici
From source (editable, for development):
git clone https://github.com/sadjadeb/fici.git
cd fici
pip install -e ".[dev]"
Command-line usage
Installing the package registers a fici console script:
fici paper.pdf --email you@example.org
Useful flags:
fici paper.pdf --email you@example.org --workers 8 # more concurrency
fici paper.pdf --email you@example.org --json > out.json # machine-readable
fici paper.pdf --email you@example.org --quiet # summary only
fici --help
The CLI returns a non-zero exit code if any citation is flagged, which makes it easy to drop into CI pipelines:
| Exit code | Meaning |
|---|---|
| 0 | All references verified. |
| 1 | At least one reference is flagged or errored. |
| 2 | Bad input (e.g. PDF not found). |
`python -m fici ...` is equivalent to the `fici` script, which helps if your Python bin directory is not on PATH.
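As a rough illustration of the CI use case, the sketch below wraps the CLI from Python and translates the exit codes listed above into messages. The helper names `describe_exit` and `check_paper` are hypothetical and not part of FiCi; only the `--email` and `--quiet` flags and the three exit codes come from this README.

```python
import subprocess
import sys

# Exit-code meanings as listed in the table above.
EXIT_MEANINGS = {
    0: "all references verified",
    1: "at least one reference flagged or errored",
    2: "bad input (e.g. PDF not found)",
}


def describe_exit(code: int) -> str:
    """Map a fici exit code to a human-readable message (hypothetical helper)."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def check_paper(pdf_path: str, email: str) -> int:
    """Run `python -m fici` on one PDF and return its exit code."""
    # sys.executable avoids relying on the `fici` script being on PATH.
    result = subprocess.run(
        [sys.executable, "-m", "fici", pdf_path, "--email", email, "--quiet"],
        check=False,  # a non-zero exit is an expected outcome, not an exception
    )
    print(f"{pdf_path}: {describe_exit(result.returncode)}")
    return result.returncode
```

A CI step could then end with `sys.exit(check_paper("paper.pdf", "you@example.org"))` so that flagged citations fail the job.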
Programmatic usage
from fici import FiCiPipeline
pipeline = FiCiPipeline(email="you@example.org") # polite pool
reports = pipeline.run("paper.pdf")
for r in reports:
    print(r.index, r.verdict.value, round(r.score, 1), r.suspected_title)
print(FiCiPipeline.summarize(reports))
See example.py for a complete programmatic usage example.
How it works
The pipeline has four phases, each exposed as a standalone class:
- Extraction (`ReferenceExtractor`): PyMuPDF pulls the text, heuristics locate the References / Bibliography section, and regex splitters handle the dominant reference styles (`[1] ...`, `1. ...`, Author-Year).
- Structuring + Search (primary) (`CitationSearcher.search_openalex`): each raw citation is sent to the OpenAlex `/works` endpoint as a free-text query (title only, for precision), using the polite pool via `mailto`. The hits are then handed to the verifier.
- Search (second opinion) (`CitationSearcher.search_crossref`): whenever the OpenAlex-based verdict is anything other than `Verified` (suspicious match, no match, or error), FiCi also queries Crossref through its `query.bibliographic` parameter and verifies the hits. The pipeline returns whichever of the two reports is stronger: `Verified` always beats other verdicts, and within the same tier the higher score wins. If OpenAlex verifies on the first try, Crossref is skipped to save latency.
- Verification (`CitationVerifier`): `rapidfuzz.fuzz.token_sort_ratio` compares the API-returned title to the suspected title in the raw string, with a small bonus for corroborating author surnames. The pipeline emits one of four verdicts:

| Verdict | Condition |
|---|---|
| `Verified` | Score ≥ verify threshold (default 85). |
| `Suspicious/Mismatch` | API found candidates but score < verify threshold (default band 75–85). |
| `Highly Likely Fake` | Neither API returned any results. |
| `Error` | API call raised an unrecoverable exception. |
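The scoring and "stronger report" rules above can be sketched in a few lines. This is an illustration under stated assumptions, not FiCi's implementation: `token_sort_score` approximates a token-sort ratio with the standard library's `difflib` rather than calling `rapidfuzz`, `Report` is a hypothetical stand-in for FiCi's report objects, the two "tiers" are assumed to be `Verified` and everything else, and a score below the mismatch threshold is assumed to be treated like a missing match (the README does not specify that case). The surname bonus and the `Error` verdict are left out.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


def token_sort_score(a: str, b: str) -> float:
    """Approximate a token-sort ratio (0-100) with the standard library."""
    # Sorting the lowercased tokens makes the comparison order-insensitive.
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return 100.0 * SequenceMatcher(None, norm_a, norm_b).ratio()


def verdict_for(score: float, has_candidates: bool,
                verify: float = 85.0, mismatch: float = 75.0) -> str:
    """Map a score to a verdict name using the default thresholds."""
    if not has_candidates:
        return "Highly Likely Fake"
    if score >= verify:
        return "Verified"
    if score >= mismatch:
        return "Suspicious/Mismatch"
    # ASSUMPTION: the README does not say what happens below the mismatch
    # threshold; this sketch treats it like a missing match.
    return "Highly Likely Fake"


@dataclass
class Report:  # hypothetical stand-in, not FiCi's report class
    source: str
    verdict: str
    score: float


def stronger(first: Report, second: Report) -> Report:
    """Pick the stronger report: Verified wins, then the higher score."""
    first_ok = first.verdict == "Verified"
    second_ok = second.verdict == "Verified"
    if first_ok != second_ok:
        return first if first_ok else second
    # Same tier (assumed: both Verified or both not): higher score wins,
    # and ties keep the first report.
    return first if first.score >= second.score else second


openalex = Report("openalex", "Suspicious/Mismatch", 80.0)
crossref = Report("crossref", "Verified", 91.0)
best = stronger(openalex, crossref)  # the Verified Crossref report wins
```

Sorting tokens before comparing is what makes the score tolerant of reordered words, which is useful when a title is extracted from a noisy raw string.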
Tuning knobs
- `FiCiPipeline(verify_threshold=85, mismatch_threshold=75)`: move the cutoffs up/down to trade precision for recall.
- `FiCiPipeline(max_workers=4)`: API calls are dispatched concurrently via a thread pool (I/O-bound work). Default is 4, which stays under the OpenAlex / Crossref polite-pool rate limits. Set to `1` to force sequential execution, or override per call with `pipeline.run(pdf, max_workers=N)`.
- `CitationSearcher(max_results=5, timeout=15, retries=2)`: control API politeness and robustness.
- Inject a custom `ReferenceExtractor` subclass if you need to support a non-standard template (e.g. workshop-specific layouts).
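To picture what the `max_workers` knob controls, here is a minimal sketch of dispatching I/O-bound lookups through a thread pool. The function `check_all` and its `lookup` callback are hypothetical stand-ins for the per-citation API call; FiCi's internal dispatch may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def check_all(citations: List[str],
              lookup: Callable[[str], str],
              max_workers: int = 4) -> List[str]:
    """Run `lookup` on every citation, preserving input order."""
    if max_workers <= 1:
        # Sequential path, mirroring the "set to 1" behaviour described above.
        return [lookup(c) for c in citations]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(lookup, citations))
```

Threads suit this workload because each lookup spends most of its time waiting on the network, so a small pool gives most of the speed-up while keeping the request rate modest.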
Current limitations
- Title extraction from raw strings is heuristic; unusual punctuation or missing years can occasionally yield an incomplete `suspected_title`, which is why scoring also consults the full raw string.
- Author matching uses surname containment rather than a structured parse. If you'd like structured parsing via `anystyle` or GROBID, that's a clean extension point on `CitationSearcher._prepare_query`.
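The surname-containment limitation is easier to see with a small sketch. The helper `matched_surnames` is hypothetical, and it assumes the surname is the last whitespace-separated token of each API-returned author name; FiCi's actual matching may differ.

```python
from typing import List


def matched_surnames(raw_citation: str, api_authors: List[str]) -> List[str]:
    """Return the surnames that appear anywhere in the raw citation string."""
    haystack = raw_citation.lower()
    hits = []
    for name in api_authors:
        parts = name.split()
        if not parts:
            continue
        surname = parts[-1]  # ASSUMPTION: the surname is the last token
        # Plain substring containment, with no structured parse of the citation.
        if surname.lower() in haystack:
            hits.append(surname)
    return hits
```

Because this is a substring test, a short surname such as "Li" also matches inside an unrelated word like "Reliable", which is the kind of false corroboration a structured parser would avoid.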