FiCi: a lightweight detector for fake/hallucinated citations in scientific papers.

FiCi

FiCi (Fictitious Citations) is a lightweight Python package for detecting fabricated or hallucinated citations in scientific PDFs. It is tuned for standard single- and double-column conference layouts (NeurIPS, ICLR, ACM acmart / sigconf) and uses no LLMs or heavy ML models.

Install

From PyPI:

pip install fici

From source (editable, for development):

git clone https://github.com/sadjadeb/fici.git
cd fici
pip install -e ".[dev]"

Command-line usage

Installing the package registers a fici console script:

fici paper.pdf --email you@example.org

Useful flags:

fici paper.pdf --email you@example.org --workers 8        # more concurrency
fici paper.pdf --email you@example.org --json > out.json  # machine-readable
fici paper.pdf --email you@example.org --quiet            # summary only
fici --help

The CLI returns a non-zero exit code if any citation is flagged, which makes it easy to drop into CI pipelines:

Exit code  Meaning
0          All references verified.
1          At least one reference is flagged or errored.
2          Bad input (e.g. PDF not found).

python -m fici ... is equivalent to the fici script, which is useful if your Python bin directory is not on PATH.
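
To illustrate how the exit codes above might be consumed from a build script, here is a minimal sketch. The helper names describe_exit and check_paper are hypothetical and not part of FiCi; the command line uses only the flags listed above.

```python
import subprocess
import sys

# Exit-code meanings copied from the table in this README.
EXIT_MEANINGS = {
    0: "all references verified",
    1: "at least one reference flagged or errored",
    2: "bad input (e.g. PDF not found)",
}


def describe_exit(code: int) -> str:
    """Map a fici exit code to a human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def check_paper(pdf_path: str, email: str) -> int:
    """Run fici through `python -m fici` and return its exit code.

    Hypothetical helper: requires fici to be installed in the
    interpreter that runs this script.
    """
    cmd = [sys.executable, "-m", "fici", pdf_path, "--email", email, "--quiet"]
    return subprocess.run(cmd).returncode


# Example use in a CI step (not executed here):
#   code = check_paper("paper.pdf", "you@example.org")
#   print(describe_exit(code))
#   sys.exit(code)
```

Passing the exit code through unchanged lets the CI system fail the job on codes 1 and 2 without extra logic.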

Programmatic usage

from fici import FiCiPipeline

pipeline = FiCiPipeline(email="you@example.org")  # polite pool
reports = pipeline.run("paper.pdf")

for r in reports:
    print(r.index, r.verdict.value, round(r.score, 1), r.suspected_title)

print(FiCiPipeline.summarize(reports))

See example.py for a complete programmatic usage example.
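
As a small follow-up, the reports can be tallied or filtered using only the r.verdict.value and r.index attributes shown above. This is a sketch: count_verdicts and flagged are hypothetical helpers, and the verdict strings in the demo data are assumed from the verdict names in this README, so check them against the real enum values before relying on them.

```python
from collections import Counter
from types import SimpleNamespace


def count_verdicts(reports):
    """Tally reports by the string stored in r.verdict.value."""
    return Counter(r.verdict.value for r in reports)


def flagged(reports, ok_value):
    """Return the reports whose verdict string differs from ok_value."""
    return [r for r in reports if r.verdict.value != ok_value]


# Stand-in objects so the sketch runs without a PDF.
# Real reports would come from pipeline.run("paper.pdf").
demo = [
    SimpleNamespace(index=1, verdict=SimpleNamespace(value="Verified")),
    SimpleNamespace(index=2, verdict=SimpleNamespace(value="Highly Likely Fake")),
]

print(count_verdicts(demo))
print([r.index for r in flagged(demo, ok_value="Verified")])
```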

How it works

The pipeline has four phases, each exposed as a standalone class:

  1. Extraction (ReferenceExtractor): PyMuPDF pulls text, heuristics locate the References / Bibliography section, and regex splitters handle the dominant reference styles ([1] ..., 1. ..., Author-Year).

  2. Structuring + Search (primary) (CitationSearcher.search_openalex): each raw citation is sent to the OpenAlex /works endpoint as a free-text query (title only, for precision), using the polite pool via mailto. The hits are then handed to the verifier.

  3. Search (second opinion) (CitationSearcher.search_crossref): whenever the OpenAlex-based verdict is anything other than Verified (suspicious match, no match, or error), FiCi also queries Crossref's /works endpoint through the query.bibliographic parameter and verifies its hits. The pipeline returns whichever of the two reports is stronger: Verified always beats other verdicts, and within the same tier the higher score wins. If OpenAlex verifies on the first try, Crossref is skipped to save latency.

  4. Verification (CitationVerifier): rapidfuzz.fuzz.token_sort_ratio compares the API-returned title to the suspected title in the raw string, with a small bonus for corroborating author surnames. The pipeline emits one of four verdicts:

    Verdict              Condition
    Verified             Score ≥ verify threshold (default 85).
    Suspicious/Mismatch  API found candidates but score < threshold (default 75–85).
    Highly Likely Fake   Neither API returned any results.
    Error                API call raised an unrecoverable exception.
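
The control flow of phases 2 to 4 can be sketched as follows. This is not FiCi's code: Report, classify, stronger and check are hypothetical names, the two lookups are passed in as plain callables that return a best score (or None for no hits), and "tier" is assumed to mean Verified versus everything else. The Error verdict and mismatch_threshold are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional

VERIFIED = "Verified"
SUSPICIOUS = "Suspicious/Mismatch"
LIKELY_FAKE = "Highly Likely Fake"


@dataclass
class Report:
    """Hypothetical stand-in, not FiCi's report class."""
    verdict: str
    score: float


def classify(best_score: Optional[float], verify_threshold: float = 85.0) -> Report:
    """Turn the best candidate score (None means no hits) into a report."""
    if best_score is None:
        return Report(LIKELY_FAKE, 0.0)
    if best_score >= verify_threshold:
        return Report(VERIFIED, best_score)
    return Report(SUSPICIOUS, best_score)


def stronger(a: Report, b: Report) -> Report:
    """Verified beats anything else; otherwise the higher score wins."""
    return max(a, b, key=lambda r: (r.verdict == VERIFIED, r.score))


def check(citation: str,
          primary: Callable[[str], Optional[float]],
          secondary: Callable[[str], Optional[float]]) -> Report:
    """Query the primary source; consult the secondary only if not verified."""
    first = classify(primary(citation))
    if first.verdict == VERIFIED:
        return first  # secondary lookup skipped to save latency
    return stronger(first, classify(secondary(citation)))
```

Keeping the lookups as injected callables makes the fallback rule testable without network access.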

Tuning knobs

  • FiCiPipeline(verify_threshold=85, mismatch_threshold=75): move the cutoffs up/down to trade precision for recall.
  • FiCiPipeline(max_workers=4): API calls are dispatched concurrently via a thread pool (I/O-bound work). Default is 4, which stays under the OpenAlex / Crossref polite-pool rate limits. Set to 1 to force sequential execution, or override per-call with pipeline.run(pdf, max_workers=N).
  • CitationSearcher(max_results=5, timeout=15, retries=2): control API politeness and robustness.
  • Inject a custom ReferenceExtractor subclass if you need to support a non-standard template (e.g. workshop-specific layouts).
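
For readers curious about what max_workers controls, here is a minimal sketch of thread-pool dispatch for I/O-bound lookups. It is an illustration under assumptions, not FiCi's implementation: run_concurrently and check_one are hypothetical names, and check_one stands in for whatever per-citation lookup is being performed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_concurrently(citations: List[str],
                     check_one: Callable[[str], T],
                     max_workers: int = 4) -> List[T]:
    """Apply check_one to every citation, preserving input order."""
    if max_workers <= 1:
        # Sequential path, matching the "set to 1" behaviour described above.
        return [check_one(c) for c in citations]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(check_one, citations))
```

Threads suit this workload because each task spends most of its time waiting on HTTP responses rather than using the CPU.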

Current limitations

  • Title extraction from raw strings is heuristic; unusual punctuation or missing years can occasionally yield an incomplete suspected_title, which is why scoring also consults the full raw string.
  • Author matching uses surname containment rather than a structured parse. If you'd like structured parsing via anystyle or GROBID, that's a clean extension point on CitationSearcher._prepare_query.
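
To make the scoring idea concrete, here is a rough sketch of token-sorted title similarity plus a surname-containment bonus. It deliberately uses difflib.SequenceMatcher from the standard library as a stand-in for rapidfuzz.fuzz.token_sort_ratio, so the numbers will differ from FiCi's. The function names, the per_hit and cap values, and the way the bonus is combined are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import Iterable


def _norm_tokens(text: str) -> str:
    """Lowercase, drop punctuation, and sort tokens alphabetically."""
    return " ".join(sorted(re.findall(r"\w+", text.lower())))


def title_similarity(a: str, b: str) -> float:
    """0-100 similarity on sorted tokens (difflib stand-in for rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, _norm_tokens(a), _norm_tokens(b)).ratio()


def surname_bonus(raw: str, surnames: Iterable[str],
                  per_hit: float = 2.0, cap: float = 6.0) -> float:
    """Small bonus for each candidate surname contained in the raw string."""
    raw_lower = raw.lower()
    hits = sum(1 for s in surnames if s and s.lower() in raw_lower)
    return min(cap, per_hit * hits)


def score(raw: str, suspected_title: str,
          api_title: str, api_surnames: Iterable[str]) -> float:
    """Title similarity plus surname bonus, clamped to 100."""
    base = title_similarity(suspected_title, api_title)
    return min(100.0, base + surname_bonus(raw, api_surnames))
```

Plain substring containment can match short surnames inside unrelated words, which is one reason a structured author parse would be more reliable.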
