FiCi: a lightweight detector for fake/hallucinated citations in scientific papers.

Reason this release was yanked:

This version has wrong requirements versioning.

Project description

FiCi

FiCi (Fictitious Citations) is a lightweight Python package for detecting fabricated or hallucinated citations in scientific PDFs. It is tuned for standard single- and double-column conference layouts (NeurIPS, ICLR, ACM acmart / SIG conf) and uses neither LLMs nor heavy ML models.

Install

From PyPI:

pip install fici

From source (editable, for development):

git clone https://github.com/sadjadeb/fici.git
cd fici
pip install -e ".[dev]"

Command-line usage

Installing the package registers a fici console script:

fici paper.pdf --email you@example.org

Useful flags:

fici paper.pdf --email you@example.org --workers 8        # more concurrency
fici paper.pdf --email you@example.org --json > out.json  # machine-readable
fici paper.pdf --email you@example.org --quiet            # summary only
fici --help

The CLI exits with a non-zero code if any citation is flagged or errored, or if the input is bad, which makes it easy to drop into CI pipelines:

Exit code   Meaning
0           All references verified.
1           At least one reference is flagged or errored.
2           Bad input (e.g. PDF not found).
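
A CI step could wrap the CLI roughly as follows. This is a minimal sketch: describe_exit and check_paper are hypothetical helper names, not part of FiCi. Only the exit codes, the --email and --quiet flags, and the python -m fici form come from this README.

```python
import subprocess
import sys

# Exit codes as documented in the table above.
EXIT_MEANINGS = {
    0: "all references verified",
    1: "at least one reference flagged or errored",
    2: "bad input (e.g. PDF not found)",
}


def describe_exit(code):
    """Translate a FiCi exit code into a short message (hypothetical helper)."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def check_paper(pdf_path, email):
    """Run FiCi on one PDF and return its exit code (hypothetical helper)."""
    # `python -m fici` works even when the console script is not on PATH.
    result = subprocess.run(
        [sys.executable, "-m", "fici", pdf_path, "--email", email, "--quiet"]
    )
    return result.returncode


# Typical CI use, left commented out so the snippet has no side effects:
# code = check_paper("paper.pdf", "you@example.org")
# print(describe_exit(code))
# sys.exit(code)
```

Passing the exit code straight through to sys.exit lets the CI job fail exactly when FiCi does.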

python -m fici ... is equivalent to the fici script and is useful if your Python bin directory is not on PATH.

Programmatic usage

from fici import FiCiPipeline

pipeline = FiCiPipeline(email="you@example.org")  # polite pool
reports = pipeline.run("paper.pdf")

for r in reports:
    print(r.index, r.verdict.value, round(r.score, 1), r.suspected_title)

print(FiCiPipeline.summarize(reports))

See example.py for a complete programmatic usage example.
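
The report objects can be post-processed with ordinary Python. The sketch below assumes only the attributes used in the snippet above (index, verdict.value, score, suspected_title). tally_verdicts and lowest_scoring are hypothetical helpers, and the demo list uses stand-in objects with placeholder verdict labels and made-up scores so the snippet runs without a PDF.

```python
from collections import Counter
from types import SimpleNamespace


def tally_verdicts(reports):
    """Count reports per verdict label (hypothetical helper)."""
    return Counter(r.verdict.value for r in reports)


def lowest_scoring(reports, n=3):
    """Return the n reports with the lowest match score, worst first."""
    return sorted(reports, key=lambda r: r.score)[:n]


# Stand-ins for real reports; labels and scores are placeholders.
demo = [
    SimpleNamespace(index=1, verdict=SimpleNamespace(value="Verified"),
                    score=96.0, suspected_title="A real paper"),
    SimpleNamespace(index=2, verdict=SimpleNamespace(value="Highly Likely Fake"),
                    score=0.0, suspected_title="An invented paper"),
    SimpleNamespace(index=3, verdict=SimpleNamespace(value="Verified"),
                    score=88.5, suspected_title="Another real paper"),
]

counts = tally_verdicts(demo)
worst = lowest_scoring(demo, n=1)
```

With a real run, pass the list returned by pipeline.run("paper.pdf") instead of demo.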

How it works

The pipeline has four phases, each exposed as a standalone class:

  1. Extraction (ReferenceExtractor): PyMuPDF pulls text, heuristics locate the References / Bibliography section, and regex splitters handle the dominant reference styles ([1] ..., 1. ..., Author-Year).

  2. Structuring + Search (primary) (CitationSearcher.search_openalex): each raw citation is sent to the OpenAlex /works endpoint as a free-text query (title only, for precision), using the polite pool via mailto. The hits are then handed to the verifier.

  3. Search (second opinion) (CitationSearcher.search_crossref): whenever the OpenAlex-based verdict is anything other than Verified (suspicious match, no match, or error), FiCi also queries Crossref's query.bibliographic endpoint and verifies its hits. The pipeline returns whichever of the two reports is stronger — Verified always beats other verdicts, and within the same tier the higher score wins. If OpenAlex verifies on the first try, Crossref is skipped to save latency.

  4. Verification (CitationVerifier): rapidfuzz.fuzz.token_sort_ratio compares the API-returned title to the suspected title in the raw string, with a small bonus for corroborating author surnames. The pipeline emits one of four verdicts:

    Verdict               Condition
    Verified              Score ≥ verify threshold (default 85).
    Suspicious/Mismatch   API found candidates, but the score is below the verify threshold (default range 75–85).
    Highly Likely Fake    Neither API returned any results.
    Error                 API call raised an unrecoverable exception.
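
The verdict table and the "stronger report" rule from phase 3 can be sketched as below. This is an illustration, not FiCi's implementation: Report, classify, and stronger are hypothetical names, the verdict strings are the table labels rather than confirmed enum values, and "tier" is read here as Verified versus everything else. The sketch ignores mismatch_threshold because the text does not say how scores below it are treated.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """Hypothetical stand-in for one verification result."""
    verdict: str
    score: float


def classify(score, had_candidates, errored=False, verify_threshold=85.0):
    """Map one API lookup to a verdict label, following the table above."""
    if errored:
        return "Error"
    if not had_candidates:
        return "Highly Likely Fake"
    if score >= verify_threshold:
        return "Verified"
    return "Suspicious/Mismatch"


def stronger(a, b):
    """Pick the stronger report: Verified wins, then the higher score."""
    def key(report):
        # Tuples compare element by element: verdict tier first, score second.
        return (report.verdict == "Verified", report.score)
    return a if key(a) >= key(b) else b


openalex = Report(classify(80.0, had_candidates=True), 80.0)
crossref = Report(classify(91.0, had_candidates=True), 91.0)
best = stronger(openalex, crossref)
```

Here the OpenAlex result falls short of the default threshold, so the Crossref result is consulted and, being Verified, is the one returned.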

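Going back to phase 1, splitting a references section into individual entries can be sketched with two regular expressions for the [1] ... and 1. ... styles. The patterns and the split_references function are assumptions for illustration; ReferenceExtractor's actual heuristics are not shown in this README, and the Author-Year style is left out.

```python
import re

# Entry markers at the start of a line: "[12] ..." and "12. ...".
BRACKETED = re.compile(r"^\[(\d+)\]\s+", re.MULTILINE)
NUMBERED = re.compile(r"^(\d{1,3})\.\s+", re.MULTILINE)


def split_references(text):
    """Split a references section into one string per entry (hypothetical)."""
    for pattern in (BRACKETED, NUMBERED):
        starts = [m.start() for m in pattern.finditer(text)]
        if len(starts) >= 2:
            # Each entry runs from its marker to the next marker (or the end);
            # anything before the first marker is dropped.
            bounds = starts + [len(text)]
            # Collapse line wraps and repeated spaces inside each entry.
            return [" ".join(text[a:b].split())
                    for a, b in zip(bounds, bounds[1:])]
    return []  # no supported numbering style detected


sample = (
    "[1] A. Author. First title.\n"
    "    Venue, 2020.\n"
    "[2] B. Author. Second title. 2021.\n"
)
entries = split_references(sample)
```

A wrapped line that happens to begin with a number and a period would be split incorrectly, which is one reason real extractors need additional heuristics.
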
Tuning knobs

  • FiCiPipeline(verify_threshold=85, mismatch_threshold=75): move the cutoffs up/down to trade precision for recall.
  • FiCiPipeline(max_workers=4): API calls are dispatched concurrently via a thread pool (I/O-bound work). Default is 4, which stays under the OpenAlex / Crossref polite-pool rate limits. Set to 1 to force sequential execution, or override per-call with pipeline.run(pdf, max_workers=N).
  • CitationSearcher(max_results=5, timeout=15, retries=2): control API politeness and robustness.
  • Inject a custom ReferenceExtractor subclass if you need to support a non-standard template (e.g. workshop-specific layouts).
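
The effect of max_workers can be pictured with the standard library's thread pool. This is a sketch of the general pattern, not FiCi's internals: verify_all and check_one are hypothetical names, and check_one stands for whatever performs one citation lookup.

```python
from concurrent.futures import ThreadPoolExecutor


def verify_all(citations, check_one, max_workers=4):
    """Apply check_one to every citation, preserving input order."""
    if max_workers <= 1:
        # Sequential execution, as when max_workers is set to 1.
        return [check_one(c) for c in citations]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order even if calls finish out of order.
        return list(pool.map(check_one, citations))


results = verify_all(["ref a", "ref b", "ref c"], str.upper, max_workers=2)
```

Threads suit this workload because each lookup spends most of its time waiting on the network, so a small pool gives concurrency without exceeding API rate limits.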

Current limitations

  • Title extraction from raw strings is heuristic; unusual punctuation or missing years can occasionally yield an incomplete suspected_title, which is why scoring also consults the full raw string.
  • Author matching uses surname containment rather than a structured parse. If you'd like structured parsing via anystyle or GROBID, that's a clean extension point on CitationSearcher._prepare_query.
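
Surname containment, as described in the last bullet, can be illustrated as follows. The function name, the per-author bonus, and the cap are hypothetical values chosen for the example; the sketch also assumes the surname is the last whitespace-separated token of each API-returned name, which fails for multi-word surnames and "Last, First" forms.

```python
import re


def surname_bonus(raw_citation, api_authors, per_author=2.0, cap=6.0):
    """Return a small score bonus for each author surname found in the raw string."""
    haystack = raw_citation.casefold()
    hits = 0
    for name in api_authors:
        parts = name.split()
        if not parts:
            continue
        surname = parts[-1].casefold()
        # Whole-word match, so "Lee" does not match inside "sleeve".
        if re.search(rf"\b{re.escape(surname)}\b", haystack):
            hits += 1
    return min(cap, hits * per_author)


bonus = surname_bonus(
    "J. Smith and A. Jones. Deep nets. 2020.",
    ["John Smith", "Alice Jones", "Bob Lee"],
)
```

A structured parser would replace the last-token guess with real name fields, which is the extension the bullet above points to.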
