Citation Hallucination Stop Library

A Python library for filtering academic references to match only in-text citations, eliminating hallucinated or non-cited references.

Features

  • Smart Citation Matching: Accurately matches in-text citations to reference entries
  • Multiple Citation Formats: Handles (Author, Year), (Author et al., Year), (Author & Author, Year)
  • Fuzzy Matching: Supports name variants (Mc/Mac, hyphenated names, accents)
  • APA Format Validation: Ensures references follow proper APA formatting
  • Detailed Statistics: Provides match/unmatch statistics and reduction percentages
  • Format Agnostic: Works with markdown, LaTeX, plain text, and other formats
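
The name-variant handling listed above can be pictured as reducing each surname to a comparison key before matching. The snippet below is a minimal sketch under that assumption, using only the standard library; `normalize_surname` is a hypothetical helper for illustration and is not part of the library's documented API.

```python
import re
import unicodedata


def normalize_surname(name: str) -> str:
    """Reduce a surname to a comparison key (hypothetical helper, not the library's API)."""
    # Decompose accented characters and drop the combining marks: "Müller" -> "Muller".
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    key = stripped.lower()
    # Ignore hyphens, apostrophes, and spaces: "Smith-Jones" -> "smithjones".
    key = re.sub(r"[-'\u2019\s]", "", key)
    # Fold a leading "mac" into "mc" so "MacDonald" and "McDonald" share a key.
    # Requiring three more letters leaves short names such as "Mack" untouched.
    key = re.sub(r"^mac(?=[a-z]{3,})", "mc", key)
    return key


print(normalize_surname("MacDonald") == normalize_surname("McDonald"))
print(normalize_surname("García"))
```

Folding "Mac" to "Mc" is deliberately crude: it also rewrites names such as "Macron", which only matters if a reference list contains both spellings as different people.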

Installation

pip install citation-hallucination-stop

Or install from source:

git clone https://github.com/danbroz/CitationHallucinationStop.git
cd CitationHallucinationStop
pip install -e .

Quick Start

from citation_hallucination_stop import CitationHallucinationStop

# Initialize the cleaner
cleaner = CitationHallucinationStop()

# Your text content with citations
text_content = """
This is a sample text with citations (Smith, 2021; Johnson et al., 2020).
Another citation here (Brown & Davis, 2019).
"""

# Your reference list
all_references = [
    "Smith, J. (2021). Title of paper. Journal Name.",
    "Johnson, A., Wilson, B., & Lee, C. (2020). Another paper. Journal Name.",
    "Brown, M., & Davis, K. (2019). Third paper. Journal Name.",
    "Uncited, R. (2022). This won't be included. Journal Name."
]

# Clean the references
result = cleaner.clean_references(text_content, all_references)

# Get filtered references
filtered_refs = result['references']
print("Filtered References:")
for ref in filtered_refs:
    print(ref)

# Get statistics
stats = result['statistics']
print("\nStatistics:")
print(f"Total citations: {stats['total_citations']}")
print(f"Matched citations: {stats['matched_citations']}")
print(f"References reduced: {stats['reduction_percentage']}%")

API Reference

CitationHallucinationStop Class

__init__(strict_mode=True)

Initialize the citation cleaner.

  • strict_mode (bool): If True, only include references with valid APA format
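
The exact rules behind strict mode are not documented here. As a rough illustration only, a format check of this kind could look for a "Surname, I." author followed by "(Year)." and a title. `looks_like_apa` and `filter_by_format` below are hypothetical names, and the pattern is a simplified assumption rather than the library's validator.

```python
import re
from typing import List

# Simplified, assumed shape of an APA entry: authors, "(YYYY).", then more text.
APA_LIKE = re.compile(
    r"^(?P<authors>[^()]+?)\s*\((?P<year>\d{4}[a-z]?)\)\.\s+(?P<rest>\S.*)$"
)
# The author segment must open with "Surname, I." (a surname, a comma, an initial).
AUTHOR_START = re.compile(r"[^\W\d_][\w'\u2019\- ]*,\s+[A-Z]\.")


def looks_like_apa(reference: str) -> bool:
    """Return True if the entry has the rough 'Surname, I. (Year). Title' shape."""
    match = APA_LIKE.match(reference.strip())
    if match is None:
        return False
    return AUTHOR_START.match(match.group("authors")) is not None


def filter_by_format(references: List[str], strict_mode: bool = True) -> List[str]:
    """Keep only APA-like entries in strict mode; keep everything otherwise."""
    if not strict_mode:
        return list(references)
    return [ref for ref in references if looks_like_apa(ref)]


refs = ["Smith, J. (2021). Title of paper. Journal Name.", "Smith 2021 Title of paper"]
print(filter_by_format(refs, strict_mode=True))
print(filter_by_format(refs, strict_mode=False))
```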

clean_references(text_content, all_references)

Clean references to only include those actually cited in the text.

Parameters:

  • text_content (str): Text content with citations
  • all_references (List[str]): List of all available references

Returns:

  • Dictionary with:
    • references: List of numbered, filtered references
    • unmatched_references: List of matched references (unnumbered)
    • statistics: Dictionary with match statistics
    • match_scores: Dictionary of reference match scores

extract_citations(text)

Extract all parenthetical citations from text.

Parameters:

  • text (str): Text content to extract citations from

Returns:

  • List of (surname, year) tuples from citations
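
To make the return shape concrete, here is a standalone sketch of how parenthetical citations might be reduced to (surname, year) tuples with regular expressions. It does not call the library. `extract_citations_sketch` is a hypothetical name, and returning the year as a string is an assumption, since the type is not stated above.

```python
import re
from typing import List, Tuple

# A parenthesised group that contains a four-digit year somewhere inside it.
PAREN_GROUP = re.compile(r"\(([^()]*\d{4}[^()]*)\)")
# One citation inside a group: author text, a comma, then the year.
SINGLE_CITE = re.compile(r"^(?P<authors>[^\d,;]+?),\s*(?P<year>\d{4}[a-z]?)\b")
# Everything from " et al.", " & ", or " and " onward is not the first surname.
AUTHOR_SPLIT = re.compile(r"\s+et al\.?|\s*&\s*|\s+and\s+")


def extract_citations_sketch(text: str) -> List[Tuple[str, str]]:
    """Return (first-author surname, year) for each parenthetical citation."""
    found = []
    for group in PAREN_GROUP.findall(text):
        # "(Smith, 2021; Johnson et al., 2020)" holds two citations.
        for part in group.split(";"):
            match = SINGLE_CITE.match(part.strip())
            if match is None:
                continue  # e.g. "(n = 1234)" is not a citation
            surname = AUTHOR_SPLIT.split(match.group("authors"))[0].strip()
            found.append((surname, match.group("year")))
    return found


sample = (
    "Sample text (Smith, 2021; Johnson et al., 2020). "
    "More text (Brown & Davis, 2019) and a page reference (Lee, 2018, p. 123)."
)
print(extract_citations_sketch(sample))
```

Keeping only the first surname mirrors how "et al." citations are written, but it means two-author citations are matched on the first author alone in this sketch.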

clean_document(content, reference_section)

Clean a document by replacing the references section with filtered references.

Parameters:

  • content (str): Full document content
  • reference_section (str): References section to replace

Returns:

  • Document with cleaned references

get_unmatched_citations(text_content, all_references)

Get list of citations that couldn't be matched to references.

Parameters:

  • text_content (str): Text content with citations
  • all_references (List[str]): List of all available references

Returns:

  • List of unmatched citation strings
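
Unmatched citations are the ones most likely to point at missing or hallucinated sources. The sketch below shows one plausible way to find them: build a (surname, year) key from each reference and report citations with no matching key. `reference_key` and `unmatched` are hypothetical helpers, the matching is exact rather than fuzzy, and none of this is taken from the library's implementation.

```python
import re
from typing import Iterable, List, Optional, Tuple


def reference_key(reference: str) -> Optional[Tuple[str, str]]:
    """Return (first-author surname, year) from an APA-style entry, or None."""
    # Optional list numbering ("1. "), then the surname up to the first comma,
    # then the first "(YYYY)" that follows.
    match = re.match(r"\s*(?:\d+\.\s+)?([^,]+),.*?\((\d{4}[a-z]?)\)", reference)
    if match is None:
        return None
    return match.group(1).strip().lower(), match.group(2)


def unmatched(citations: Iterable[Tuple[str, str]], references: Iterable[str]) -> List[str]:
    """Return citation strings that have no reference with the same surname and year."""
    keys = {reference_key(ref) for ref in references}
    keys.discard(None)
    return [
        f"({surname}, {year})"
        for surname, year in citations
        if (surname.lower(), year) not in keys
    ]


references = [
    "Smith, J. (2021). Title of paper. Journal Name.",
    "Johnson, A., Wilson, B., & Lee, C. (2020). Another paper. Journal Name.",
]
citations = [("Smith", "2021"), ("Johnson", "2020"), ("Ghost", "2023")]
print(unmatched(citations, references))
```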

Examples

Basic Usage

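from citation_hallucination_stop import CitationHallucinationStop

# Example inputs: body text with a citation, plus the candidate reference list
text = "Prior work established the baseline (Smith, 2021)."
references = ["Smith, J. (2021). Title of paper. Journal Name."]

cleaner = CitationHallucinationStop()
result = cleaner.clean_references(text, references)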

Batch Processing

import os
import re
from citation_hallucination_stop import CitationHallucinationStop

cleaner = CitationHallucinationStop()

# Process every Markdown file in documents/
for filename in os.listdir('documents/'):
    if filename.endswith('.md'):
        with open(os.path.join('documents', filename), 'r', encoding='utf-8') as f:
            content = f.read()

        # Split the body from the references section
        parts = content.split('\n## References\n')
        if len(parts) == 2:
            main_content, refs_section = parts
            # Keep every numbered entry ("1.", "2.", ..., "10.", ...), not only the first three
            ref_lines = [line.strip() for line in refs_section.split('\n')
                         if re.match(r'\d+\.', line.strip())]

            # Clean references
            result = cleaner.clean_references(content, ref_lines)

            # Save cleaned document
            cleaned_content = main_content + '\n\n## References\n\n'
            cleaned_content += '\n'.join(result['references'])

            with open(f'cleaned_{filename}', 'w', encoding='utf-8') as f:
                f.write(cleaned_content)

Advanced Configuration

# Use non-strict mode to include references with formatting issues
cleaner = CitationHallucinationStop(strict_mode=False)

# Get detailed statistics
result = cleaner.clean_references(text, references)
stats = result['statistics']

print(f"Reduced references from {stats['total_references']} to {stats['filtered_references']}")
print(f"Match rate: {stats['matched_citations']}/{stats['total_citations']} citations matched")
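
For orientation, the reduction figure is presumably the share of references that were dropped. The helper below is a sketch under that assumption. `reduction_percentage` here is a hypothetical function, and the library's own rounding and zero-handling are not documented above.

```python
def reduction_percentage(total_references: int, filtered_references: int) -> float:
    """Share of references removed, as a percentage rounded to one decimal place."""
    if total_references == 0:
        return 0.0  # nothing to reduce; avoids dividing by zero
    removed = total_references - filtered_references
    return round(100.0 * removed / total_references, 1)


# Four references with three kept, as in the Quick Start reference list.
print(reduction_percentage(4, 3))
```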

Citation Formats Supported

The library recognizes these citation formats:

  • (Author, Year) - Single author
  • (Author & Author, Year) - Two authors
  • (Author et al., Year) - Multiple authors
  • (Author, Year; Author, Year) - Multiple citations
  • (Author, Year, p. 123) - Citations with page numbers

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

v1.0.0

  • Initial release
  • Core citation matching functionality
  • APA format validation
  • Fuzzy matching for name variants
  • Detailed statistics and reporting
