Skip to main content

A Python library for filtering academic references to match only in-text citations

Project description

Citation Hallucination Stop Library

A Python library for filtering academic references to match only in-text citations, eliminating hallucinated or non-cited references.

Features

  • Smart Citation Matching: Accurately matches in-text citations to reference entries
  • Multiple Citation Formats: Handles (Author, Year), (Author et al., Year), (Author & Author, Year)
  • Fuzzy Matching: Supports name variants (Mc/Mac, hyphenated names, accents)
  • APA Format Validation: Ensures references follow proper APA formatting
  • Detailed Statistics: Provides match/unmatch statistics and reduction percentages
  • Format Agnostic: Works with markdown, LaTeX, plain text, and other formats

Installation

pip install citation-hullination-stop

Or install from source:

git clone https://github.com/example/citation-hullination-stop.git
cd citation-hullination-stop
pip install -e .

Quick Start

from citation_hallucination_stop import CitationHallucinationStop

# Initialize the cleaner
cleaner = CitationHallucinationStop()

# Your text content with citations
text_content = """
This is a sample text with citations (Smith, 2021; Johnson et al., 2020).
Another citation here (Brown & Davis, 2019).
"""

# Your reference list
all_references = [
    "Smith, J. (2021). Title of paper. Journal Name.",
    "Johnson, A., Wilson, B., & Lee, C. (2020). Another paper. Journal Name.",
    "Brown, M., & Davis, K. (2019). Third paper. Journal Name.",
    "Uncited, R. (2022). This won't be included. Journal Name."
]

# Clean the references
result = cleaner.clean_references(text_content, all_references)

# Get filtered references
filtered_refs = result['references']
print("Filtered References:")
for ref in filtered_refs:
    print(ref)

# Get statistics
stats = result['statistics']
print(f"\nStatistics:")
print(f"Total citations: {stats['total_citations']}")
print(f"Matched citations: {stats['matched_citations']}")
print(f"References reduced: {stats['reduction_percentage']}%")

API Reference

CitationHallucinationStop Class

__init__(strict_mode=True)

Initialize the citation cleaner.

  • strict_mode (bool): If True, only include references with valid APA format

clean_references(text_content, all_references)

Clean references to only include those actually cited in the text.

Parameters:

  • text_content (str): Text content with citations
  • all_references (List[str]): List of all available references

Returns:

  • Dictionary with:
    • references: List of numbered, filtered references
    • unmatched_references: List of matched references (unnumbered)
    • statistics: Dictionary with match statistics
    • match_scores: Dictionary of reference match scores

extract_citations(text)

Extract all parenthetical citations from text.

Parameters:

  • text (str): Text content to extract citations from

Returns:

  • List of (surname, year) tuples from citations

clean_document(content, reference_section)

Clean a document by replacing the references section with filtered references.

Parameters:

  • content (str): Full document content
  • reference_section (str): References section to replace

Returns:

  • Document with cleaned references

get_unmatched_citations(text_content, all_references)

Get list of citations that couldn't be matched to references.

Parameters:

  • text_content (str): Text content with citations
  • all_references (List[str]): List of all available references

Returns:

  • List of unmatched citation strings

Examples

Basic Usage

from citation_hallucination_stop import CitationHallucinationStop

cleaner = CitationHallucinationStop()
result = cleaner.clean_references(text, references)

Batch Processing

import os
from citation_hallucination_stop import CitationHallucinationStop

cleaner = CitationHallucinationStop()

# Process multiple files
for filename in os.listdir('documents/'):
    if filename.endswith('.md'):
        with open(f'documents/{filename}', 'r') as f:
            content = f.read()
        
        # Extract references section
        parts = content.split('\n## References\n')
        if len(parts) == 2:
            main_content, refs_section = parts
            ref_lines = [line.strip() for line in refs_section.split('\n') 
                        if line.strip() and line.startswith(('1.', '2.', '3.'))]
            
            # Clean references
            result = cleaner.clean_references(content, ref_lines)
            
            # Save cleaned document
            cleaned_content = main_content + '\n\n## References\n\n'
            cleaned_content += '\n'.join(result['references'])
            
            with open(f'cleaned_{filename}', 'w') as f:
                f.write(cleaned_content)

Advanced Configuration

# Use non-strict mode to include references with formatting issues
cleaner = CitationHallucinationStop(strict_mode=False)

# Get detailed statistics
result = cleaner.clean_references(text, references)
stats = result['statistics']

print(f"Reduced references from {stats['total_references']} to {stats['filtered_references']}")
print(f"Match rate: {stats['matched_citations']}/{stats['total_citations']} citations matched")

Citation Formats Supported

The library recognizes these citation formats:

  • (Author, Year) - Single author
  • (Author & Author, Year) - Two authors
  • (Author et al., Year) - Multiple authors
  • (Author, Year; Author, Year) - Multiple citations
  • (Author, Year, p. 123) - Citations with page numbers

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

v1.0.0

  • Initial release
  • Core citation matching functionality
  • APA format validation
  • Fuzzy matching for name variants
  • Detailed statistics and reporting

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

citation_hallucination_stop-1.0.1.tar.gz (10.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

citation_hallucination_stop-1.0.1-py3-none-any.whl (9.3 kB view details)

Uploaded Python 3

File details

Details for the file citation_hallucination_stop-1.0.1.tar.gz.

File metadata

File hashes

Hashes for citation_hallucination_stop-1.0.1.tar.gz
Algorithm Hash digest
SHA256 cc2e398736ac06440cbdaf5b8476efe6e25789a3dc63605d58c072092c0d8a77
MD5 d8c6903a17e04662562f2b3def1996d1
BLAKE2b-256 b8727177078fcce5f82ae3f9f24a3501f82902a66ed72d27ae0457a316ebe88d

See more details on using hashes here.

File details

Details for the file citation_hallucination_stop-1.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for citation_hallucination_stop-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 0f931465dac6e4129e82cdc648f67263cf1eb6576a0a7c952395fcfdb51bc47a
MD5 3ca98a5fa3ca96c07f9a374c671ac7a6
BLAKE2b-256 d4dc5471b9a6826ee0998b905913dc49c5d503baa78f8531348c067c80ac23a1

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page