Skip to main content

A comprehensive BibTeX bibliography management toolkit for academic research

Project description

Awesome Citations

Awesome Citations is a comprehensive Python-based BibTeX bibliography management toolkit that automates the entire citation workflow. From completing incomplete entries by fetching from official sources (IEEE, ACM, arXiv, CrossRef, Semantic Scholar), to standardizing formatting, replacing arXiv preprints with published versions, and generating formatted PDF bibliographies - all in a single command.

Perfect for managing bibliographies for academic papers, theses, and research projects with support for multiple citation styles and bilingual (English/Chinese) journals.

Table of Contents

Features

Core Capabilities

  • ๐Ÿš€ Complete Workflow Automation - One-stop BibTeX processing pipeline in a single command

    • Sort and deduplicate entries by ID
    • Complete missing fields from multiple sources
    • Standardize formatting across all entries
    • Replace arXiv preprints with published versions
    • Generate detailed change logs
    • Create formatted PDF bibliographies
  • ๐Ÿ“š Multi-Source BibTeX Completion - Fetch missing fields from 5 official sources:

    • IEEE Xplore - Primary source for IEEE publications (with Selenium fallback)
    • ACM Digital Library - For ACM publications
    • arXiv API - For preprints and arXiv papers
    • CrossRef - Universal fallback for any DOI
    • Semantic Scholar - For published versions of arXiv papers
    • Intelligent fallback chains ensure maximum success rate
  • ๐Ÿ”„ Smart arXiv Preprint Replacement

    • Automatically detects arXiv preprints in your bibliography
    • Searches for published versions using Semantic Scholar, DBLP, and CrossRef APIs
    • Replaces preprint entries with complete journal/conference publication data
    • Preserves original entry IDs for reference consistency
  • โœจ Comprehensive Field Standardization

    • Title formatting: Title Case, Sentence case, with protected acronyms (IoT, WiFi, etc.)
    • Author formatting: First-last or Last-first name ordering
    • Journal normalization: Full names, abbreviations, or preserve both (50+ journal mappings)
    • Page formatting: LaTeX double-dash (100--110) or single-dash (100-110)
  • ๐Ÿ“Š PDF Bibliography Generation

    • Support for 4 citation styles: IEEE, ACM, APA, GB/T 7714 (Chinese standard)
    • Customizable templates for each style
    • Automatic LaTeX compilation with biber
    • Configurable document title, font size, paper size
  • ๐Ÿ“ Detailed Change Tracking

    • Every modification is logged with before/after values
    • Source attribution (which API provided the data)
    • Markdown-formatted change reports
    • Summary statistics (entries processed, fields added, errors)
  • ๐Ÿ” Analysis Tools

    • Analyze bibliography by reference types (article, inproceedings, etc.)
    • Publication year distribution (sorted newest first)
    • Publication venue frequency (sorted by count)
    • Formatted table output
  • ๐ŸŒ Bilingual Support

    • English and Chinese journal name mappings
    • Chinese journal metadata database
    • GB/T 7714 citation style for Chinese standards
    • Bilingual document titles support
  • ๐Ÿ›ก๏ธ Robust Error Handling

    • DOI validation via HEAD requests to doi.org
    • Title similarity checks (60% word overlap required)
    • Year consistency validation (ยฑ1 year tolerance)
    • Failed DOI tracking in /data/failed_dois.json
    • Manual DOI correction database support
    • Rate limiting to respect API limits (configurable delays)

Installation

Prerequisites

  • Python 3.8+ (tested on Python 3.12)
  • uv - Modern Python package manager (recommended) or pip
  • LaTeX distribution (optional, required for PDF generation)
    • macOS: brew install --cask mactex
    • Ubuntu/Debian: sudo apt-get install texlive-full
    • Windows: Install MiKTeX or TeX Live

Install uv (if not already installed)

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows (PowerShell)
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or via pip
pip install uv

Install Python Dependencies

Method 1: Using uv (Recommended - Fast & Modern)

# Install all dependencies
uv pip install -r requirements.txt

# Or create and activate a virtual environment with uv
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt

Method 2: Using pip (Traditional)

pip install -r requirements.txt

Method 3: Manual installation with uv

uv pip install bibtexparser tabulate requests beautifulsoup4 lxml pyyaml habanero scholarly

Optional dependencies for enhanced features:

# For IEEE Selenium fallback (if API fails)
uv pip install selenium webdriver-manager

# For testing and development
uv pip install pytest pytest-cov pytest-timeout pytest-mock

Verify Installation

# Test the complete workflow
uv run python scripts/workflow_complete.py examples/sample_input.bib --output output/test_result.bib

# Or if using activated virtual environment
python3 scripts/workflow_complete.py examples/sample_input.bib --output output/test_result.bib

# Check if LaTeX is installed (for PDF generation)
pdflatex --version
biber --version

Quick Start

Process your BibTeX file in one command:

# Using uv (recommended)
uv run python scripts/workflow_complete.py refs.bib

# Or with activated virtual environment
python3 scripts/workflow_complete.py refs.bib

This will create:

  • refs_completed.bib - Your processed bibliography
  • refs_completed_changes.md - Detailed change log
  • refs_completed.pdf - Formatted PDF (if LaTeX is installed)

With custom configuration:

uv run python scripts/workflow_complete.py refs.bib --output output/my_refs.bib --config config.json

Usage

๐Ÿš€ Complete Workflow (Recommended)

The workflow_complete.py script orchestrates the entire BibTeX processing pipeline in a single command. This is the recommended way to use Awesome Citations.

Basic usage:

# Using uv (recommended - handles dependencies automatically)
uv run python scripts/workflow_complete.py input.bib

# Or with activated virtual environment
python3 scripts/workflow_complete.py input.bib

What it does (7 automated steps):

  1. โœ… Sort and deduplicate entries by ID
  2. โœ… Complete missing fields from multiple sources (IEEE/ACM/arXiv/CrossRef/Semantic Scholar)
  3. โœ… Standardize formatting (titles, authors, journals, pages)
  4. โœ… Replace arXiv preprints with published versions
  5. โœ… Write output file with all changes
  6. โœ… Generate change summary (Markdown report with statistics)
  7. โœ… Generate PDF bibliography (IEEE style by default)

Output files:

  • input_completed.bib - Processed BibTeX file
  • input_completed_changes.md - Detailed change log
  • input_completed.pdf - Formatted PDF (if LaTeX is installed)

Custom output path:

uv run python scripts/workflow_complete.py refs.bib --output output/completed.bib --config config.json

Process multiple files:

# Process all .bib files in a directory
for file in *.bib; do
    uv run python scripts/workflow_complete.py "$file"
done

Key features:

  • Progress tracking with detailed console output
  • Error handling with fallback chains
  • Rate limiting (1.0s delay between requests by default)
  • Preserves original entry IDs
  • Logs all failures for manual review

๐Ÿ“– For detailed documentation, see docs/WORKFLOW_GUIDE.md


Individual Tools

For specific tasks, you can use individual scripts. Each script focuses on a single aspect of bibliography management.

1. Complete BibTeX Entries

Complete incomplete BibTeX entries by automatically fetching missing information from official sources using DOI.

File: scripts/complete_bibtex.py

Usage:

uv run python scripts/complete_bibtex.py

Edit the script to set input and output file paths, or modify to accept command-line arguments.

Features:

  • Extracts DOI from entry fields (doi, url, note) or URLs
  • Validates DOI existence via HEAD request to doi.org
  • Identifies publisher from DOI prefix (10.1109 = IEEE, 10.1145 = ACM, etc.)
  • Fetches from appropriate source with intelligent fallback chain:
    • IEEE Xplore โ†’ Selenium fallback โ†’ CrossRef โ†’ Google Scholar
    • ACM Digital Library โ†’ CrossRef โ†’ Google Scholar
    • arXiv API โ†’ CrossRef โ†’ Google Scholar
  • Validates fetched data (title similarity, year consistency, DOI match)
  • Logs failed attempts to /data/failed_dois.json
  • Respects rate limits (0.5s default delay)

Note: This feature uses web scraping and may be affected by website changes.

2. Format BibTeX Fields

Standardize field formatting across all entries according to your preferences.

File: scripts/format_bibtex.py

Usage:

uv run python scripts/format_bibtex.py input.bib output.bib config.json

Formatting options:

  • Title formatting:

    • titlecase: "A Survey on Machine Learning Techniques"
    • sentencecase: "A survey on machine learning techniques"
    • Protects acronyms: "{IoT}-Based {WiFi} System"
  • Author formatting:

    • first_last: "John Smith and Jane Doe"
    • last_first: "Smith, John and Doe, Jane"
  • Journal formatting:

    • abbreviation: "IEEE Trans. Pattern Anal. Mach. Intell."
    • full: "IEEE Transactions on Pattern Analysis and Machine Intelligence"
    • both: Keep original format
  • Page formatting:

    • double_dash: "100--110" (LaTeX format)
    • single_dash: "100-110"

3. Sort BibTeX File

Sort entries alphabetically by their citation keys (entry IDs).

File: scripts/sort_bibtex.py

Usage:

uv run python scripts/sort_bibtex.py

Edit the script to set input and output file paths.

Features:

  • Alphabetical sorting by entry ID
  • Preserves entry formatting
  • Removes duplicate entries with the same ID

4. Analyze BibTeX File

Generate statistical analysis and tables for your bibliography.

File: scripts/analyze_bibtex.py

Usage:

uv run python scripts/analyze_bibtex.py

Edit the script to set the input file path.

Outputs three formatted tables:

  1. Reference Types (article, inproceedings, book, etc.)

    Type            Count
    -------------  -------
    article             45
    inproceedings       32
    book                 8
    
  2. Publication Years (sorted newest first)

    Year    Count
    ------  -------
    2024         15
    2023         28
    2022         22
    
  3. Publication Venues (sorted by frequency)

    Publication                                    Count
    -------------------------------------------  -------
    IEEE Transactions on Neural Networks            12
    ACM Computing Surveys                             8
    Nature Machine Intelligence                       5
    

5. Generate PDF Bibliography

Create a formatted PDF bibliography using LaTeX templates.

File: scripts/generate_pdf.py

Usage:

uv run python scripts/generate_pdf.py input.bib output.pdf ieee config.json

Supported citation styles:

  • ieee - IEEE numeric citations (default)
  • acm - ACM author-year citations
  • apa - APA psychology standard
  • gb7714 - Chinese GB/T 7714 standard

Requirements:

  • LaTeX distribution (pdflatex + biber)
  • Templates in /templates/ directory

Customization options (via config.json):

{
  "pdf_output": {
    "enabled": true,
    "document_title": "ๅ‚่€ƒๆ–‡็Œฎๅˆ—่กจ / References",
    "sort_by": "author",          // or "year", "title"
    "font_size": "11pt",           // or "10pt", "12pt"
    "paper_size": "a4paper"        // or "letterpaper"
  }
}

6. Replace arXiv Preprints

Detect arXiv preprints and replace them with published versions.

File: utils/arxiv_detector.py

Usage:

from utils.arxiv_detector import detect_and_replace_arxiv

# Automatically called in workflow_complete.py
entries_updated = detect_and_replace_arxiv(bib_database)

Detection methods:

  • Checks journal field for "arXiv"
  • Checks eprint field for arXiv ID
  • Checks DOI for arXiv format (10.48550/arXiv.*)

Search strategy:

  1. Semantic Scholar API (most reliable for arXiv papers)
  2. DBLP API (excellent for CS papers)
  3. CrossRef API (comprehensive coverage)

Filters:

  • Only replaces with journal or conference publications
  • Ignores other preprints
  • Preserves original entry ID

Configuration

Awesome Citations uses a JSON configuration file to control all aspects of the processing workflow. The default configuration is in config.json.

Configuration File Structure

{
  "citation_style": "ieee",
  "custom_biblatex_style": null,
  "title_format": "titlecase",
  "journal_format": "both",
  "author_format": "first_last",
  "page_format": "double_dash",
  "arxiv_handling": "replace_with_published",
  "data_source_priority": ["doi_official", "dblp", "crossref"],
  "merge_multiple_sources": true,
  "parallel_processing": true,
  "max_workers": 5,
  "request_delay": 1.0,
  "pdf_output": {
    "enabled": true,
    "document_title": "ๅ‚่€ƒๆ–‡็Œฎๅˆ—่กจ / References",
    "sort_by": "author",
    "font_size": "11pt",
    "paper_size": "a4paper"
  },
  "logging": {
    "enabled": true,
    "output_file": "changes_log.md",
    "verbose": true
  }
}

Configuration Options Explained

General Settings

  • citation_style (string): PDF citation style

    • Options: "ieee", "acm", "apa", "gb7714"
    • Default: "ieee"
  • custom_biblatex_style (string or null): Custom biblatex style file path

    • Use if you have a custom .bbx/.cbx file
    • Default: null

Formatting Options

  • title_format (string): How to format entry titles

    • "titlecase": "A Survey on Machine Learning" (capitalize major words)
    • "sentencecase": "A survey on machine learning" (only first word)
    • "preserve": Keep original formatting
    • Default: "titlecase"
  • journal_format (string): Journal name formatting

    • "abbreviation": "IEEE Trans. Pattern Anal." (abbreviated)
    • "full": "IEEE Transactions on Pattern Analysis..." (full name)
    • "both": Preserve original format
    • Default: "both"
  • author_format (string): Author name ordering

    • "first_last": "John Smith and Jane Doe"
    • "last_first": "Smith, John and Doe, Jane"
    • Default: "first_last"
  • page_format (string): Page number dash style

    • "double_dash": "100--110" (LaTeX standard)
    • "single_dash": "100-110"
    • Default: "double_dash"

Data Fetching Options

  • arxiv_handling (string): How to handle arXiv preprints

    • "replace_with_published": Auto-replace with published versions
    • "keep": Preserve arXiv entries as-is
    • Default: "replace_with_published"
  • data_source_priority (array): Order of data sources to try

    • Options: "doi_official" (IEEE/ACM/arXiv), "dblp", "crossref", "google_scholar"
    • Default: ["doi_official", "dblp", "crossref"]
    • First successful source is used (unless merge_multiple_sources is true)
  • merge_multiple_sources (boolean): Merge data from multiple sources

    • true: Fetch from all sources and intelligently merge fields
    • false: Use only first successful source
    • Default: true

Performance Options

  • parallel_processing (boolean): Enable concurrent processing

    • true: Process multiple entries simultaneously (faster)
    • false: Process sequentially (safer for debugging)
    • Default: true
  • max_workers (integer): Number of concurrent worker threads

    • Range: 1-10 (recommended: 3-5)
    • Default: 5
    • Only applies if parallel_processing is true
  • request_delay (float): Delay between API requests (seconds)

    • Prevents rate limiting from APIs
    • Range: 0.2-5.0 seconds
    • Default: 1.0
    • Recommended: 0.5 for personal use, 1.0+ for large batches

PDF Output Settings

  • pdf_output.enabled (boolean): Generate PDF bibliography

    • Default: true
  • pdf_output.document_title (string): Title on PDF document

    • Default: "ๅ‚่€ƒๆ–‡็Œฎๅˆ—่กจ / References" (bilingual)
  • pdf_output.sort_by (string): Bibliography sorting order

    • Options: "author", "year", "title", "id"
    • Default: "author"
  • pdf_output.font_size (string): Document font size

    • Options: "10pt", "11pt", "12pt"
    • Default: "11pt"
  • pdf_output.paper_size (string): Paper size

    • Options: "a4paper", "letterpaper", "a5paper"
    • Default: "a4paper"

Logging Settings

  • logging.enabled (boolean): Enable change logging

    • Default: true
  • logging.output_file (string): Change log filename

    • Default: "changes_log.md"
  • logging.verbose (boolean): Include detailed logs

    • true: Log every field change
    • false: Only log summary statistics
    • Default: true

Example Configurations

Minimal processing (fast, conservative):

{
  "title_format": "preserve",
  "journal_format": "both",
  "arxiv_handling": "keep",
  "merge_multiple_sources": false,
  "parallel_processing": true,
  "max_workers": 10,
  "request_delay": 0.5
}

Maximum quality (slow, comprehensive):

{
  "title_format": "titlecase",
  "journal_format": "full",
  "arxiv_handling": "replace_with_published",
  "merge_multiple_sources": true,
  "parallel_processing": false,
  "request_delay": 2.0,
  "logging": {
    "verbose": true
  }
}

Chinese academic papers:

{
  "citation_style": "gb7714",
  "title_format": "preserve",
  "journal_format": "full",
  "pdf_output": {
    "document_title": "ๅ‚่€ƒๆ–‡็Œฎๅˆ—่กจ",
    "paper_size": "a4paper",
    "font_size": "12pt"
  }
}

Data Sources

Awesome Citations fetches bibliographic data from multiple authoritative sources:

Primary Sources (Publisher APIs)

  1. IEEE Xplore - For IEEE publications (DOI: 10.1109/*)

    • Primary: POST API to /xpl/downloadCitations
    • Fallback: Selenium browser automation
    • Coverage: IEEE journals, conferences, standards
  2. ACM Digital Library - For ACM publications (DOI: 10.1145/*)

    • Web scraping from /doi/{doi}/bibtex
    • Coverage: ACM journals, conferences, magazines
  3. arXiv API - For preprints (DOI: 10.48550/arXiv.*)

    • Official arXiv public API
    • Coverage: All arXiv papers

Fallback Sources

  1. CrossRef API - Universal DOI resolver

    • REST API for any DOI
    • Coverage: 130+ million records across all publishers
    • Very reliable for basic metadata
  2. Semantic Scholar API - Academic search engine

    • Best for finding published versions of arXiv papers
    • Provides publication venue information
    • Coverage: 200+ million papers
  3. DBLP API - Computer Science bibliography

    • Excellent for CS conference and journal papers
    • Provides canonical venue names
    • Coverage: 5+ million CS publications
  4. Google Scholar - Final fallback (title-based search)

    • Used when DOI lookup fails
    • Searches by title
    • Rate-limited, used sparingly

Fallback Strategy

Each entry follows an intelligent fallback chain:

1. Extract DOI from entry
   โ†“
2. Validate DOI exists (HEAD request to doi.org)
   โ†“
3. Identify publisher from DOI prefix
   โ†“
4. Try primary source (IEEE/ACM/arXiv API)
   โ†“
5. If fails โ†’ Try CrossRef
   โ†“
6. If fails โ†’ Try Google Scholar
   โ†“
7. If fails โ†’ Log to failed_dois.json

For arXiv papers specifically:

1. Check if entry is arXiv preprint
   โ†“
2. Try Semantic Scholar API (find published version)
   โ†“
3. If found โ†’ Fetch complete BibTeX from new DOI
   โ†“
4. If not found โ†’ Try DBLP API
   โ†“
5. If not found โ†’ Try CrossRef API
   โ†“
6. If no published version โ†’ Keep arXiv entry

Data Validation

All fetched data undergoes validation:

  • Title similarity: At least 60% word overlap with original title
  • Year consistency: Within ยฑ1 year of original (if present)
  • DOI match: Exact match with expected DOI
  • Field completeness: Must have at minimum: title, author, year

Failed validations are logged and the entry is skipped.

Manual Corrections

For persistent DOI fetch failures, you can add manual corrections to /data/doi_corrections.json:

{
  "10.1109/EXAMPLE.2024.1234567": {
    "title": "Correct Title",
    "author": "Smith, John and Doe, Jane",
    "journal": "IEEE Transactions on Example",
    "year": "2024",
    "volume": "15",
    "number": "3",
    "pages": "100--110",
    "doi": "10.1109/EXAMPLE.2024.1234567"
  }
}

These corrections are checked before attempting API fetches.

Project Structure

Awesome-Citations/
โ”œโ”€โ”€ README.md                   # This file
โ”œโ”€โ”€ LICENSE                     # GNU GPL v3.0
โ”œโ”€โ”€ requirements.txt            # Python dependencies
โ”œโ”€โ”€ config.json                 # Default configuration
โ”œโ”€โ”€ refs.bib                    # Example BibTeX file
โ”‚
โ”œโ”€โ”€ scripts/                    # Main executable scripts
โ”‚   โ”œโ”€โ”€ workflow_complete.py   # ๐Ÿš€ Complete workflow (recommended)
โ”‚   โ”œโ”€โ”€ complete_bibtex.py     # BibTeX completion from APIs
โ”‚   โ”œโ”€โ”€ format_bibtex.py       # Field standardization
โ”‚   โ”œโ”€โ”€ sort_bibtex.py         # Alphabetical sorting
โ”‚   โ”œโ”€โ”€ analyze_bibtex.py      # Statistical analysis
โ”‚   โ”œโ”€โ”€ generate_pdf.py        # PDF generation
โ”‚   โ”œโ”€โ”€ utilities.py           # Core utility functions
โ”‚   โ””โ”€โ”€ enhanced_complete.py   # Enhanced multi-source completion
โ”‚
โ”œโ”€โ”€ utils/                      # Utility modules
โ”‚   โ”œโ”€โ”€ change_logger.py       # Change tracking and logging
โ”‚   โ”œโ”€โ”€ arxiv_detector.py      # arXiv preprint detection
โ”‚   โ”œโ”€โ”€ title_formatter.py     # Title formatting utilities
โ”‚   โ””โ”€โ”€ multi_source_merger.py # Multi-source data merging
โ”‚
โ”œโ”€โ”€ data/                       # Data files and databases
โ”‚   โ”œโ”€โ”€ journal_abbr.json      # Journal abbreviation mappings (50+)
โ”‚   โ”œโ”€โ”€ protected_words.json   # Acronyms to protect in titles
โ”‚   โ”œโ”€โ”€ small_words.json       # Articles/prepositions for Title Case
โ”‚   โ”œโ”€โ”€ chinese_journals.json  # Chinese journal metadata
โ”‚   โ”œโ”€โ”€ doi_corrections.json   # Manual DOI corrections
โ”‚   โ””โ”€โ”€ failed_dois.json       # Failed DOI fetch log (auto-generated)
โ”‚
โ”œโ”€โ”€ templates/                  # LaTeX templates for PDF generation
โ”‚   โ”œโ”€โ”€ ieee_template.tex      # IEEE citation style
โ”‚   โ”œโ”€โ”€ acm_template.tex       # ACM citation style
โ”‚   โ”œโ”€โ”€ apa_template.tex       # APA citation style
โ”‚   โ””โ”€โ”€ gb7714_template.tex    # Chinese GB/T 7714 standard
โ”‚
โ”œโ”€โ”€ docs/                       # Documentation (16 files)
โ”‚   โ”œโ”€โ”€ WORKFLOW_GUIDE.md      # Complete workflow documentation
โ”‚   โ”œโ”€โ”€ USAGE_GUIDE.md         # Detailed usage instructions
โ”‚   โ”œโ”€โ”€ TEST_REPORT.md         # Testing results and coverage
โ”‚   โ”œโ”€โ”€ IEEE_FAILURE_ANALYSIS.md # IEEE API troubleshooting
โ”‚   โ””โ”€โ”€ ...                    # Implementation and optimization docs
โ”‚
โ”œโ”€โ”€ examples/                   # Example files
โ”‚   โ”œโ”€โ”€ sample_input.bib       # Sample BibTeX entries
โ”‚   โ””โ”€โ”€ sample_config.json     # Example configuration
โ”‚
โ”œโ”€โ”€ tests/                      # Comprehensive test suite (2,579 lines)
โ”‚   โ”œโ”€โ”€ conftest.py            # Pytest fixtures and configuration
โ”‚   โ”œโ”€โ”€ test_complete_bibtex.py # BibTeX completion tests (1,057 lines)
โ”‚   โ”œโ”€โ”€ test_format_bibtex.py  # Formatting tests (485 lines)
โ”‚   โ”œโ”€โ”€ test_ieee_integration.py # IEEE-specific tests (401 lines)
โ”‚   โ”œโ”€โ”€ test_integration.py    # End-to-end workflow tests (171 lines)
โ”‚   โ”œโ”€โ”€ test_utils.py          # Utility function tests (161 lines)
โ”‚   โ””โ”€โ”€ test_data/             # Test BibTeX files (14+ files)
โ”‚       โ”œโ”€โ”€ ieee_papers.bib
โ”‚       โ”œโ”€โ”€ acm_papers.bib
โ”‚       โ”œโ”€โ”€ arxiv_papers.bib
โ”‚       โ”œโ”€โ”€ mixed_entries.bib
โ”‚       โ””โ”€โ”€ ...
โ”‚
โ”œโ”€โ”€ research/                   # Research and exploration scripts
โ”‚   โ””โ”€โ”€ ieee_api_research.py   # IEEE API endpoint exploration
โ”‚
โ”œโ”€โ”€ output/                     # Generated outputs (auto-created)
โ”‚   โ”œโ”€โ”€ *.bib                  # Processed BibTeX files
โ”‚   โ”œโ”€โ”€ *_changes.md           # Change logs
โ”‚   โ””โ”€โ”€ *.pdf                  # Generated PDFs
โ”‚
โ””โ”€โ”€ .cache/                     # API response cache (auto-created)
    โ””โ”€โ”€ *.json                 # Cached responses (30-day expiry)

Key Directories Explained

  • scripts/ - All main executable scripts. Start here for any task.
  • utils/ - Reusable Python modules imported by scripts.
  • data/ - JSON databases for journal names, acronyms, corrections, and failure logs.
  • templates/ - LaTeX templates for generating PDFs in different citation styles.
  • docs/ - Comprehensive documentation covering implementation, testing, and usage.
  • examples/ - Sample files to test the tool and learn usage patterns.
  • tests/ - Full test suite with fixtures, test data, and comprehensive coverage.
  • output/ - Auto-created directory for all generated files.
  • .cache/ - Auto-created cache for API responses to reduce redundant requests.

Documentation

Comprehensive documentation is available in the /docs directory:

User Documentation

  • WORKFLOW_GUIDE.md - Complete workflow documentation

    • Step-by-step walkthrough of the complete workflow
    • Configuration options explained
    • Troubleshooting common issues
  • USAGE_GUIDE.md - Detailed usage instructions

    • Individual script usage examples
    • Advanced configuration scenarios
    • Best practices for different use cases

Developer Documentation

  • IMPLEMENTATION_SUMMARY.md - Implementation details

    • Architecture overview
    • Core algorithms and data flows
    • API integration details
  • TEST_REPORT.md - Testing results and coverage

    • Test suite overview
    • Coverage statistics
    • Known issues and limitations
  • IEEE_FAILURE_ANALYSIS.md - IEEE API troubleshooting

    • Common IEEE API errors
    • Selenium fallback strategies
    • Debugging tips

Project History

  • COMPLETION_SUMMARY.md - BibTeX completion feature development
  • OPTIMIZATION_RESULTS.md - Performance optimization results
  • FILE_REORGANIZATION_SUMMARY.md - Project restructuring notes
  • WORKFLOW_IMPLEMENTATION_SUMMARY.md - Complete workflow development

All documentation files are in Markdown format and contain detailed explanations with code examples.

Testing

Awesome Citations includes a comprehensive test suite to ensure reliability and correctness.

Test Suite Overview

  • Total test code: 2,579 lines
  • Test files: 6 main test files
  • Test data: 14+ specialized BibTeX files
  • Framework: pytest with coverage, timeout, and mock support

Running Tests

Run all tests:

pytest tests/

Run with coverage report:

pytest --cov=scripts --cov=utils --cov-report=html tests/

Run specific test file:

pytest tests/test_complete_bibtex.py -v

Run tests matching a pattern:

pytest tests/ -k "test_ieee" -v

Test Categories

  1. Unit Tests (test_utils.py)

    • Core utility functions
    • Deduplication logic
    • Sorting algorithms
  2. BibTeX Completion Tests (test_complete_bibtex.py)

    • DOI extraction and validation
    • Publisher identification
    • API fetching and fallback chains
    • Data validation
    • Error handling
  3. Formatting Tests (test_format_bibtex.py)

    • Title formatting (Title Case, Sentence case)
    • Author name formatting
    • Journal normalization
    • Page formatting
  4. IEEE Integration Tests (test_ieee_integration.py)

    • IEEE API integration
    • Selenium fallback testing
    • Real IEEE DOI fetching
    • Error scenarios
  5. End-to-End Tests (test_integration.py)

    • Complete workflow execution
    • Configuration handling
    • Output file generation
    • Change log creation

Test Data Files

Located in /tests/test_data/:

  • ieee_papers.bib - IEEE journal and conference papers
  • acm_papers.bib - ACM publications
  • arxiv_papers.bib - arXiv preprints
  • crossref_papers.bib - CrossRef-sourced entries
  • duplicate_entries.bib - Duplicate detection tests
  • malformed_entries.bib - Error handling tests
  • mixed_entries.bib - Mixed publisher entries
  • chinese_journals.bib - Chinese publications
  • And more...

Continuous Testing

The test suite is designed to:

  • Catch regressions early
  • Validate API integrations
  • Ensure data quality
  • Test error handling paths
  • Verify configuration options

For detailed test results, see docs/TEST_REPORT.md

Troubleshooting

Common Issues

1. IEEE API fails frequently

  • The IEEE Xplore API can be unstable
  • Selenium fallback is automatically used
  • See docs/IEEE_FAILURE_ANALYSIS.md for details
  • Consider increasing request_delay in config

2. LaTeX not found

  • PDF generation requires LaTeX (pdflatex + biber)
  • Install TeX Live (Linux), MacTeX (macOS), or MiKTeX (Windows)
  • Verify: pdflatex --version and biber --version

3. Rate limiting errors

  • Increase request_delay in config.json (try 2.0 or higher)
  • Reduce max_workers (try 3 or less)
  • Some APIs have strict rate limits (especially Google Scholar)

4. Title similarity validation fails

  • Fetched data must match original title (60% word overlap)
  • Check if original title is correct in your .bib file
  • Add manual correction to /data/doi_corrections.json

5. DOI not found

  • Check /data/failed_dois.json for error details
  • Verify DOI is correct and exists at https://doi.org/
  • Some DOIs may not be indexed by all APIs

6. Module import errors

  • Ensure all dependencies are installed: pip install -r requirements.txt
  • Check Python version: 3.8+ required (tested on 3.12)

Getting Help

If you encounter issues not covered here:

  1. Check the documentation in /docs
  2. Review test files for usage examples
  3. Check GitHub issues (if applicable)
  4. Review error logs in console output and /data/failed_dois.json

Contributing

Contributions are welcome! If you'd like to contribute to Awesome Citations:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run the test suite to ensure nothing breaks (pytest tests/)
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Areas for Contribution

  • Adding support for new citation styles
  • Improving API reliability and fallback strategies
  • Adding new data sources (e.g., Springer, Elsevier APIs)
  • Enhancing journal abbreviation database
  • Writing additional tests
  • Improving documentation
  • Fixing bugs and issues

Acknowledgment

Special thanks to:

  • Claude Code for providing valuable assistance in the development of this project
  • All contributors to the open-source libraries used in this project (bibtexparser, requests, BeautifulSoup, etc.)
  • The maintainers of IEEE Xplore, ACM Digital Library, arXiv, CrossRef, Semantic Scholar, and DBLP for their excellent APIs

License

GNU General Public License v3.0

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.


Built with โค๏ธ for researchers, academics, and anyone managing BibTeX bibliographies

For issues, questions, or suggestions, please check the documentation or open an issue on GitHub.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

awesome_citations-0.2.0.tar.gz (99.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

awesome_citations-0.2.0-py3-none-any.whl (74.8 kB view details)

Uploaded Python 3

File details

Details for the file awesome_citations-0.2.0.tar.gz.

File metadata

  • Download URL: awesome_citations-0.2.0.tar.gz
  • Upload date:
  • Size: 99.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for awesome_citations-0.2.0.tar.gz
Algorithm Hash digest
SHA256 3a1b95aa306f342a26ebcc35f3902047553639d09b1f2209ac8a0eade5f88e46
MD5 36e6fd122d3e48463ae7839696eabe42
BLAKE2b-256 0fc80f3e11b31137afc9b8c429a04f00e655121aae7685500479609873767ad9

See more details on using hashes here.

File details

Details for the file awesome_citations-0.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for awesome_citations-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5caa080934ec1e7af317b7bd8fdc184180b3ac4e325260b48680ca5856e05739
MD5 52895b11f02a368179803cbaa1cb78cb
BLAKE2b-256 cd77e13ece19bb4ff26488031ced6fcfb39e42286af3818062e646fa39e7ee57

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page