citations-collector

Discover and curate scholarly citations of datasets and software.

Features

  • Citation Discovery: Query CrossRef, OpenCitations, and DataCite for citing papers
  • Hierarchical Collections: Organize citations by project/version (e.g., DANDI dandisets)
  • Git-Friendly: YAML collections + TSV citation records for version control
  • Curation Workflow: Mark citations as ignored, merge preprints with published versions
  • PDF Acquisition: Automatically download open-access PDFs via Unpaywall with optional git-annex tracking
  • Merge Detection: Auto-detect preprints with published versions using CrossRef relationships
  • Zotero Integration: Sync citations to hierarchical Zotero collections with automatic merged item relocation
  • Incremental Updates: Efficiently discover only new citations since last run

Installation

# Using uv (recommended)
uv venv
source .venv/bin/activate
uv pip install citations-collector

# Or using pip
pip install citations-collector

Quick Start

1. Create a Collection

Create collection.yaml:

name: My Research Tools
description: Software tools used in our lab
items:
  - item_id: my-tool
    name: "My Analysis Tool"
    flavors:
      - flavor_id: "1.0.0"
        refs:
          - ref_type: doi
            ref_value: "10.5281/zenodo.1234567"
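
To make the nesting explicit (items contain flavors, flavors contain refs), here is a minimal sketch that walks the same structure as a plain Python dict and lists every DOI it tracks. The helper name `iter_dois` is hypothetical and is not part of the citations-collector API; the field names are taken from the YAML above.

```python
from typing import Any, Iterator


def iter_dois(collection: dict[str, Any]) -> Iterator[tuple[str, str, str]]:
    """Yield (item_id, flavor_id, doi) for every DOI reference in a collection."""
    for item in collection.get("items", []):
        for flavor in item.get("flavors", []):
            for ref in flavor.get("refs", []):
                if ref.get("ref_type") == "doi":
                    yield item["item_id"], flavor["flavor_id"], ref["ref_value"]


# The same content as the collection.yaml above, written as a Python dict.
collection = {
    "name": "My Research Tools",
    "description": "Software tools used in our lab",
    "items": [
        {
            "item_id": "my-tool",
            "name": "My Analysis Tool",
            "flavors": [
                {
                    "flavor_id": "1.0.0",
                    "refs": [
                        {"ref_type": "doi", "ref_value": "10.5281/zenodo.1234567"}
                    ],
                }
            ],
        }
    ],
}

for item_id, flavor_id, doi in iter_dois(collection):
    print(item_id, flavor_id, doi)
```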

2. Discover Citations

# Discover citations for all items in collection
citations-collector discover collection.yaml --output citations.tsv

# Use CrossRef polite pool (better rate limits)
citations-collector discover collection.yaml --email your@email.org
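
For background, CrossRef routes requests that identify the caller with a `mailto` value to its polite pool. The sketch below only builds such a request URL and sends nothing; the helper name `polite_works_url` is hypothetical, and whether citations-collector passes the address as a query parameter or in the User-Agent header is not stated here.

```python
from urllib.parse import quote, urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def polite_works_url(doi: str, email: str) -> str:
    """Build a CrossRef works URL that identifies the caller via `mailto`."""
    query = urlencode({"mailto": email})
    # quote() keeps the "/" inside the DOI and escapes other unsafe characters.
    return f"{CROSSREF_WORKS}/{quote(doi)}?{query}"


print(polite_works_url("10.5281/zenodo.1234567", "your@email.org"))
```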

3. View Results

Citations are saved to citations.tsv, a tab-separated file that you can open in a spreadsheet application such as Excel or edit by hand during curation.
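
Because the file is plain TSV, it can also be inspected with the standard library. The following is a minimal sketch that counts records per curation status. The sample rows and the `citation_doi` column are made up for illustration, and the status values `active` and `ignored` are assumptions; only `citation_status` is a column name mentioned elsewhere in this README.

```python
import csv
import io
from collections import Counter

# Made-up sample data standing in for a real citations.tsv.
SAMPLE_TSV = (
    "citation_doi\tcitation_status\n"
    "10.1234/example.1\tactive\n"
    "10.1234/example.2\tignored\n"
    "10.1234/example.3\tactive\n"
)


def count_by_status(handle) -> Counter:
    """Count rows per citation_status in a tab-separated file object."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["citation_status"] for row in reader)


print(count_by_status(io.StringIO(SAMPLE_TSV)))
```

For a file on disk, pass `open("citations.tsv", newline="", encoding="utf-8")` instead of the `StringIO` object.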

Advanced Workflows

PDF Acquisition

Automatically download open-access PDFs using Unpaywall:

# Fetch PDFs for discovered citations
citations-collector fetch-pdfs --config collection.yaml

# Use git-annex for provenance tracking
citations-collector fetch-pdfs --config collection.yaml --git-annex

# Dry run to see what would be downloaded
citations-collector fetch-pdfs --config collection.yaml --dry-run

PDFs are stored at pdfs/{doi}/article.pdf with accompanying article.bib BibTeX files.
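
As an illustration of that layout, the sketch below derives both file locations from a DOI. It assumes the DOI is used verbatim as the directory name, so the "/" inside a DOI produces a nested directory; the tool may escape DOIs differently. The helper name `pdf_locations` is hypothetical.

```python
from pathlib import PurePosixPath


def pdf_locations(doi: str, output_dir: str = "pdfs") -> tuple[PurePosixPath, PurePosixPath]:
    """Return the (PDF, BibTeX) paths for a DOI under the pdfs/{doi}/ layout."""
    base = PurePosixPath(output_dir) / doi
    return base / "article.pdf", base / "article.bib"


pdf_path, bib_path = pdf_locations("10.5281/zenodo.1234567")
print(pdf_path)
print(bib_path)
```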

Merge Detection

Detect preprints that have published versions:

# Detect merges via CrossRef relationships
citations-collector detect-merges --config collection.yaml

# Also run fuzzy title matching (use with caution)
citations-collector detect-merges --config collection.yaml --fuzzy-match

# Preview without updating
citations-collector detect-merges --config collection.yaml --dry-run

Detected preprints are marked with citation_status=merged and citation_merged_into={published_doi}.
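
To see why fuzzy title matching calls for caution, consider this small sketch based on `difflib.SequenceMatcher` from the standard library. It is not the matching algorithm used by citations-collector, which is not described here, and the titles are invented. A preprint and its published version score 1.0 after normalization, but a distinct follow-up paper with a similar title also scores high.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


# Invented titles for illustration only.
preprint = "A New Method for Spike Sorting."
published = "A new method for spike sorting"
follow_up = "A new method for spike sorting: Part II"

print(title_similarity(preprint, published))  # same paper, formatting differs
print(title_similarity(preprint, follow_up))  # different paper, still a high score
```

Any fixed threshold trades missed merges against false ones, so reviewing fuzzy matches with --dry-run before accepting them is a reasonable habit.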

Zotero Sync

Sync citations to Zotero for collaborative browsing:

# Sync to Zotero (requires API key in config or env)
citations-collector sync-zotero --config collection.yaml

# Dry run to preview structure
citations-collector sync-zotero --config collection.yaml --dry-run

Zotero hierarchy:

Top Collection/
  ├── {item_id}/
  │   ├── {flavor}/
  │   │   ├── <active citations>
  │   │   └── Merged/
  │   │       └── <preprints and old versions>
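
The hierarchy can be read as a rule mapping each citation to a collection path. Here is a minimal sketch of that rule; the function name and arguments are hypothetical and do not come from zotero_sync.py.

```python
def zotero_collection_path(top: str, item_id: str, flavor: str, merged: bool = False) -> list[str]:
    """Return the nested collection names under which a citation would be filed."""
    path = [top, item_id, flavor]
    if merged:
        # Preprints and old versions go into a "Merged" subcollection of the flavor.
        path.append("Merged")
    return path


print("/".join(zotero_collection_path("Top Collection", "my-tool", "1.0.0")))
print("/".join(zotero_collection_path("Top Collection", "my-tool", "1.0.0", merged=True)))
```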

Unified Configuration

Create a single collection.yaml that holds the source items together with the discovery, PDF, and Zotero settings:

name: My Research Collection
description: Tools and datasets from our lab

# Source items to track
source:
  items:
    - item_id: dandi-000055
      name: "AJILE12: Long-term naturalistic human intracranial neural recordings"
      flavors:
        - flavor_id: "0.220113.0400"
          refs:
            - ref_type: doi
              ref_value: "10.48324/dandi.000055/0.220113.0400"

# Citation discovery settings
discover:
  sources:
    - crossref
    - opencitations
  email: your@email.org  # For CrossRef polite pool
  incremental: true

# PDF acquisition settings (optional)
pdfs:
  output_dir: pdfs/
  unpaywall_email: your@email.org
  git_annex: false

# Zotero sync settings (optional)
zotero:
  library_type: group
  library_id: "12345"
  api_key: "YOUR_API_KEY"  # Or set ZOTERO_API_KEY env var
  top_collection_key: "ABCD1234"
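
The Zotero API key can come from either the config file or the ZOTERO_API_KEY environment variable. The sketch below shows one plausible lookup order, with the config value winning over the environment. That precedence, the handling of the "YOUR_API_KEY" placeholder as unset, and the helper name are all assumptions rather than documented behavior.

```python
import os
from typing import Any, Mapping, Optional


def resolve_zotero_api_key(
    config: Mapping[str, Any], env: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Return the API key from the config, falling back to ZOTERO_API_KEY."""
    key = config.get("zotero", {}).get("api_key")
    # Assumption: the placeholder from the example config counts as "not set".
    if key and key != "YOUR_API_KEY":
        return key
    return env.get("ZOTERO_API_KEY")


config = {"zotero": {"library_type": "group", "library_id": "12345", "api_key": "YOUR_API_KEY"}}
print(resolve_zotero_api_key(config, env={"ZOTERO_API_KEY": "key-from-env"}))
```

Keeping the key in the environment avoids committing a secret alongside a version-controlled collection.yaml.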

Then run the full workflow:

# 1. Discover citations
citations-collector discover collection.yaml

# 2. Fetch open-access PDFs
citations-collector fetch-pdfs --config collection.yaml

# 3. Detect merged preprints
citations-collector detect-merges --config collection.yaml

# 4. Sync to Zotero
citations-collector sync-zotero --config collection.yaml
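
If you prefer to drive the four steps from a script, for example in a scheduled job, one option is to call the CLI through `subprocess`. This is a minimal sketch that uses only the commands shown above; the function names are hypothetical, and `run_workflow` requires citations-collector to be installed and on the PATH.

```python
import subprocess


def workflow_commands(config: str = "collection.yaml") -> list[list[str]]:
    """Return the four CLI invocations of the full workflow, in order."""
    return [
        ["citations-collector", "discover", config],
        ["citations-collector", "fetch-pdfs", "--config", config],
        ["citations-collector", "detect-merges", "--config", config],
        ["citations-collector", "sync-zotero", "--config", config],
    ]


def run_workflow(config: str = "collection.yaml") -> None:
    """Run each step in turn, stopping at the first command that fails."""
    for command in workflow_commands(config):
        subprocess.run(command, check=True)


for command in workflow_commands():
    print(" ".join(command))
```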

Library Usage

from citations_collector import CitationCollector

# Load collection
collector = CitationCollector.from_yaml("collection.yaml")

# Discover citations (incremental by default)
collector.discover_all(incremental=True, email="your@email.org")

# Save results
collector.save("collection.yaml", "citations.tsv")

Examples

See the examples/ directory for:

  • dandi-collection.yaml: DANDI Archive dandisets with versioned DOIs
  • repronim-tools.yaml: ReproNim neuroimaging tools with RRIDs
  • simple-resources.yaml: Basic collection without versioning
  • citations-example.tsv: Example citation records with curation

Development

Setup

# Clone repository
git clone https://github.com/dandi/citations-collector.git
cd citations-collector

# Setup development environment
uv venv
source .venv/bin/activate
uv pip install -e ".[devel]"

Running Tests

# Run all tests, linting, and type checking
tox

# Run specific environment
tox -e py312      # Tests on Python 3.12
tox -e lint       # Ruff linting
tox -e type       # Mypy type checking
tox -e cov        # Coverage report

Regenerating LinkML Models

When schema/citations.yaml changes:

# Install linkml tools
uv pip install -e ".[linkml]"

# Regenerate Pydantic models
gen-pydantic schema/citations.yaml > src/citations_collector/models/generated.py

# Regenerate JSON Schema
gen-json-schema schema/citations.yaml > schema/citations.schema.json

# Commit generated files
git add src/citations_collector/models/generated.py schema/citations.schema.json
git commit -m "Regenerate LinkML models"

Architecture

  • Library-First Design: All functionality accessible programmatically
  • LinkML Schema: Validated data models from schema/citations.yaml
  • Modular Structure:
    • discovery/: Citation API clients (CrossRef, OpenCitations, DataCite)
    • persistence/: YAML/TSV I/O
    • importers/: DANDI API, Zenodo, GitHub integrations
    • unpaywall.py: Unpaywall API client for OA PDF URLs
    • pdf.py: PDF acquisition with git-annex support
    • merge_detection.py: Preprint/published version detection
    • zotero_sync.py: Zotero hierarchical sync with merged item handling
    • core.py: Main orchestration API
    • cli.py: Click-based CLI (thin wrapper)

Citation Sources

  • CrossRef: Most comprehensive, best for DOI citations
  • OpenCitations: Open index, may lag behind CrossRef
  • DataCite: Good for dataset citations
  • Europe PMC: PubMed-indexed papers (future)
  • Semantic Scholar: AI-powered citation discovery (future)
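
Because the sources overlap and can lag behind one another, results from several of them usually need to be combined. The sketch below merges citing DOIs per source and records which sources reported each one. It relies on DOIs being case-insensitive; the function names and the sample DOIs are made up and do not reflect the internals of the discovery/ module.

```python
def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common resolver or scheme prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def merge_sources(results: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each normalized citing DOI to the set of sources that reported it."""
    merged: dict[str, set[str]] = {}
    for source, dois in results.items():
        for doi in dois:
            merged.setdefault(normalize_doi(doi), set()).add(source)
    return merged


# Made-up citing DOIs for illustration.
results = {
    "crossref": ["10.1234/ABC", "10.1234/def"],
    "opencitations": ["https://doi.org/10.1234/abc"],
}
for doi, sources in sorted(merge_sources(results).items()):
    print(doi, sorted(sources))
```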

License

MIT License - see LICENSE file for details.

Contributing

See CONSTITUTION.md for:

  • Code standards (Ruff, mypy, type hints)
  • Testing requirements (pytest, 100 lines max, mock HTTP)
  • Architecture principles (library-first, reliability, simplicity)

Pull requests welcome!

Citation

If you use citations-collector in your research, please cite:

@software{citations_collector,
  title = {citations-collector: Discover and curate scholarly citations},
  author = {{DANDI Team}},
  url = {https://github.com/dandi/citations-collector},
  license = {MIT}
}
