Discover and curate scholarly citations of datasets and software

Development Status
- 3 - Alpha
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Programming Language
Topic
- Scientific/Engineering

Project description

citations-collector

Discover and curate scholarly citations of datasets and software.

Features

Citation Discovery: Query CrossRef, OpenCitations, DataCite for citing papers
Hierarchical Collections: Organize citations by project/version (e.g., DANDI dandisets)
Git-Friendly: YAML collections + TSV citation records for version control
Curation Workflow: Mark citations as ignored, merge preprints with published versions
PDF Acquisition: Automatically download open-access PDFs via Unpaywall with optional git-annex tracking
Merge Detection: Auto-detect preprints that have published versions, using CrossRef relationships
Zotero Integration: Sync citations to hierarchical Zotero collections, automatically relocating merged items
Incremental Updates: Efficiently discover only new citations since last run

Installation

# Using uv (recommended)
uv venv
source .venv/bin/activate
uv pip install citations-collector

# Or using pip
pip install citations-collector

Quick Start

1. Create a Collection

Create collection.yaml:

name: My Research Tools
description: Software tools used in our lab
items:
  - item_id: my-tool
    name: "My Analysis Tool"
    flavors:
      - flavor_id: "1.0.0"
        refs:
          - ref_type: doi
            ref_value: "10.5281/zenodo.1234567"
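
The nesting above (items, then flavors, then refs) can be walked with a few lines of Python. This is a minimal sketch, not the package's own loader: the dict is a hand-written mirror of the YAML, and iter_refs is a hypothetical helper name used only for illustration.

```python
# Plain-dict mirror of the collection.yaml above, i.e. roughly what a YAML
# parser would hand back. Written by hand here to avoid extra dependencies.
collection = {
    "name": "My Research Tools",
    "description": "Software tools used in our lab",
    "items": [
        {
            "item_id": "my-tool",
            "name": "My Analysis Tool",
            "flavors": [
                {
                    "flavor_id": "1.0.0",
                    "refs": [
                        {"ref_type": "doi", "ref_value": "10.5281/zenodo.1234567"}
                    ],
                }
            ],
        }
    ],
}


def iter_refs(collection):
    """Yield (item_id, flavor_id, ref_type, ref_value) for every reference.

    Hypothetical helper: it only illustrates the item -> flavor -> ref nesting.
    """
    for item in collection.get("items", []):
        for flavor in item.get("flavors", []):
            for ref in flavor.get("refs", []):
                yield (
                    item["item_id"],
                    flavor["flavor_id"],
                    ref["ref_type"],
                    ref["ref_value"],
                )


for row in iter_refs(collection):
    print(row)
```

Each yielded tuple identifies one reference that citation discovery would be run against.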

2. Discover Citations

# Discover citations for all items in collection
citations-collector discover collection.yaml --output citations.tsv

# Use CrossRef polite pool (better rate limits)
citations-collector discover collection.yaml --email your@email.org

3. View Results

Citations are saved to citations.tsv, a tab-separated file that you can open in a spreadsheet application such as Excel or edit by hand for curation.
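
Because the file is plain TSV, it can also be read with Python's standard csv module. The sketch below is illustrative only: citation_status and citation_merged_into are the column names used later in this README, while the citation_doi column, the "active" and "ignored" status values, and the sample DOIs are assumptions made up for the example.

```python
import csv
import io

# Hypothetical sample rows; a real citations.tsv will have more columns.
SAMPLE_TSV = (
    "citation_doi\tcitation_status\tcitation_merged_into\n"
    "10.1000/example.paper\tactive\t\n"
    "10.1000/example.preprint\tmerged\t10.1000/example.paper\n"
    "10.1000/example.unrelated\tignored\t\n"
)


def load_rows(handle):
    """Read tab-separated citation records into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def active_rows(rows):
    """Keep rows that are neither merged nor ignored (assumed status values)."""
    return [r for r in rows if r.get("citation_status") not in ("merged", "ignored")]


rows = load_rows(io.StringIO(SAMPLE_TSV))
for row in active_rows(rows):
    print(row["citation_doi"])
```

To read a real file, pass open("citations.tsv", newline="", encoding="utf-8") instead of the in-memory sample.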

Advanced Workflows

PDF Acquisition

Automatically download open-access PDFs using Unpaywall:

# Fetch PDFs for discovered citations
citations-collector fetch-pdfs --config collection.yaml

# Use git-annex for provenance tracking
citations-collector fetch-pdfs --config collection.yaml --git-annex

# Dry run to see what would be downloaded
citations-collector fetch-pdfs --config collection.yaml --dry-run

PDFs are stored at pdfs/{doi}/article.pdf with accompanying article.bib BibTeX files.
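
As a rough illustration of that layout, the sketch below builds both paths for a DOI. It assumes the DOI is used verbatim as the directory name, so the slash inside the DOI produces a nested directory; whether the tool escapes or normalizes DOIs is not stated here. pdf_paths is a hypothetical helper, not part of the package.

```python
from pathlib import PurePosixPath


def pdf_paths(doi, output_dir="pdfs"):
    """Return (pdf_path, bib_path) following the pdfs/{doi}/ layout.

    Assumption: the DOI is inserted as-is, so "10.5281/zenodo.1234567"
    becomes two nested directory levels.
    """
    base = PurePosixPath(output_dir) / doi
    return base / "article.pdf", base / "article.bib"


pdf, bib = pdf_paths("10.5281/zenodo.1234567")
print(pdf)  # pdfs/10.5281/zenodo.1234567/article.pdf
print(bib)  # pdfs/10.5281/zenodo.1234567/article.bib
```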

Merge Detection

Detect preprints that have published versions:

# Detect merges via CrossRef relationships
citations-collector detect-merges --config collection.yaml

# Also run fuzzy title matching (use with caution)
citations-collector detect-merges --config collection.yaml --fuzzy-match

# Preview without updating
citations-collector detect-merges --config collection.yaml --dry-run

Detected preprints are marked with citation_status=merged and citation_merged_into={published_doi}.
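
To see why fuzzy title matching deserves caution, here is a small sketch using difflib from the standard library. It is not the algorithm used by detect-merges, whose matching method is not described here; the function names and the 0.9 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.lower().split())


def title_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def looks_like_same_work(a, b, threshold=0.9):
    """Hypothetical rule: treat titles above the threshold as the same work."""
    return title_similarity(a, b) >= threshold


# Same work, different capitalization and spacing: matched as expected.
print(looks_like_same_work("Neural Recordings in  Mice", "neural recordings in mice"))

# Two distinct papers whose titles differ by one character: also matched,
# which is the kind of false positive that makes manual review necessary.
print(looks_like_same_work(
    "Neural recordings in mice: part I",
    "Neural recordings in mice: part II",
))
```

Relationship metadata from CrossRef does not have this failure mode, which is a reason to treat fuzzy matches as candidates to review rather than confirmed merges.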

Zotero Sync

Sync citations to Zotero for collaborative browsing:

# Sync to Zotero (requires API key in config or env)
citations-collector sync-zotero --config collection.yaml

# Dry run to preview structure
citations-collector sync-zotero --config collection.yaml --dry-run

Zotero hierarchy:

Top Collection/
  ├── {item_id}/
  │   ├── {flavor}/
  │   │   ├── <active citations>
  │   │   └── Merged/
  │   │       └── <preprints and old versions>
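
The placement rule implied by this tree can be sketched as a small function. This is an illustration under assumptions, not the sync code itself: zotero_collection_path is a hypothetical name, and the check against citation_status == "merged" mirrors the status value described in the Merge Detection section.

```python
def zotero_collection_path(item_id, flavor_id, citation_status, top="Top Collection"):
    """Return the list of nested collection names a citation would land in.

    Assumption: merged citations go into a "Merged" subcollection under
    their flavor, and everything else sits directly in the flavor collection.
    """
    path = [top, item_id, flavor_id]
    if citation_status == "merged":
        path.append("Merged")
    return path


print("/".join(zotero_collection_path("my-tool", "1.0.0", "active")))
print("/".join(zotero_collection_path("my-tool", "1.0.0", "merged")))
```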

Unified Configuration

Create a unified collection.yaml with all settings:

name: My Research Collection
description: Tools and datasets from our lab

# Source items to track
source:
  items:
    - item_id: dandi-000055
      name: "AJILE12: Long-term naturalistic human intracranial neural recordings"
      flavors:
        - flavor_id: "0.220113.0400"
          refs:
            - ref_type: doi
              ref_value: "10.48324/dandi.000055/0.220113.0400"

# Citation discovery settings
discover:
  sources:
    - crossref
    - opencitations
  email: your@email.org  # For CrossRef polite pool
  incremental: true

# PDF acquisition settings (optional)
pdfs:
  output_dir: pdfs/
  unpaywall_email: your@email.org
  git_annex: false

# Zotero sync settings (optional)
zotero:
  library_type: group
  library_id: "12345"
  api_key: "YOUR_API_KEY"  # Or set ZOTERO_API_KEY env var
  top_collection_key: "ABCD1234"
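
Since the collection file is meant to live in version control, it is usually safer to leave the API key out of it and rely on the environment variable. The sketch below shows one way such a lookup could work. The precedence (config value first, then ZOTERO_API_KEY) and the helper name are assumptions; the tool's actual resolution order is not documented here.

```python
import os


def resolve_zotero_api_key(config, environ=None):
    """Return the Zotero API key from the config dict or the environment.

    Assumptions: a real key in the config wins, and the literal placeholder
    "YOUR_API_KEY" is treated as "not set".
    """
    environ = os.environ if environ is None else environ
    key = config.get("zotero", {}).get("api_key")
    if key and key != "YOUR_API_KEY":
        return key
    return environ.get("ZOTERO_API_KEY")


config = {"zotero": {"library_type": "group", "library_id": "12345"}}
print(resolve_zotero_api_key(config, environ={"ZOTERO_API_KEY": "from-env"}))
```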

Then run the full workflow:

# 1. Discover citations
citations-collector discover collection.yaml

# 2. Fetch open-access PDFs
citations-collector fetch-pdfs --config collection.yaml

# 3. Detect merged preprints
citations-collector detect-merges --config collection.yaml

# 4. Sync to Zotero
citations-collector sync-zotero --config collection.yaml
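
The four steps above can also be driven from a short Python script, for example in a scheduled job. This is a minimal sketch that only shells out to the commands shown above; workflow_commands, run_workflow, and the print_only switch are hypothetical names introduced for the example.

```python
import shlex
import subprocess


def workflow_commands(config="collection.yaml"):
    """Return the four CLI invocations from this README, in order."""
    return [
        ["citations-collector", "discover", config],
        ["citations-collector", "fetch-pdfs", "--config", config],
        ["citations-collector", "detect-merges", "--config", config],
        ["citations-collector", "sync-zotero", "--config", config],
    ]


def run_workflow(config="collection.yaml", print_only=True):
    """Print each command, or execute them and stop at the first failure."""
    for cmd in workflow_commands(config):
        if print_only:
            print(shlex.join(cmd))
        else:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps do not run on top of a failed one.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_workflow()
```

Calling run_workflow(print_only=False) executes the steps for real and requires citations-collector to be installed and on PATH.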

Library Usage

from citations_collector import CitationCollector

# Load collection
collector = CitationCollector.from_yaml("collection.yaml")

# Discover citations (incremental by default)
collector.discover_all(incremental=True, email="your@email.org")

# Save results
collector.save("collection.yaml", "citations.tsv")

Examples

See the examples/ directory for:

dandi-collection.yaml: DANDI Archive dandisets with versioned DOIs
repronim-tools.yaml: ReproNim neuroimaging tools with RRIDs
simple-resources.yaml: Basic collection without versioning
citations-example.tsv: Example citation records with curation

Development

Setup

# Clone repository
git clone https://github.com/dandi/citations-collector.git
cd citations-collector

# Setup development environment
uv venv
source .venv/bin/activate
uv pip install -e ".[devel]"

Running Tests

# Run all tests, linting, and type checking
tox

# Run specific environment
tox -e py312      # Tests on Python 3.12
tox -e lint       # Ruff linting
tox -e type       # Mypy type checking
tox -e cov        # Coverage report

Regenerating LinkML Models

When schema/citations.yaml changes:

# Install linkml tools
uv pip install -e ".[linkml]"

# Regenerate Pydantic models
gen-pydantic schema/citations.yaml > src/citations_collector/models/generated.py

# Regenerate JSON Schema
gen-json-schema schema/citations.yaml > schema/citations.schema.json

# Commit generated files
git add src/citations_collector/models/generated.py schema/citations.schema.json
git commit -m "Regenerate LinkML models"

Architecture

Library-First Design: All functionality accessible programmatically
LinkML Schema: Validated data models from schema/citations.yaml
Modular Structure:
- discovery/: Citation API clients (CrossRef, OpenCitations, DataCite)
- persistence/: YAML/TSV I/O
- importers/: DANDI API, Zenodo, GitHub integrations
- unpaywall.py: Unpaywall API client for OA PDF URLs
- pdf.py: PDF acquisition with git-annex support
- merge_detection.py: Preprint/published version detection
- zotero_sync.py: Zotero hierarchical sync with merged item handling
- core.py: Main orchestration API
- cli.py: Click-based CLI (thin wrapper)

Citation Sources

CrossRef: Most comprehensive, best for DOI citations
OpenCitations: Open index, may lag behind CrossRef
DataCite: Good for dataset citations
Europe PMC: PubMed-indexed papers (planned, not yet supported)
Semantic Scholar: AI-powered citation discovery (planned, not yet supported)
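
When several sources are queried, the same citing paper can come back more than once, sometimes with the DOI written differently. The sketch below shows one way to combine results by normalized DOI. It is not the package's merging logic; normalize_doi, merge_sources, and the sample DOIs are hypothetical and exist only for this example.

```python
def normalize_doi(doi):
    """Lowercase a DOI and strip common URL or "doi:" prefixes.

    DOIs are case-insensitive, so lowercasing is safe for comparison.
    """
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def merge_sources(results_by_source):
    """Map each normalized citing DOI to the list of sources that reported it."""
    merged = {}
    for source, dois in results_by_source.items():
        for doi in dois:
            sources = merged.setdefault(normalize_doi(doi), [])
            if source not in sources:
                sources.append(source)
    return merged


results = {
    "crossref": ["10.1000/ABC"],
    "opencitations": ["https://doi.org/10.1000/abc", "10.1000/xyz"],
}
for doi, sources in merge_sources(results).items():
    print(doi, sources)
```

Keeping the list of reporting sources per DOI makes it easy to see which citations only one index knows about.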

License

MIT License. See the LICENSE file for details.

Contributing

See CONSTITUTION.md for:

Code standards (Ruff, mypy, type hints)
Testing requirements (pytest, 100 lines max, mock HTTP)
Architecture principles (library-first, reliability, simplicity)

Pull requests welcome!

Citation

If you use citations-collector in your research, please cite:

@software{citations_collector,
  title = {citations-collector: Discover and curate scholarly citations},
  author = {{DANDI Team}},
  url = {https://github.com/dandi/citations-collector},
  license = {MIT}
}

Release history

0.3.0: Apr 13, 2026
0.2.4: Feb 2, 2026
0.2.3: Jan 31, 2026
0.2.2: Jan 30, 2026 (this version)
0.2.1: Jan 30, 2026

Download files

Source Distribution

citations_collector-0.2.2.tar.gz (76.3 kB)

Uploaded Jan 30, 2026 Source

Built Distribution

citations_collector-0.2.2-py3-none-any.whl (54.9 kB)

Uploaded Jan 30, 2026 Python 3

File details

Details for the file citations_collector-0.2.2.tar.gz.

File metadata

Download URL: citations_collector-0.2.2.tar.gz
Upload date: Jan 30, 2026
Size: 76.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.12

File hashes

Hashes for citations_collector-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`31c51354b2ea79f5f406bdcb6a385a42073424acd862493435827be23186fca0`
MD5	`f24cfdc9c152982a32105b0798a0ce05`
BLAKE2b-256	`530d254d81ca6bd2fcbb646399ca4e202f7ee2e369356f948bdf2062c71484bc`

File details

Details for the file citations_collector-0.2.2-py3-none-any.whl.

File metadata

Download URL: citations_collector-0.2.2-py3-none-any.whl
Upload date: Jan 30, 2026
Size: 54.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.12

File hashes

Hashes for citations_collector-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`eeb87fa6ec1686aca64aba11277e7f59021d708955790f82af68facc85551176`
MD5	`dfa0b7bbd8bb03e874c72ad6ff86ad0c`
BLAKE2b-256	`2e29d24c0a9a11a456388b211129ba2ea79c7d079e75f8152d2ec6ab1af0c578`
