
Local CrossRef database with 167M+ works and full-text search


CrossRef Local

Local CrossRef database with 167M+ scholarly works, full-text search, and impact factor calculation.


MCP Demo Video

Live demonstration of MCP server integration with Claude Code, used for a literature review on epileptic seizure prediction:

  • Full-text search on titles, abstracts, and keywords across 167M papers (22ms response)

📄 Full demo documentation | 📊 Generated diagrams

Why CrossRef Local?

Built for the LLM era - features that matter for AI research assistants:

| Feature | Benefit |
|---|---|
| 📝 Abstracts | Full text for semantic understanding |
| 📊 Impact Factor | Filter by journal quality |
| 🔗 Citations | Prioritize influential papers |
| Speed | 167M records in ms, no rate limits |

Perfect for: RAG systems, research assistants, literature review automation.

Installation

pip install crossref-local

From source:

git clone https://github.com/ywatanabe1989/crossref-local
cd crossref-local && make install

Database setup (1.5 TB, ~2 weeks to build):

# 1. Download CrossRef data (~100GB compressed)
aria2c "https://academictorrents.com/details/..."

# 2. Build SQLite database (~days)
pip install dois2sqlite
dois2sqlite build /path/to/crossref-data ./data/crossref.db

# 3. Build FTS5 index (~60 hours) & citations table (~days)
make fts-build-screen
make citations-build-screen
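
For readers unfamiliar with FTS5, here is a minimal sketch of the kind of full-text index that step 3 builds, using Python's built-in sqlite3 module on an in-memory database. The table name `works_fts`, its columns, the helper names, and the sample rows (including the placeholder DOIs) are illustrative assumptions, not the schema crossref-local actually uses. It also assumes the local SQLite build includes the FTS5 extension.

```python
import sqlite3

def build_demo_index(rows):
    """Create an in-memory FTS5 table and load (doi, title, abstract) rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: the DOI is stored but not tokenized for search.
    conn.execute(
        "CREATE VIRTUAL TABLE works_fts USING fts5(doi UNINDEXED, title, abstract)"
    )
    conn.executemany(
        "INSERT INTO works_fts (doi, title, abstract) VALUES (?, ?, ?)", rows
    )
    return conn

def fts_search(conn, query, limit=10):
    """Return DOIs of rows matching an FTS5 query, best match first."""
    cur = conn.execute(
        "SELECT doi FROM works_fts WHERE works_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in cur]

# Made-up records with placeholder DOIs, for illustration only.
demo_rows = [
    ("10.0000/demo.1", "Sharp wave ripples in the hippocampus",
     "Made-up abstract about memory replay."),
    ("10.0000/demo.2", "Deep learning for image segmentation",
     "Made-up abstract about neural networks."),
]
conn = build_demo_index(demo_rows)
print(fts_search(conn, "hippocampus ripples"))
```

Multiple bare terms in an FTS5 query are combined with an implicit AND, so only rows containing every term are returned.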

Python API

from crossref_local import search, get, count

# Full-text search (22ms for 541 matches across 167M records)
results = search("hippocampal sharp wave ripples")
for work in results:
    print(f"{work.title} ({work.year})")

# Get by DOI
work = get("10.1126/science.aax0758")
print(work.citation())

# Count matches
n = count("machine learning")  # 477,922 matches

Async API:

import asyncio

from crossref_local import aio

async def main():
    counts = await aio.count_many(["CRISPR", "neural network", "climate"])
    results = await aio.search("machine learning")

asyncio.run(main())
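
A batch helper such as `count_many` is commonly built by running several queries concurrently. The sketch below shows that general pattern with `asyncio.gather` and `asyncio.to_thread` (Python 3.9+). The `blocking_count` function, its toy corpus, and the dict return shape are hypothetical stand-ins; nothing here is taken from the package's implementation.

```python
import asyncio

# Toy corpus standing in for a real database (illustrative only).
TOY_TITLES = [
    "CRISPR genome editing in human cells",
    "A neural network for climate forecasting",
    "Climate change and crop yields",
]

def blocking_count(term: str) -> int:
    """Hypothetical blocking query: count titles containing the term."""
    needle = term.lower()
    return sum(needle in title.lower() for title in TOY_TITLES)

async def count_many_sketch(terms: list[str]) -> dict[str, int]:
    """Run one blocking count per term in worker threads, concurrently."""
    counts = await asyncio.gather(
        *(asyncio.to_thread(blocking_count, term) for term in terms)
    )
    return dict(zip(terms, counts))

result = asyncio.run(count_many_sketch(["CRISPR", "neural network", "climate"]))
print(result)
```

Offloading each blocking call to a thread keeps the event loop responsive while the queries run.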

CLI

crossref-local search "CRISPR genome editing" -n 5
crossref-local search-by-doi 10.1038/nature12373
crossref-local status  # Configuration and database stats

With abstracts (-a flag):

$ crossref-local search "RS-1 enhances CRISPR" -n 1 -a

Found 4 matches in 128.4ms

1. RS-1 enhances CRISPR/Cas9- and TALEN-mediated knock-in efficiency (2016)
   DOI: 10.1038/ncomms10548
   Journal: Nature Communications
   Abstract: Zinc-finger nuclease, transcription activator-like effector nuclease
   and CRISPR/Cas9 are becoming major tools for genome editing...

HTTP API

Start the FastAPI server:

crossref-local relay --host 0.0.0.0 --port 31291

Endpoints:

# Search works (FTS5)
curl "http://localhost:31291/works?q=CRISPR&limit=10"

# Get by DOI
curl "http://localhost:31291/works/10.1038/nature12373"

# Batch DOI lookup
curl -X POST "http://localhost:31291/works/batch" \
  -H "Content-Type: application/json" \
  -d '{"dois": ["10.1038/nature12373", "10.1126/science.aax0758"]}'

# Citation endpoints
curl "http://localhost:31291/citations/10.1038/nature12373/citing"
curl "http://localhost:31291/citations/10.1038/nature12373/cited"
curl "http://localhost:31291/citations/10.1038/nature12373/count"

# Collection endpoints
curl "http://localhost:31291/collections"
curl -X POST "http://localhost:31291/collections" \
  -H "Content-Type: application/json" \
  -d '{"name": "my_papers", "query": "CRISPR", "limit": 100}'
curl "http://localhost:31291/collections/my_papers/download?format=bibtex"

# Database info
curl "http://localhost:31291/info"
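
Calling these endpoints from Python needs only the standard library. The sketch below builds request URLs for the search and DOI routes listed above. The helper names are hypothetical, and `fetch_json` assumes the server responds with JSON, which the endpoint list does not state explicitly.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:31291"  # port taken from the relay example

def search_url(query: str, limit: int = 10, base: str = BASE_URL) -> str:
    """Build the /works search URL with a percent-encoded query string."""
    return f"{base}/works?{urlencode({'q': query, 'limit': limit})}"

def work_url(doi: str, base: str = BASE_URL) -> str:
    """Build the /works/<doi> URL, keeping the slash inside the DOI."""
    return f"{base}/works/{quote(doi, safe='/')}"

def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

print(search_url("CRISPR genome editing", limit=5))
print(work_url("10.1038/nature12373"))
```

With the server running, `fetch_json(search_url("CRISPR"))` would issue the same request as the first curl example.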

HTTP mode (connect to running server):

# On local machine (if server is remote)
ssh -L 31291:127.0.0.1:31291 your-server

# Python client
from crossref_local import configure_http
configure_http("http://localhost:31291")

# Or via CLI
crossref-local --http search "CRISPR"

MCP Server

Run as MCP (Model Context Protocol) server:

crossref-local mcp start

Local MCP client configuration:

{
  "mcpServers": {
    "crossref-local": {
      "command": "crossref-local",
      "args": ["mcp", "start"],
      "env": {
        "CROSSREF_LOCAL_DB": "/path/to/crossref.db"
      }
    }
  }
}

Remote MCP via HTTP (recommended):

# On server: start persistent MCP server
crossref-local mcp start -t http --host 0.0.0.0 --port 8082

Client configuration:

{
  "mcpServers": {
    "crossref-remote": {
      "url": "http://your-server:8082/mcp"
    }
  }
}

Diagnose setup:

crossref-local mcp doctor        # Check dependencies and database
crossref-local mcp list-tools    # Show available MCP tools
crossref-local mcp installation  # Show client config examples

See docs/remote-deployment.md for systemd and Docker setup.

Available tools:

  • search - Full-text search across 167M+ papers
  • search_by_doi - Get paper by DOI
  • enrich_dois - Add citation counts and references to DOIs
  • status - Database statistics
  • cache_* - Paper collection management

Impact Factor

from crossref_local.impact_factor import ImpactFactorCalculator

with ImpactFactorCalculator() as calc:
    result = calc.calculate_impact_factor("Nature", target_year=2023)
    print(f"IF: {result['impact_factor']:.3f}")  # 54.067

| Journal | IF 2023 |
|---|---|
| Nature | 54.07 |
| Science | 46.17 |
| Cell | 54.01 |
| PLOS ONE | 3.37 |
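
The formula behind `calculate_impact_factor` is not spelled out here. For orientation, the conventional two-year impact factor for year Y divides the citations received in Y by items published in Y-1 and Y-2 by the number of citable items published in those two years. The sketch below computes that ratio; the function name and the input numbers are made up for illustration, and the package may compute its figure differently.

```python
def two_year_impact_factor(citations_in_target_year: int, citable_items: int) -> float:
    """Conventional two-year IF: citations in year Y to items from Y-1 and Y-2,
    divided by the number of citable items published in Y-1 and Y-2."""
    if citable_items <= 0:
        raise ValueError("citable_items must be positive")
    return citations_in_target_year / citable_items

# Made-up inputs, not real journal data.
example_if = two_year_impact_factor(citations_in_target_year=1200, citable_items=400)
print(f"IF: {example_if:.3f}")  # IF: 3.000
```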

Citation Network

from crossref_local import get_citing, get_cited, CitationNetwork

citing = get_citing("10.1038/nature12373")  # 1539 papers
cited = get_cited("10.1038/nature12373")

# Build visualization (like Connected Papers)
network = CitationNetwork("10.1038/nature12373", depth=2)
network.save_html("citation_network.html")  # requires: pip install crossref-local[viz]
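
One plausible reading of `depth=2` is a breadth-first expansion that follows citation links up to two hops from the seed DOI. The sketch below shows that traversal on a toy graph. The `collect_network` helper, the adjacency dict, and the placeholder identifiers are all hypothetical and do not reflect the package's internals.

```python
from collections import deque

def collect_network(seed, neighbors, depth=2):
    """Breadth-first walk: return {node: hops_from_seed} within `depth` hops."""
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if hops[node] == depth:
            continue  # do not expand beyond the requested depth
        for nxt in neighbors.get(node, []):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops

# Toy "cited by" graph with placeholder identifiers.
toy_citing = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
}
print(collect_network("seed", toy_citing, depth=2))
```

Node "e" is three hops from the seed, so it is excluded at depth 2; tracking visited nodes in `hops` also keeps shared citations such as "c" from being added twice.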

Performance

| Query | Matches | Time |
|---|---|---|
| hippocampal sharp wave ripples | 541 | 22ms |
| machine learning | 477,922 | 113ms |
| CRISPR genome editing | 12,170 | 257ms |

Searching 167M records in milliseconds via FTS5.

Related Projects

openalex-local - Sister project with OpenAlex data:

| Feature | crossref-local | openalex-local |
|---|---|---|
| Works | 167M | 284M |
| Abstracts | ~21% | ~45-60% |
| Update frequency | Real-time | Monthly |
| DOI authority | ✓ (source) | Uses CrossRef |
| Citations | Raw references | Linked works |
| Concepts/Topics | | |
| Author IDs | | |
| Best for | DOI lookup, raw refs | Semantic search |

When to use CrossRef: Real-time DOI updates, raw reference parsing, authoritative metadata.

When to use OpenAlex: Semantic search, citation analysis, topic discovery.


SciTeX
AGPL-3.0 · ywatanabe@scitex.ai

