Local CrossRef database with 167M+ works and full-text search
CrossRef Local (crossref-local)
Local CrossRef database with 167M+ scholarly works, full-text search, and impact factor calculation
Demo
# Search 167M papers locally — no API rate limits, ~22 ms full-text query
crossref-local search "epilepsy seizure prediction"
# Resolve a DOI to full record (title, abstract, citations, journal IF)
crossref-local search-by-doi 10.1038/nature11247
# Drive from MCP / Claude Code
crossref-local mcp start
The demo image is a live capture against the local DB; the MCP Demo Video
section below covers a 6m55s MCP-driven demo video.
Architecture
┌──────────────────────────┐    ┌──────────────────────────┐
│ CrossRef public dump     │    │ JCR / OpenAlex IF tables │
│ (~100 GB compressed)     │    │                          │
└──────────────┬───────────┘    └──────────────┬───────────┘
               │ dois2sqlite                   │
               ▼                               ▼
      ┌─────────────────┐               ┌──────────────┐
      │ crossref.db     │ ◀── joins ──▶ │ impact-factor│
      │ (SQLite + FTS5) │               │ table        │
      └────────┬────────┘               └──────────────┘
               │
               ▼
┌────────────────────────────────────┐
│ crossref-local: Python / CLI / MCP │
│ search · search-by-doi · cache     │
│ stats · check-citations · relay    │
└────────────────────────────────────┘
The DB lives entirely on disk; crossref-local is a thin facade over
SQLite + FTS5 plus a small impact-factor table. Queries make no network
calls. The only steps that produce database state are the rebuild targets
make fts-build-screen and make citations-build-screen.
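To make the "SQLite + FTS5 + impact-factor table" layering concrete, here is a minimal sketch using only Python's standard sqlite3 module and an in-memory database. The table names, column names, and sample rows are illustrative assumptions, not the actual crossref.db schema, and the sketch assumes the local SQLite build includes FTS5.

```python
import sqlite3

# In-memory stand-in for crossref.db. Schema and rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE works (doi TEXT PRIMARY KEY, title TEXT, journal TEXT, year INTEGER);
    CREATE VIRTUAL TABLE works_fts USING fts5(doi UNINDEXED, title, abstract);
    CREATE TABLE impact_factor (journal TEXT PRIMARY KEY, if_value REAL);
""")

sample = [
    ("10.0000/a", "Seizure prediction from EEG", "Journal A", 2020,
     "Deep learning for epilepsy seizure prediction."),
    ("10.0000/b", "Sharp wave ripples in hippocampus", "Journal B", 2019,
     "Hippocampal ripples and memory."),
]
for doi, title, journal, year, abstract in sample:
    conn.execute("INSERT INTO works VALUES (?, ?, ?, ?)", (doi, title, journal, year))
    conn.execute(
        "INSERT INTO works_fts (doi, title, abstract) VALUES (?, ?, ?)",
        (doi, title, abstract),
    )
conn.executemany(
    "INSERT INTO impact_factor VALUES (?, ?)",
    [("Journal A", 5.0), ("Journal B", 3.0)],
)

def search_sketch(query, limit=10):
    """Full-text match, ranked by BM25, joined to the impact-factor table."""
    sql = """
        SELECT w.doi, w.title, w.year, f.if_value
        FROM works_fts
        JOIN works AS w ON w.doi = works_fts.doi
        LEFT JOIN impact_factor AS f ON f.journal = w.journal
        WHERE works_fts MATCH ?
        ORDER BY bm25(works_fts)
        LIMIT ?
    """
    return conn.execute(sql, (query, limit)).fetchall()

hits = search_sketch("epilepsy seizure prediction")
for doi, title, year, if_value in hits:
    print(doi, title, year, if_value)
```

The query never leaves the process, which is the property the architecture relies on: the cost is paid once at index-build time, not per search.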
MCP Demo Video
Live demonstration of MCP server integration with Claude Code for epilepsy seizure prediction literature review:
- Full-text search on title, abstracts, and keywords across 167M papers (22ms response)
Why CrossRef Local?
Built for the LLM era - features that matter for AI research assistants:
| Feature | Benefit |
|---|---|
| 📝 Abstracts | Full text for semantic understanding |
| 📊 Impact Factor | Filter by journal quality |
| 🔗 Citations | Prioritize influential papers |
| ⚡ Speed | 167M records in ms, no rate limits |
Perfect for: RAG systems, research assistants, literature review automation.
Installation
pip install crossref-local
From source:
git clone https://github.com/ywatanabe1989/crossref-local
cd crossref-local && make install
Database setup (1.5 TB, ~2 weeks to build):
# 1. Download CrossRef data (~100GB compressed)
aria2c "https://academictorrents.com/details/..."
# 2. Build SQLite database (~days)
pip install dois2sqlite
dois2sqlite build /path/to/crossref-data ./data/crossref.db
# 3. Build FTS5 index (~60 hours) & citations table (~days)
make fts-build-screen
make citations-build-screen
Python API
from crossref_local import search, get, count
# Full-text search (22ms for 541 matches across 167M records)
results = search("hippocampal sharp wave ripples")
for work in results:
    print(f"{work.title} ({work.year})")
# Get by DOI
work = get("10.1126/science.aax0758")
print(work.citation())
# Count matches
n = count("machine learning") # 477,922 matches
Async API:
import asyncio
from crossref_local import aio
async def main():
    counts = await aio.count_many(["CRISPR", "neural network", "climate"])
    results = await aio.search("machine learning")
asyncio.run(main())
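For readers wondering what a concurrent multi-query count might look like, the following is a rough sketch of the general fan-out pattern with asyncio.gather and asyncio.to_thread. It is not the library's implementation: blocking_count, count_many_sketch, and the numbers are hypothetical stand-ins so the example runs without a database.

```python
import asyncio

# Hypothetical stand-in for a blocking count query. The numbers are made up.
_FAKE_COUNTS = {"CRISPR": 3, "neural network": 2, "climate": 1}

def blocking_count(query: str) -> int:
    return _FAKE_COUNTS.get(query, 0)

async def count_many_sketch(queries: list[str]) -> dict[str, int]:
    # Run each blocking call in a worker thread and wait for all of them.
    counts = await asyncio.gather(
        *(asyncio.to_thread(blocking_count, q) for q in queries)
    )
    # gather preserves input order, so zip pairs each query with its count.
    return dict(zip(queries, counts))

result = asyncio.run(count_many_sketch(["CRISPR", "neural network", "climate"]))
print(result)
```

Offloading to threads keeps the event loop responsive while blocking database calls run, which matters when the caller is an MCP or HTTP server handling other requests.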
CLI
crossref-local search "CRISPR genome editing" -n 5
crossref-local search-by-doi 10.1038/nature12373
crossref-local status # Configuration and database stats
With abstracts (-a flag):
$ crossref-local search "RS-1 enhances CRISPR" -n 1 -a
Found 4 matches in 128.4ms
1. RS-1 enhances CRISPR/Cas9- and TALEN-mediated knock-in efficiency (2016)
DOI: 10.1038/ncomms10548
Journal: Nature Communications
Abstract: Zinc-finger nuclease, transcription activator-like effector nuclease
and CRISPR/Cas9 are becoming major tools for genome editing...
HTTP API
Start the FastAPI server:
crossref-local relay --host 0.0.0.0 --port 31291
Endpoints:
# Search works (FTS5)
curl "http://localhost:31291/works?q=CRISPR&limit=10"
# Get by DOI
curl "http://localhost:31291/works/10.1038/nature12373"
# Batch DOI lookup
curl -X POST "http://localhost:31291/works/batch" \
-H "Content-Type: application/json" \
-d '{"dois": ["10.1038/nature12373", "10.1126/science.aax0758"]}'
# Citation endpoints
curl "http://localhost:31291/citations/10.1038/nature12373/citing"
curl "http://localhost:31291/citations/10.1038/nature12373/cited"
curl "http://localhost:31291/citations/10.1038/nature12373/count"
# Collection endpoints
curl "http://localhost:31291/collections"
curl -X POST "http://localhost:31291/collections" \
-H "Content-Type: application/json" \
-d '{"name": "my_papers", "query": "CRISPR", "limit": 100}'
curl "http://localhost:31291/collections/my_papers/download?format=bibtex"
# Database info
curl "http://localhost:31291/info"
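The same endpoints can be driven from Python with the standard library alone. The sketch below builds the search URL and the batch POST request shown above; the helper names are hypothetical, and the response schema is not documented here, so fetch_json simply returns whatever JSON the server sends. Only fetch_json needs a running relay.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:31291"  # port taken from the relay example above

def works_url(query: str, limit: int = 10, base: str = BASE_URL) -> str:
    """Build the URL for GET /works with a URL-encoded query string."""
    params = urllib.parse.urlencode({"q": query, "limit": limit})
    return f"{base}/works?{params}"

def batch_request(dois: list[str], base: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST /works/batch request."""
    body = json.dumps({"dois": dois}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/works/batch",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def fetch_json(request_or_url):
    """Send the request and decode the JSON body. Requires a running server."""
    with urllib.request.urlopen(request_or_url, timeout=10) as response:
        return json.load(response)

search_url = works_url("CRISPR genome editing", limit=5)
batch = batch_request(["10.1038/nature12373", "10.1126/science.aax0758"])
print(search_url)
print(batch.get_method(), batch.full_url)
```

With the relay running, fetch_json(search_url) or fetch_json(batch) would perform the actual calls.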
HTTP mode (connect to running server):
# On local machine (if server is remote)
ssh -L 31291:127.0.0.1:31291 your-server
# Python client
from crossref_local import configure_http
configure_http("http://localhost:31291")
# Or via CLI
crossref-local --http search "CRISPR"
MCP Server
Run as MCP (Model Context Protocol) server:
crossref-local mcp start
Local MCP client configuration:
{
"mcpServers": {
"crossref-local": {
"command": "crossref-local",
"args": ["mcp", "start"],
"env": {
"CROSSREF_LOCAL_DB": "/path/to/crossref.db"
}
}
}
}
Remote MCP via HTTP (recommended):
# On server: start persistent MCP server
crossref-local mcp start -t http --host 0.0.0.0 --port 8082
{
"mcpServers": {
"crossref-remote": {
"url": "http://your-server:8082/mcp"
}
}
}
Diagnose setup:
crossref-local mcp doctor # Check dependencies and database
crossref-local mcp list-tools # Show available MCP tools
crossref-local mcp installation # Show client config examples
See docs/remote-deployment.md for systemd and Docker setup.
Available tools:
- search: Full-text search across 167M+ papers
- search_by_doi: Get paper by DOI
- enrich_dois: Add citation counts and references to DOIs
- status: Database statistics
- cache_*: Paper collection management
Impact Factor
from crossref_local.impact_factor import ImpactFactorCalculator
with ImpactFactorCalculator() as calc:
    result = calc.calculate_impact_factor("Nature", target_year=2023)
    print(f"IF: {result['impact_factor']:.3f}") # 54.067
| Journal | IF 2023 |
|---|---|
| Nature | 54.07 |
| Science | 46.17 |
| Cell | 54.01 |
| PLOS ONE | 3.37 |
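The README does not spell out the formula the calculator uses. Assuming it follows the standard two-year definition (citations received in the target year to items published in the two preceding years, divided by the number of citable items published in those two years), the arithmetic can be sketched as follows. The function name and all counts are hypothetical.

```python
def two_year_impact_factor(
    citations_by_pub_year: dict[int, int],
    items_by_pub_year: dict[int, int],
    target_year: int,
) -> float:
    """Standard two-year impact factor for target_year.

    citations_by_pub_year: citations made in target_year, keyed by the
        publication year of the cited item.
    items_by_pub_year: citable items the journal published, keyed by year.
    """
    window = (target_year - 1, target_year - 2)
    citations = sum(citations_by_pub_year.get(y, 0) for y in window)
    items = sum(items_by_pub_year.get(y, 0) for y in window)
    if items == 0:
        raise ValueError("no citable items in the two-year window")
    return citations / items

# Made-up counts for a hypothetical journal, not real data.
citations_in_2023 = {2022: 300, 2021: 500}
citable_items = {2022: 100, 2021: 100}
impact = two_year_impact_factor(citations_in_2023, citable_items, target_year=2023)
print(f"IF: {impact:.3f}")  # IF: 4.000
```

Values computed from CrossRef citation data may differ from published JCR figures, since the underlying citation counts come from different sources.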
Citation Network
from crossref_local import get_citing, get_cited, CitationNetwork
citing = get_citing("10.1038/nature12373") # 1539 papers
cited = get_cited("10.1038/nature12373")
# Build visualization (like Connected Papers)
network = CitationNetwork("10.1038/nature12373", depth=2)
network.save_html("citation_network.html") # requires: pip install crossref-local[viz]
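The depth parameter suggests a bounded traversal of the citation graph. As an illustration of that idea only (not the CitationNetwork implementation), here is a depth-limited breadth-first walk over a toy "cited by" map with placeholder identifiers.

```python
from collections import deque

# Toy "cited by" map: key is a paper, value lists papers that cite it.
# Identifiers are placeholders, not real DOIs.
CITING = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
}

def citation_edges(seed: str, depth: int) -> list[tuple[str, str]]:
    """Collect (cited, citing) edges within `depth` hops of the seed paper."""
    edges = []
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        node, level = queue.popleft()
        if level == depth:
            continue  # do not expand nodes sitting at the depth limit
        for citing in CITING.get(node, []):
            edges.append((node, citing))
            if citing not in seen:
                seen.add(citing)
                queue.append((citing, level + 1))
    return edges

edges = citation_edges("seed", depth=2)
for cited, citing in edges:
    print(f"{citing} cites {cited}")
```

The seen set keeps shared citers such as "c" from being expanded twice, while every edge into them is still recorded for the visualization.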
Performance
| Query | Matches | Time |
|---|---|---|
| hippocampal sharp wave ripples | 541 | 22ms |
| machine learning | 477,922 | 113ms |
| CRISPR genome editing | 12,170 | 257ms |
Searching 167M records in milliseconds via FTS5.
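To take comparable measurements on your own hardware, a small wall-clock helper is enough. The sketch below uses time.perf_counter; fake_search is a stand-in so the example runs without a database, and would be replaced by the real query function.

```python
import time

def timed_ms(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

# Stand-in for a search call. Swap in the real query function when a DB is available.
def fake_search(query):
    titles = ("machine learning", "deep learning", "genome editing")
    return [t for t in titles if query in t]

hits, ms = timed_ms(fake_search, "learning")
print(f"{len(hits)} matches in {ms:.1f}ms")
```

Timings depend on disk speed and on whether the relevant database pages are already in the OS cache, so repeated runs of the same query are usually faster than the first.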
Related Projects
openalex-local - Sister project with OpenAlex data:
| Feature | crossref-local | openalex-local |
|---|---|---|
| Works | 167M | 284M |
| Abstracts | ~21% | ~45-60% |
| Update frequency | Real-time | Monthly |
| DOI authority | ✓ (source) | Uses CrossRef |
| Citations | Raw references | Linked works |
| Concepts/Topics | ❌ | ✓ |
| Author IDs | ❌ | ✓ |
| Best for | DOI lookup, raw refs | Semantic search |
- When to use CrossRef: Real-time DOI updates, raw reference parsing, authoritative metadata.
- When to use OpenAlex: Semantic search, citation analysis, topic discovery.
Installation
Recommended:
uv pip install crossref-local[all]: uv's Rust resolver handles the SciTeX dependency set in 1-3 min, where pip's serial backtracker can take 30+ min on the full extras. Plain pip install still works; the block below shows the pip commands.
pip install crossref-local # core
pip install crossref-local[mcp] # + MCP server
4 Interfaces
Python API
from crossref_local import crossref_search, get_work
results = crossref_search("deep learning EEG", limit=10)
work = get_work("10.1038/nature12373")
CLI
crossref-local search "query"
crossref-local search-by-doi 10.1038/nature12373
MCP Server
crossref-local mcp start
Skills
Agent skill pages live under src/crossref_local/_skills/crossref-local/.
Problem and Solution
| # | Problem | Solution |
|---|---|---|
| 1 | The CrossRef public API is rate-limited, requires internet access, and is slow for bulk queries, which makes it the bottleneck for literature tools working across 167M works | Local SQLite + FTS5: the full CrossRef dump (~100 GB compressed) is queryable offline; crossref_search returns in milliseconds |
Part of SciTeX
crossref-local is part of SciTeX. Install the umbrella package with
pip install scitex[scholar] to use it as scitex.scholar (Python) or
scitex scholar ... (CLI); crossref-local provides the local CrossRef
backend for scholar's DOI resolution.
import scitex
scitex.scholar.enrich_bibtex("references.bib")
scitex.scholar.check_citations("manuscript.tex")
Four Freedoms for Research
- The freedom to run your research anywhere — your machine, your terms.
- The freedom to study how every step works — from raw data to final manuscript.
- The freedom to redistribute your workflows, not just your papers.
- The freedom to modify any module and share improvements with the community.
AGPL-3.0 — because we believe research infrastructure deserves the same freedoms as the software it runs on.