Multi-source academic paper clawler for PDE/neural operator/physics-informed ML: OAI-PMH download + PDF download + SQLite paper_store + Crossref + MCP server
hfpapers-clawler
Naming philosophy:
claw (sharp grasp) ≠ crawl (creep). hfpclawer = HuggingFace Papers + claw + er = "A sharp tool that claws HF papers with precision" 🦞 Not a crawler: faster, sharper, more precise. Same series: OpenClaw, Hermes Agent ecosystem.
A multi-source academic paper clawler for PDE / neural operator / physics-informed ML. Built with SQLite paper_store, Crossref cross-validation, anti-crawl Scrapy pipelines, and MCP server.
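How the Crossref cross-validation works is not spelled out here. One plausible reading is that a stored title is compared with the title Crossref reports for the same work. The sketch below shows only that comparison step, offline and with the standard library. The helper names, the normalization, and the 0.9 threshold are assumptions for illustration, not hfpclawer's actual logic.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and trim the result."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", title.lower())
    return cleaned.strip()


def titles_match(local_title: str, remote_title: str, threshold: float = 0.9) -> bool:
    """Return True when two titles are near-identical after normalization."""
    ratio = SequenceMatcher(
        None, normalize_title(local_title), normalize_title(remote_title)
    ).ratio()
    return ratio >= threshold
```

A fuzzy ratio rather than strict equality tolerates differences in casing and trailing punctuation between sources, at the cost of a threshold that has to be tuned.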
Quick Install
pip install hfpclawer
Dependencies
- Core (auto-installed): pyyaml, requests, beautifulsoup4, typer, etc.
- LLM features (optional): pip install hfpclawer[llm] (for the sniff/analyze commands)
- PDF conversion (optional): pip install hfpclawer[pdf]
- Scrapy spiders (optional): pip install hfpclawer[scrapy]
- Dev (testing): pip install hfpclawer[dev]
- arXiv local search (optional): pip install hfpclawer[arxiv] (requires access to a private GitLab repo)
Local Development
git clone <your-repo>
cd hfpapers-clawler
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install in editable mode with dev dependencies
pip install -e ".[dev]"
# Verify
hfpclawer --help
Configuration
First, run hfpclawer init to generate the config and env template:
hfpclawer init --quick # Quick mode (defaults)
# or
hfpclawer init # Interactive wizard
cp .env.template .env # Fill in API keys
# Edit config.yaml to customize search queries
Alternatively, create the files manually (see docs/USAGE.md for the full reference).
CLI Commands
# Search for new papers
hfpclawer search # Default 3 pages, threshold 30
hfpclawer search --max-pages 5 # More pages
hfpclawer search --dry-run # Show only, don't save
# Full pipeline: search → download → convert
hfpclawer full
# SQLite Paper Store operations
hfpclawer store stats # Storage statistics
hfpclawer store search # List all papers
hfpclawer store search --keyword "FNO"
hfpclawer store verify --aid 2301.11167
# Download & convert
hfpclawer download # Download top-20 PDFs
hfpclawer convert # PDF → Markdown
# MCP Server (for Hermes Agent / OpenCode)
hfpclawer mcp # Default port :8765
Python API
from hfpapers.paper_store import PaperStore, PaperRecord, ensure_paper
# Create a store
store = PaperStore(db_path="/tmp/papers.db")
# Add a paper
rec = PaperRecord(
    title="Fourier Neural Operator",
    abstract="Learning PDE solution operators with Fourier transforms",
    year=2023,
    source="my_app",
    relevance=90,
)
sf_id = store.upsert_paper(rec)
store.add_identifier(sf_id, "arxiv", "2010.08895")
# Search
papers = store.search_papers("neural operator")
for p in papers:
    print(f"[{p.relevance}] {p.title}")
# Hardware probe
from hfpapers.hardware import HardwareProbe
hw = HardwareProbe()
print(f"Hardware: {hw.summary()}")
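The upsert_paper / add_identifier calls above suggest a two-table layout: one row per paper, plus a separate table that maps external identifiers (such as an arXiv ID) onto that paper. As a rough illustration of the idea using only the standard-library sqlite3 module, and assuming table and column names that may differ from the real paper_store schema:

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE papers (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        relevance INTEGER
    );
    CREATE TABLE identifiers (
        paper_id INTEGER NOT NULL REFERENCES papers(id),
        scheme   TEXT NOT NULL,
        value    TEXT NOT NULL,
        UNIQUE (scheme, value)
    );
    """
)

# Insert a paper, then attach an external identifier to it.
cur = conn.execute(
    "INSERT INTO papers (title, relevance) VALUES (?, ?)",
    ("Fourier Neural Operator", 90),
)
paper_id = cur.lastrowid
conn.execute(
    "INSERT INTO identifiers (paper_id, scheme, value) VALUES (?, ?, ?)",
    (paper_id, "arxiv", "2010.08895"),
)

# Look the paper up through one of its external identifiers.
row = conn.execute(
    """
    SELECT p.title, p.relevance
    FROM papers AS p
    JOIN identifiers AS i ON i.paper_id = p.id
    WHERE i.scheme = ? AND i.value = ?
    """,
    ("arxiv", "2010.08895"),
).fetchone()
print(row)
```

Keeping identifiers in their own table lets one paper carry several IDs (arXiv, DOI, OpenReview) without widening the papers table for each new source.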
MCP Server
hfpapers-clawler ships with a built-in MCP server for AI agent integration:
hfpclawer mcp
Register it in the Hermes Agent config, ~/.hermes/config.yaml:
mcp:
  servers:
    hfpapers:
      command: "hfpclawer"
      args: ["mcp", "--port", "8765"]
Available MCP tools: hfpclawer_search, hfpclawer_download, hfpclawer_convert, hfpclawer_info, hfpclawer_list, hfpclawer_stats, hfpclawer_full.
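MCP clients invoke tools with a JSON-RPC 2.0 request whose method is tools/call. The sketch below only builds such a message for one of the listed tools. The argument name max_pages is hypothetical (it mirrors the CLI's --max-pages flag), and how the message reaches the server depends on the transport, which is not described here.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# "max_pages" is a hypothetical argument name, not a documented tool parameter.
payload = build_tool_call(1, "hfpclawer_search", {"max_pages": 3})
print(payload)
```

In practice an MCP client library assembles this envelope; the tool's real argument schema can be read from the server's tools/list response.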
Architecture
┌─ CLI (Typer) ─┐   ┌─ MCP Server ─┐
└──────┬────────┘   └──────┬───────┘
       └────────┬──────────┘
                ▼
┌─ Scrapy Layer (Multi-source) ───────────┐
│ ArxivSearchSpider | OpenReviewSpider    │
│ HFPapersSpider    | MultiSourceSpider   │
│ Middleware: UA random, delay, proxy...  │
│ Pipeline: Store→Classify→Export→DL      │
└──────────────────┬──────────────────────┘
                   ▼
┌─ Paper Store (SQLite) ──────────────────┐
│ papers (Snowflake ID) | identifiers     │
│ crossref_cache        | CrossrefClient  │
└─────────────────────────────────────────┘
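The diagram labels the papers table with a Snowflake ID (the sf_id returned in the Python API). The bit layout hfpclawer uses is not stated, so the following is a minimal sketch of the classic scheme, assuming a 41-bit millisecond timestamp, a 10-bit worker ID, and a 12-bit per-millisecond sequence. The class name and the epoch are illustrative choices.

```python
import time

EPOCH_MS = 1_288_834_974_657  # Twitter's original Snowflake epoch; arbitrary here
WORKER_BITS = 10
SEQUENCE_BITS = 12


class SnowflakeGenerator:
    """Generate 64-bit, time-ordered IDs laid out as timestamp | worker | sequence."""

    def __init__(self, worker_id: int = 0) -> None:
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0

    def next_id(self) -> int:
        now_ms = int(time.time() * 1000)
        if now_ms < self.last_ms:
            # Clock moved backwards: keep using the last timestamp seen.
            now_ms = self.last_ms
        if now_ms == self.last_ms:
            self.sequence = (self.sequence + 1) & ((1 << SEQUENCE_BITS) - 1)
            if self.sequence == 0:
                # Sequence exhausted for this millisecond: wait for the next one.
                while now_ms <= self.last_ms:
                    now_ms = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ms = now_ms
        return (
            ((now_ms - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
            | (self.worker_id << SEQUENCE_BITS)
            | self.sequence
        )
```

IDs built this way sort by creation time and need no central counter, which suits a store that merges records from several spiders. This sketch is not thread-safe; a lock around next_id would be needed for concurrent writers.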
Tests
pip install -e ".[dev]"
pytest tests/ -v # Run all tests
pytest tests/ --cov=hfpapers # With coverage
License
MIT