Multi-source academic paper clawler for PDE/neural operator/physics-informed ML: OAI-PMH download + PDF download + SQLite paper_store + Crossref + MCP server

hfpapers-clawler

Naming philosophy: claw (sharp grasp) ≠ crawl (creep). hfpclawer = HuggingFace Papers + claw + er = "A sharp tool that claws HF papers with precision" 🦞

Not a crawler: faster, sharper, more precise. Part of the same series as OpenClaw and the Hermes Agent ecosystem.

A multi-source academic paper clawler for PDE / neural operator / physics-informed ML. Built with a SQLite paper_store, Crossref cross-validation, anti-crawl Scrapy pipelines, and an MCP server.


Quick Install

pip install hfpclawer

Dependencies

  • Core (auto-installed): pyyaml, requests, beautifulsoup4, typer, etc.
  • LLM features (optional): pip install hfpclawer[llm] — for sniff / analyze commands
  • PDF conversion (optional): pip install hfpclawer[pdf]
  • Scrapy spiders (optional): pip install hfpclawer[scrapy]
  • Dev (testing): pip install hfpclawer[dev]
  • arXiv local search (optional): pip install hfpclawer[arxiv] — requires access to private GitLab repo

Local Development

git clone <your-repo>
cd hfpapers-clawler

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# Verify
hfpclawer --help

Configuration

First, run hfpclawer init to generate the config file and the .env template:

hfpclawer init --quick          # Quick mode (defaults)
# or
hfpclawer init                  # Interactive wizard
cp .env.template .env           # Fill in API keys
# Edit config.yaml to customize search queries

Alternatively, create the files manually; see docs/USAGE.md for the full reference.


CLI Commands

# Search for new papers
hfpclawer search                    # Default 3 pages, threshold 30
hfpclawer search --max-pages 5      # More pages
hfpclawer search --dry-run          # Show only, don't save

# Full pipeline: search → download → convert
hfpclawer full

# SQLite Paper Store operations
hfpclawer store stats               # Storage statistics
hfpclawer store search              # List all papers
hfpclawer store search --keyword "FNO"
hfpclawer store verify --aid 2301.11167

# Download & convert
hfpclawer download                  # Download top-20 PDFs
hfpclawer convert                   # PDF → Markdown

# MCP Server (for Hermes Agent / OpenCode)
hfpclawer mcp                       # Default port :8765
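How store verify checks a stored paper against Crossref is not described here. As a rough illustration of the general idea of cross-validation, the sketch below compares a locally stored title with the title field of a Crossref-style record (Crossref returns titles as a list). The function names, the 0.9 cutoff, and the hand-written record are all assumptions for illustration, not hfpclawer's actual logic, and no network call is made.

```python
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count."""
    return " ".join(title.lower().split())


def titles_match(local_title: str, crossref_title: str, min_ratio: float = 0.9) -> bool:
    """Return True when two titles are near-identical after normalization.

    min_ratio is a hypothetical cutoff, not a value taken from hfpclawer.
    """
    ratio = SequenceMatcher(
        None, normalize_title(local_title), normalize_title(crossref_title)
    ).ratio()
    return ratio >= min_ratio


# Hand-written stand-in for a Crossref response; nothing is fetched.
crossref_record = {
    "title": ["Fourier  Neural Operator for Parametric Partial Differential Equations"]
}
local = "Fourier Neural Operator for Parametric Partial Differential Equations"

print(titles_match(local, crossref_record["title"][0]))
print(titles_match(local, "Attention Is All You Need"))
```

A real check would likely also compare authors, year, and DOI; title similarity alone is only a first filter.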

Python API

from hfpapers.paper_store import PaperStore, PaperRecord, ensure_paper

# Create a store
store = PaperStore(db_path="/tmp/papers.db")

# Add a paper
rec = PaperRecord(
    title="Fourier Neural Operator",
    abstract="Learning PDE solution operators with Fourier transforms",
    year=2020,
    source="my_app",
    relevance=90,
)
sf_id = store.upsert_paper(rec)
store.add_identifier(sf_id, "arxiv", "2010.08895")

# Search
papers = store.search_papers("neural operator")
for p in papers:
    print(f"[{p.relevance}] {p.title}")

# Hardware probe
from hfpapers.hardware import HardwareProbe
hw = HardwareProbe()
print(f"Hardware: {hw.summary()}")

MCP Server

hfpapers-clawler ships with a built-in MCP server for AI agent integration:

hfpclawer mcp

Register in Hermes Agent ~/.hermes/config.yaml:

mcp:
  servers:
    hfpapers:
      command: "hfpclawer"
      args: ["mcp", "--port", "8765"]

Available MCP tools: hfpclawer_search, hfpclawer_download, hfpclawer_convert, hfpclawer_info, hfpclawer_list, hfpclawer_stats, hfpclawer_full.
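For readers unfamiliar with MCP, tool invocations travel as JSON-RPC 2.0 messages using the tools/call method. The sketch below only builds such a message for hfpclawer_search and prints it; it does not connect to the server. The helper name build_tool_call and the argument names query and max_pages are hypothetical, since the tools' input schemas are not listed here. An agent would normally discover the real schema through tools/list.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per call


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# "query" and "max_pages" are hypothetical argument names, for illustration only.
payload = build_tool_call("hfpclawer_search", {"query": "neural operator", "max_pages": 3})
print(payload)
```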


Architecture

┌─ CLI (Typer) ─┐  ┌─ MCP Server ─┐
└──────┬────────┘  └──────┬───────┘
       └────────┬─────────┘
                ▼
┌─ Scrapy Layer (Multi-source) ───────────┐
│  ArxivSearchSpider | OpenReviewSpider   │
│  HFPapersSpider | MultiSourceSpider     │
│  Middleware: UA random, delay, proxy... │
│  Pipeline: Store→Classify→Export→DL     │
└──────────────────┬──────────────────────┘
                   ▼
┌─ Paper Store (SQLite) ──────────────────┐
│  papers (Snowflake ID) | identifiers    │
│  crossref_cache | CrossrefClient        │
└─────────────────────────────────────────┘
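The Paper Store box above pairs a papers table keyed by Snowflake IDs with a separate identifiers table. The sketch below shows one way that combination can work, using only the standard library. The bit layout (41-bit timestamp, 10-bit worker id, 12-bit sequence), the custom epoch, and the column names are assumptions for illustration; the actual hfpclawer schema may differ.

```python
import sqlite3
import threading
import time

EPOCH_MS = 1_577_836_800_000  # 2020-01-01 UTC; an arbitrary custom epoch for this sketch


class SnowflakeGenerator:
    """Assumed layout: 41-bit timestamp | 10-bit worker id | 12-bit sequence."""

    def __init__(self, worker_id: int = 0) -> None:
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            # Never go backwards, even if the wall clock does.
            now = max(int(time.time() * 1000), self._last_ms)
            if now == self._last_ms:
                self._seq = (self._seq + 1) & 0xFFF
                if self._seq == 0:
                    now += 1  # 4096 ids used this millisecond; borrow the next one
            else:
                self._seq = 0
            self._last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self._seq


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE papers (
        id    INTEGER PRIMARY KEY,   -- Snowflake ID, fits in a signed 64-bit integer
        title TEXT NOT NULL
    );
    CREATE TABLE identifiers (
        paper_id INTEGER NOT NULL REFERENCES papers(id),
        scheme   TEXT NOT NULL,      -- e.g. "arxiv" or "doi"
        value    TEXT NOT NULL,
        UNIQUE (scheme, value)
    );
""")

gen = SnowflakeGenerator(worker_id=1)
paper_id = gen.next_id()
conn.execute("INSERT INTO papers (id, title) VALUES (?, ?)", (paper_id, "Fourier Neural Operator"))
conn.execute("INSERT INTO identifiers VALUES (?, ?, ?)", (paper_id, "arxiv", "2010.08895"))

# Look a paper up by any external identifier.
row = conn.execute(
    "SELECT p.title FROM papers p JOIN identifiers i ON i.paper_id = p.id "
    "WHERE i.scheme = ? AND i.value = ?",
    ("arxiv", "2010.08895"),
).fetchone()
print(row[0])
```

Keeping identifiers in their own table lets one paper carry an arXiv ID, a DOI, and other keys at once, while the Snowflake ID stays stable as the internal primary key.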

Tests

pip install -e ".[dev]"
pytest tests/ -v           # Run all tests
pytest tests/ --cov=hfpapers  # With coverage

License

MIT
