Asynchronous high-concurrency citation crawler, use with caution

Project description

citation-crawler

Asynchronous high-concurrency citation data crawler, use with caution!

Currently only Semantic Scholar is supported: both references and citations can be crawled from it.

Crawl citation data and organize it into an undirected graph. Each node is a paper, each edge is a citation relationship.
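The graph layout described above (papers as nodes, citation relationships as undirected edges) can be pictured with a minimal sketch. This is not the package's own code or API; the function name `build_citation_graph` and the paper IDs are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def build_citation_graph(citation_pairs):
    """Build an undirected graph from (citing, cited) paper ID pairs.

    Hypothetical helper for illustration, not part of citation-crawler.
    Nodes are paper IDs; an edge means one paper cites the other.
    """
    graph = defaultdict(set)
    for citing, cited in citation_pairs:
        if citing == cited:
            continue  # skip self-loops
        # Undirected: store the link in both directions.
        graph[citing].add(cited)
        graph[cited].add(citing)
    return dict(graph)


# Made-up paper IDs, for illustration only.
pairs = [("paper_A", "paper_B"), ("paper_A", "paper_C"), ("paper_C", "paper_B")]
graph = build_citation_graph(pairs)
print(sorted(graph["paper_A"]))  # ['paper_B', 'paper_C']
```

Because the graph is undirected, the direction of a citation (who cites whom) is not preserved in this representation; only the existence of the link is.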

Install

pip install citation-crawler

Usage

Config environment variables

  • CITATION_CRAWLER_MAX_CACHE_DAYS_AUTHORS:
    • number of days to keep the cache of a paper's authors page (used to get the authors of a published paper)
    • default: -1 (cache forever, since the authors of a published paper are unlikely to change)
  • CITATION_CRAWLER_MAX_CACHE_DAYS_REFERENCES:
    • number of days to keep the cache of a references page (used to get the references of a published paper)
    • default: -1 (cache forever, since the references of a published paper are unlikely to change)
  • CITATION_CRAWLER_MAX_CACHE_DAYS_CITATIONS:
    • number of days to keep the cache of a citations page (used to get the citations of a published paper)
    • default: 7 (the citations of a paper may change frequently)
  • CITATION_CRAWLER_MAX_CACHE_DAYS_PAPER:
    • number of days to keep the cache of a paper detail page (used to get the details of a paper)
    • default: -1 (cache forever, since the detailed information of a published paper is unlikely to change)
  • HTTP_PROXY:
    • set it to http://your_user:your_password@your_proxy_url:your_proxy_port if you want to use a proxy
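To make the cache settings above concrete, here is a minimal sketch of how such variables could be read and applied. It is not the package's actual implementation: the helper names `max_cache_days` and `cache_is_fresh` are hypothetical, and treating every negative value (not just -1) as "cache forever" is an assumption. Only the variable names and defaults come from the list above.

```python
import os
from datetime import datetime, timedelta

# Defaults as documented above; -1 means "cache forever".
DEFAULT_MAX_CACHE_DAYS = {
    "CITATION_CRAWLER_MAX_CACHE_DAYS_AUTHORS": -1,
    "CITATION_CRAWLER_MAX_CACHE_DAYS_REFERENCES": -1,
    "CITATION_CRAWLER_MAX_CACHE_DAYS_CITATIONS": 7,
    "CITATION_CRAWLER_MAX_CACHE_DAYS_PAPER": -1,
}


def max_cache_days(name, environ=None):
    """Read one cache setting, falling back to its documented default.

    Hypothetical helper; `environ` defaults to the process environment.
    """
    environ = os.environ if environ is None else environ
    return int(environ.get(name, DEFAULT_MAX_CACHE_DAYS[name]))


def cache_is_fresh(saved_at, max_days, now=None):
    """Return True if a cache entry saved at `saved_at` is still usable.

    Assumption: any negative `max_days` means the entry never expires.
    """
    if max_days < 0:
        return True
    now = datetime.now() if now is None else now
    return now - saved_at <= timedelta(days=max_days)


# Example with an explicit environment mapping instead of os.environ.
env = {"CITATION_CRAWLER_MAX_CACHE_DAYS_CITATIONS": "3"}
days = max_cache_days("CITATION_CRAWLER_MAX_CACHE_DAYS_CITATIONS", env)
saved = datetime(2024, 1, 1)
print(days)                                                # 3
print(cache_is_fresh(saved, days, datetime(2024, 1, 3)))   # True
print(cache_is_fresh(saved, days, datetime(2024, 1, 10)))  # False
```

In practice the variables would be exported in the shell (or set in `os.environ`) before the crawler starts, so that they are visible when the settings are read.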
