dblp-crawler
Asynchronous high-concurrency dblp crawler, use with caution!
Install
pip install dblp-crawler
Usage
Help
python -m dblp_crawler -h
usage: __main__.py [-h] [-y YEAR] -k KEYWORD [-p PID] [-j JOURNAL] {networkx,neo4j} ...
positional arguments:
{networkx,neo4j} sub-command help
networkx networkx help
neo4j neo4j help
optional arguments:
-h, --help show this help message and exit
-y YEAR, --year YEAR Only crawl the paper after the specified year.
-k KEYWORD, --keyword KEYWORD
Specify keyword rules.
-p PID, --pid PID Specified author pids to start crawling.
-j JOURNAL, --journal JOURNAL
Specify author journal keys to start crawling.
python -m dblp_crawler networkx -h
usage: __main__.py networkx [-h] --dest DEST
optional arguments:
-h, --help show this help message and exit
--dest DEST Path to write results.
python -m dblp_crawler neo4j -h
usage: __main__.py neo4j [-h] [--auth AUTH] --uri URI
optional arguments:
-h, --help show this help message and exit
--auth AUTH Auth to neo4j database.
--uri URI URI to neo4j database.
Config environment variables
DBLP_CRAWLER_MAX_CACHE_DAYS_PERSON
- how many days to cache a person page
- default: 30
DBLP_CRAWLER_MAX_CACHE_DAYS_JOURNAL
- how many days to cache a journal page
- default: 300
DBLP_CRAWLER_MAX_CACHE_DAYS_JOURNAL_LIST
- how many days to cache a journal list page
- default: 300
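For example, to make the crawler re-fetch author pages after a week while leaving the journal caches at their defaults, set the variable before running it (the 7-day value is only an illustration):
export DBLP_CRAWLER_MAX_CACHE_DAYS_PERSON=7
python -m dblp_crawler -k video -k edge -p l/JiangchuanLiu networkx --dest summary.json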
Write to a JSON file
e.g. write to summary.json:
python -m dblp_crawler -k video -k edge -p l/JiangchuanLiu networkx --dest summary.json
Write to a Neo4J database
e.g. write to a database at neo4j://10.128.202.18:7687:
python -m dblp_crawler -k video -k edge -p l/JiangchuanLiu neo4j --uri neo4j://10.128.202.18:7687
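If the database requires authentication, pass the credentials with --auth; the user/password form below is an assumption about the expected format, so check your Neo4j setup:
python -m dblp_crawler -k video -k edge -p l/JiangchuanLiu neo4j --auth neo4j/password --uri neo4j://10.128.202.18:7687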
Only crawl papers after a specified year
e.g. crawl papers from 2016 onward (2016 included)
python -m dblp_crawler -k video -k edge -p l/JiangchuanLiu -y 2016 networkx --dest summary.json
Keywords with two or more words
e.g. super resolution (publications whose titles contain both "super" and "resolution" will be selected)
python -m dblp_crawler -k video -k edge -p l/JiangchuanLiu -k "'super','resolution'" networkx --dest summary.json
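Each -k option adds an independent rule: a publication is kept when any rule matches, and a quoted, comma-separated rule matches only when the title contains all of its words. This reading of the rules is inferred from the examples above; a sketch that keeps papers whose titles contain "edge", or both "super" and "resolution":
python -m dblp_crawler -k edge -k "'super','resolution'" -p l/JiangchuanLiu networkx --dest summary.json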
Init authors from a journal
e.g. init authors from ACM MM (db/conf/mm is the key for ACM MM in dblp: "https://dblp.org/db/conf/mm/index.xml")
python -m dblp_crawler -k video -k edge -j db/conf/mm networkx --dest summary.json
Init authors from journals listed in a variable
e.g. CCF_A in dblp_crawler.data contains the keys of CCF A conferences
python -m dblp_crawler -k video -k edge -j "importlib.import_module('dblp_crawler.data').CCF_A" networkx --dest summary.json
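To check which keys such a variable holds before starting a long crawl, a one-liner like this should work (it assumes CCF_A is an ordinary Python sequence of dblp keys):
python -c "from dblp_crawler.data import CCF_A; print(len(CCF_A)); print(list(CCF_A)[:3])"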
dblp_crawler.filter
Usage
Help
python -m dblp_crawler.filter -h
usage: __main__.py [-h] -i INPUT -o OUTPUT -f FILTER
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Input file path.
-o OUTPUT, --output OUTPUT
Output file path.
-f FILTER, --filter FILTER
Filter functions.
Use internal filter
e.g. drop_old_publications is an internal function that drops publications by year
python -m dblp_crawler.filter -i summary.json -o summary.filter.json -f "lambda summary: drop_old_publications(summary, 2016)"
Use multiple filters
e.g. drop_old_publications is an internal function that drops publications by year; drop_nodes_by_all_publications is an internal function that drops nodes by their total number of publications
python -m dblp_crawler.filter -i summary.json -o summary.filter.json \
-f "lambda summary: drop_old_publications(summary, 2016)" \
-f "lambda summary: drop_old_person_publications(summary, 2018)" \
-f "lambda summary: drop_old_cooperation(summary, 2018)" \
-f "lambda summary: drop_nodes_by_all_publications(summary, 4)" \
-f "lambda summary: drop_edges_by_all_publications(summary, 4)"
Use your own filter
e.g. an equivalent way to write -f "lambda summary: drop_old_publications(summary, 2016)"
python -m dblp_crawler.filter -i summary.json -o summary.filter.json -f "lambda summary: importlib.import_module('dblp_crawler.filter').drop_old_publications(summary, 2016)"
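The same importlib pattern works for filters you write yourself, since the -f expression only has to evaluate to a callable that takes the summary and returns it. A minimal sketch; the module name my_filters and its pass-through behavior are hypothetical:
# my_filters.py -- a filter is any callable that takes the summary and returns it
def keep_everything(summary):
    # inspect or prune the summary graph here before returning it
    return summary
python -m dblp_crawler.filter -i summary.json -o summary.filter.json -f "importlib.import_module('my_filters').keep_everything"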