Project description

citation-crawler

Asynchronous high-concurrency citation crawler, use with caution!

Currently only supports Semantic Scholar: crawls the references and citations of papers on Semantic Scholar.

Crawls citation data and organizes it into an undirected graph, where each node is a paper and each edge is a citation relationship.

The Neo4J output is compatible with dblp-crawler; the same paper is recognized automatically, so no duplicate nodes are created.

Install

pip install citation-crawler

Usage

python -m citation_crawler -h
usage: __main__.py [-h] [-y YEAR] [-l LIMIT] -k KEYWORD [-p PID] {networkx,neo4j} ...

positional arguments:
  {networkx,neo4j}      sub-command help
    networkx            Write results to a json file.
    neo4j               Write result to neo4j database

optional arguments:
  -h, --help            show this help message and exit
  -y YEAR, --year YEAR  Only crawl papers published after the specified year.
  -l LIMIT, --limit LIMIT
                        Limit on the BFS depth.
  -k KEYWORD, --keyword KEYWORD
                        Specify keyword rules.
  -p PID, --pid PID     Specify a list of paperIds to start crawling from.
python -m citation_crawler networkx -h
usage: __main__.py networkx [-h] --dest DEST

optional arguments:
  -h, --help   show this help message and exit
  --dest DEST  Path to write results.
python -m citation_crawler neo4j -h   
usage: __main__.py neo4j [-h] [--auth AUTH] --uri URI

optional arguments:
  -h, --help   show this help message and exit
  --auth AUTH  Auth to neo4j database.
  --uri URI    URI to neo4j database.

Config environment variables

  • CITATION_CRAWLER_MAX_CACHE_DAYS_AUTHORS:
    • how many days to cache a paper's authors page (used to get the authors of a published paper)
    • default: -1 (cache forever, since the authors of a paper are not likely to change)
  • CITATION_CRAWLER_MAX_CACHE_DAYS_REFERENCES:
    • how many days to cache a references page (used to get the references of a published paper)
    • default: -1 (cache forever, since the references of a paper are not likely to change)
  • CITATION_CRAWLER_MAX_CACHE_DAYS_CITATIONS:
    • how many days to cache a citations page (used to get the citations of a published paper)
    • default: 7 (the citations of a paper may change frequently)
  • CITATION_CRAWLER_MAX_CACHE_DAYS_PAPER:
    • how many days to cache a paper detail page (used to get the details of a paper)
    • default: -1 (cache forever, since the details of a published paper are not likely to change)
  • HTTP_PROXY:
    • set it to http://your_user:your_password@your_proxy_url:your_proxy_port if you want to use a proxy
  • HTTP_CONCORRENT:
    • number of concurrent HTTP requests
    • default: 8
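For example, the variables above can be set in the shell before launching a crawl. This is only a sketch: the variable names are the ones documented above, while the values (30 days, 16 requests) and the keyword/paperId arguments are arbitrary examples.

```shell
# Cache citation pages for 30 days instead of the default 7
export CITATION_CRAWLER_MAX_CACHE_DAYS_CITATIONS=30
# Allow up to 16 concurrent HTTP requests instead of the default 8
export HTTP_CONCORRENT=16

python -m citation_crawler -k video -p 27d5dc70280c8628f181a7f8881912025f808256 networkx --dest summary.json
```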

Write to a JSON file

e.g. write to summary.json:

python -m citation_crawler -k video -k edge -p 27d5dc70280c8628f181a7f8881912025f808256 networkx --dest summary.json

JSON format

{
  "nodes": {
    "<paperId of a paper in Semantic Scholar>": {
      "paperId": "<paperId of this paper in Semantic Scholar>",
      "dblp_key": "<dblp id of this paper>",
      "title": "<title of this paper>",
      "year": "int <publish year of this paper>",
      "doi": "<doi of this paper>",
      "authors": [
        {
          "authorId": "<authorId of this person in Semantic Scholar>",
          "name": "<name of this person>",
          "dblp_name": [
            "<disambiguation name of this person in dblp>",
            "<disambiguation name of this person in dblp>",
            "<disambiguation name of this person in dblp>",
            "......"
          ]
        },
        { ...... },
        { ...... },
        ......
      ]
    },
    "<paperId of a paper in Semantic Scholar>": { ...... },
    "<paperId of a paper in Semantic Scholar>": { ...... },
    ......
  },
  "edges": [
    [
        "<paperId of a paper in Semantic Scholar>",
        "<paperId of a reference in the above paper>"
    ],
    [ ...... ],
    [ ...... ],
    ......
  ]
}
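A JSON summary in the format above can be post-processed with a few lines of Python. The helper below (`load_summary` is a hypothetical name, not part of the package) builds an adjacency mapping from the `nodes` and `edges` fields:

```python
import json

def load_summary(path):
    """Load a citation-crawler JSON summary into an adjacency mapping
    (paperId -> set of neighbouring paperIds). A minimal sketch that
    assumes the JSON format documented above."""
    with open(path) as f:
        data = json.load(f)
    # Start with every crawled paper, even those without edges yet
    adjacency = {pid: set() for pid in data["nodes"]}
    for src, dst in data["edges"]:
        # The crawler produces an undirected graph, so add both directions
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    return adjacency
```

From the adjacency mapping, `len(adjacency)` gives the number of papers and `len(adjacency[pid])` the degree (references plus citations found) of a paper.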
Write to a Neo4J database

docker pull neo4j
docker run --rm -it -p 7474:7474 -p 7687:7687 -v $(pwd)/save/neo4j:/data -e NEO4J_AUTH=none neo4j

e.g. write to neo4j://localhost:7687:

python -m citation_crawler -k video -k edge -p 27d5dc70280c8628f181a7f8881912025f808256 neo4j --uri neo4j://localhost:7687

Download files

Download the file for your platform.

Source Distribution

citation_crawler-1.5.tar.gz (12.2 kB view details)

Uploaded Source

File details

Details for the file citation_crawler-1.5.tar.gz.

File metadata

  • Download URL: citation_crawler-1.5.tar.gz
  • Upload date:
  • Size: 12.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for citation_crawler-1.5.tar.gz:

  • SHA256: f2e01bee3f61df3d3d1c1c24a24271f6da0e9e6040084fba890890b5b14b1e41
  • MD5: d902f2a8d21aa0d12a0ec881a1d4d1c2
  • BLAKE2b-256: 8a3ed910b0ff4d0bddfb548a58716c9292461681aa5f5d384bd3217026e4727d

