
Weave academic paper data from various sources (DBLP, Semantic Scholar) into graph databases.


PaperWeaver


PaperWeaver is a tool for weaving academic paper data from various sources (DBLP, Semantic Scholar) into a graph database (Neo4j). It uses breadth-first search (BFS) to explore and collect papers, authors, venues, citations, and references, building an academic knowledge graph from the entities it reaches.

Features

  • Multiple Data Sources

    • DBLP API - bibliographic information
    • Semantic Scholar API - citations and references
  • Graph Database Output

    • Neo4j - store papers, authors, venues and their relationships
  • Flexible Caching

    • In-memory cache for simple use cases
    • Redis cache for distributed and persistent caching
  • BFS Traversal

    • Start from authors, papers, or venues
    • Automatically discover related entities through citations, references, and authorship
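
The BFS traversal listed above can be sketched roughly as follows. This is a minimal illustration over an in-memory toy graph, not PaperWeaver's implementation: the `bfs` function, the `TOY_GRAPH` mapping, and the node names are all hypothetical. It follows the CLI convention that `max_iterations=0` means "run until no new data".

```python
from collections.abc import Iterable, Mapping


def bfs(
    seeds: Iterable[str],
    neighbors: Mapping[str, list[str]],
    max_iterations: int = 0,
) -> list[str]:
    """Level-by-level BFS; returns nodes in the order they were discovered.

    max_iterations=0 means "keep going until no new nodes appear".
    """
    visited: list[str] = []
    seen: set[str] = set()
    frontier: list[str] = []
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            visited.append(seed)
            frontier.append(seed)

    iteration = 0
    while frontier and (max_iterations == 0 or iteration < max_iterations):
        next_frontier: list[str] = []
        for node in frontier:
            for nxt in neighbors.get(node, []):
                if nxt not in seen:  # each entity is expanded at most once
                    seen.add(nxt)
                    visited.append(nxt)
                    next_frontier.append(nxt)
        frontier = next_frontier
        iteration += 1
    return visited


# Hypothetical toy graph: an author links to papers, papers link to
# venues and to cited papers.
TOY_GRAPH = {
    "author:A": ["paper:P1", "paper:P2"],
    "paper:P1": ["venue:V1", "paper:P3"],
    "paper:P2": ["venue:V1"],
    "paper:P3": ["venue:V2"],
}
```

Calling `bfs(["author:A"], TOY_GRAPH, max_iterations=1)` stops after expanding the seed author, while `max_iterations=0` keeps expanding until the frontier is empty.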

Installation

pip install paper-weaver

Or install from source:

git clone https://github.com/yindaheng98/PaperWeaver.git
cd PaperWeaver
pip install -e .

Quick Start

Basic Usage

Start from an author and explore their papers and related venues:

paper-weaver \
  --init-mode authors \
  --init-dblp-pids h/KaimingHe \
  --datadst-neo4j-uri bolt://localhost:7687 \
  --datadst-neo4j-user neo4j \
  --datadst-neo4j-password your-password \
  -n 10 -v

Start from Papers

paper-weaver \
  --init-mode papers \
  --init-dblp-record-keys conf/cvpr/HeZRS16 journals/pami/HeZRS16 \
  --datadst-neo4j-uri bolt://localhost:7687 \
  -n 5 -v

Start from Venues

paper-weaver \
  --init-mode venues \
  --init-dblp-venue-keys db/conf/cvpr/cvpr2016 \
  --datadst-neo4j-uri bolt://localhost:7687 \
  -n 5 -v

Command-Line Options

Weaver Options

| Option | Default | Description |
|---|---|---|
| --weaver-type | a2p2v | Weaver type |
| -n, --max-iterations | 0 | Max BFS iterations (0 = until no new data) |
| -v, --verbose | - | Increase verbosity (-v: INFO, -vv: DEBUG) |
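
A repeatable `-v` flag like the one above is commonly implemented with argparse's `count` action. The sketch below shows that pattern; it is not taken from PaperWeaver's source. The function names are hypothetical, and the `WARNING` level used when no `-v` is given is an assumption, since the table only documents `-v` and `-vv`.

```python
import argparse
import logging


def verbosity_to_level(count: int) -> int:
    """Map the number of -v flags to a logging level."""
    if count >= 2:
        return logging.DEBUG
    if count == 1:
        return logging.INFO
    return logging.WARNING  # assumed default; not documented above


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("-n", "--max-iterations", type=int, default=0)
    # action="count" makes -v yield 1 and -vv yield 2
    parser.add_argument("-v", "--verbose", action="count", default=0)
    return parser


args = build_parser().parse_args(["-n", "10", "-vv"])
level = verbosity_to_level(args.verbose)
logging.basicConfig(level=level)
```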

Initialization Options

| Option | Default | Description |
|---|---|---|
| --init-type | dblp | Initializer type |
| --init-mode | authors | Initialization mode: papers, authors, or venues |
| --init-dblp-record-keys | - | DBLP record keys (e.g., conf/cvpr/HeZRS16) |
| --init-dblp-pids | - | DBLP person IDs (e.g., h/KaimingHe) |
| --init-dblp-venue-keys | - | DBLP venue keys (e.g., db/conf/cvpr/cvpr2016) |

Data Source Options

| Option | Default | Description |
|---|---|---|
| --datasrc-type | dblp | Data source: dblp or semanticscholar |
| --datasrc-cache-mode | memory | Cache backend: memory or redis |
| --datasrc-redis-url | redis://localhost:6379 | Redis URL for data source cache |
| --datasrc-max-concurrent | 10 | Maximum concurrent HTTP requests |
| --datasrc-http-proxy | - | HTTP proxy URL |
| --datasrc-http-timeout | 30 | HTTP timeout in seconds |
| --datasrc-ss-api-key | - | Semantic Scholar API key |
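
To illustrate what a concurrency cap and a per-request timeout mean in practice, here is a minimal asyncio sketch. It is not PaperWeaver's code: `fetch_all`, `fake_fetch`, and `demo` are hypothetical names, and the fake fetcher stands in for a real HTTP call. The defaults mirror the table above (10 concurrent requests, 30 seconds).

```python
import asyncio
from collections.abc import Awaitable, Callable


async def fetch_all(
    keys: list[str],
    fetch_one: Callable[[str], Awaitable[str]],
    max_concurrent: int = 10,
    timeout: float = 30.0,
) -> list[str]:
    """Run fetch_one for every key with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(key: str) -> str:
        async with semaphore:  # blocks while max_concurrent calls are running
            return await asyncio.wait_for(fetch_one(key), timeout)

    # gather preserves the order of the input keys
    return await asyncio.gather(*(guarded(k) for k in keys))


async def demo() -> tuple[list[str], int]:
    in_flight = 0
    peak = 0

    async def fake_fetch(key: str) -> str:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)  # record the highest observed concurrency
        await asyncio.sleep(0.01)
        in_flight -= 1
        return key.upper()

    keys = [f"k{i}" for i in range(20)]
    results = await fetch_all(keys, fake_fetch, max_concurrent=3)
    return results, peak
```

Running `asyncio.run(demo())` returns the uppercased keys in input order, with the observed peak concurrency never exceeding the limit of 3.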

Cache Options

| Option | Default | Description |
|---|---|---|
| --cache-mode | memory | Cache backend: memory or redis |
| --cache-redis-url | redis://localhost:6379 | Default Redis URL |
| --cache-redis-prefix | paper-weaver-cache | Redis key prefix |

Neo4j Options

| Option | Default | Description |
|---|---|---|
| --datadst-neo4j-uri | bolt://localhost:7687 | Neo4j connection URI |
| --datadst-neo4j-user | neo4j | Neo4j username |
| --datadst-neo4j-password | neo4j | Neo4j password |
| --datadst-neo4j-database | neo4j | Neo4j database name |

Using with Redis Cache

For large-scale crawling, use Redis for persistent caching:

paper-weaver \
  --init-mode authors \
  --init-dblp-pids h/KaimingHe \
  --cache-mode redis \
  --cache-redis-url redis://localhost:6379 \
  --datasrc-cache-mode redis \
  --datasrc-redis-url redis://localhost:6379 \
  --datadst-neo4j-uri bolt://localhost:7687 \
  -v

Graph Schema

PaperWeaver creates the following nodes and relationships in Neo4j:

Nodes

  • Paper: title, year, venue, doi, etc.
  • Author: name, pid, orcid, etc.
  • Venue: name, type (journal/proceedings/book)

Relationships

  • (Author)-[:AUTHORED]->(Paper)
  • (Paper)-[:PUBLISHED_IN]->(Venue)
  • (Paper)-[:CITES]->(Paper)
  • (Paper)-[:REFERENCES]->(Paper)
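
As a rough illustration of the schema above, the following sketch builds idempotent Cypher `MERGE` statements for each relationship pattern. These are not necessarily the queries PaperWeaver issues: the `merge_relationship` helper and the `id` property used to match nodes are assumptions. Labels and relationship types are interpolated into the query text (after a basic identifier check), while node keys are left as `$src` and `$dst` parameters.

```python
# (source label, relationship type, destination label), as listed above
SCHEMA = [
    ("Author", "AUTHORED", "Paper"),
    ("Paper", "PUBLISHED_IN", "Venue"),
    ("Paper", "CITES", "Paper"),
    ("Paper", "REFERENCES", "Paper"),
]


def merge_relationship(
    src_label: str, rel_type: str, dst_label: str, key: str = "id"
) -> str:
    """Return a Cypher statement that creates both nodes and the edge if absent.

    The `key` property name is a hypothetical choice for this sketch.
    """
    for name in (src_label, rel_type, dst_label, key):
        if not name.isidentifier():  # guard against injection via names
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"MERGE (a:{src_label} {{{key}: $src}}) "
        f"MERGE (b:{dst_label} {{{key}: $dst}}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )


QUERIES = [merge_relationship(*pattern) for pattern in SCHEMA]
```

Each resulting string could be passed to a Neo4j session together with `src` and `dst` parameter values; using `MERGE` rather than `CREATE` keeps repeated runs from duplicating nodes or edges.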

Python API

import asyncio

from neo4j import AsyncGraphDatabase

from paper_weaver import Author2Paper2VenueWeaver
from paper_weaver.datasrc.dblp import DBLPDataSrc
from paper_weaver.datadst.neo4j import Neo4jDataDst
from paper_weaver.cache import HybridCacheBuilder
from paper_weaver.initializer.dblp import DBLPAuthorsInitializer


async def main():
    # Set up the data source, cache, and initial BFS seeds
    datasrc = DBLPDataSrc()
    cache = HybridCacheBuilder().with_all_memory().build_weaver_cache()
    initializer = DBLPAuthorsInitializer(["h/KaimingHe"])

    # Connect to Neo4j
    driver = AsyncGraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    session = driver.session(database="neo4j")
    try:
        datadst = Neo4jDataDst(session)

        # Create the weaver and run up to 10 BFS iterations
        weaver = Author2Paper2VenueWeaver(
            src=datasrc,
            dst=datadst,
            cache=cache,
            initializer=initializer,
        )
        total = await weaver.bfs(max_iterations=10)
        print(f"Processed {total} items")
    finally:
        # Release the session and driver even if the crawl fails
        await session.close()
        await driver.close()


if __name__ == "__main__":
    asyncio.run(main())

Requirements

  • Python 3.10+
  • Neo4j 4.0+ (for graph storage)
  • Redis (optional, for distributed caching)

License

MIT License
