Skip to main content

Weave academic paper data from various sources (DBLP, Semantic Scholar) into graph databases.

Project description

PaperWeaver

PyPI version Python 3.10+

PaperWeaver is a tool for weaving academic paper data from various sources (DBLP, Semantic Scholar) into graph databases (Neo4j). It uses BFS traversal to explore and collect papers, authors, venues, citations, and references, building a comprehensive academic knowledge graph.

Features

  • Multiple Data Sources

    • DBLP API - bibliographic information
    • Semantic Scholar API - citations and references
  • Graph Database Output

    • Neo4j - store papers, authors, venues and their relationships
  • Flexible Caching

    • In-memory cache for simple use cases
    • Redis cache for distributed and persistent caching
  • BFS Traversal

    • Start from authors, papers, or venues
    • Automatically discover related entities through citations, references, and authorship

Installation

pip install paper-weaver

Or install from source:

git clone https://github.com/yindaheng98/PaperWeaver.git
cd PaperWeaver
pip install -e .

Quick Start

Basic Usage

Start from an author and explore their papers and related venues:

paper-weaver \
  --init-mode authors \
  --init-dblp-pids h/KaimingHe \
  --datadst-neo4j-uri bolt://localhost:7687 \
  --datadst-neo4j-user neo4j \
  --datadst-neo4j-password your-password \
  -n 10 -v

Start from Papers

paper-weaver \
  --init-mode papers \
  --init-dblp-record-keys conf/cvpr/HeZRS16 journals/pami/HeZRS16 \
  --datadst-neo4j-uri bolt://localhost:7687 \
  -n 5 -v

Start from Venues

paper-weaver \
  --init-mode venues \
  --init-dblp-venue-keys db/conf/cvpr/cvpr2016 \
  --datadst-neo4j-uri bolt://localhost:7687 \
  -n 5 -v

Command-Line Options

Weaver Options

Option Default Description
--weaver-type a2p2v Weaver type
-n, --max-iterations 0 Max BFS iterations (0 = until no new data)
-v, --verbose - Increase verbosity (-v: INFO, -vv: DEBUG)

Initialization Options

Option Default Description
--init-type dblp Initializer type
--init-mode authors Initialization mode: papers, authors, or venues
--init-dblp-record-keys - DBLP record keys (e.g., conf/cvpr/HeZRS16)
--init-dblp-pids - DBLP person IDs (e.g., h/KaimingHe)
--init-dblp-venue-keys - DBLP venue keys (e.g., db/conf/cvpr/cvpr2016)

Data Source Options

Option Default Description
--datasrc-type dblp Data source: dblp or semanticscholar
--datasrc-cache-mode memory Cache backend: memory or redis
--datasrc-redis-url redis://localhost:6379 Redis URL for data source cache
--datasrc-max-concurrent 10 Maximum concurrent HTTP requests
--datasrc-http-proxy - HTTP proxy URL
--datasrc-http-timeout 30 HTTP timeout in seconds
--datasrc-ss-api-key - Semantic Scholar API key

Cache Options

Option Default Description
--cache-mode memory Cache backend: memory or redis
--cache-redis-url redis://localhost:6379 Default Redis URL
--cache-redis-prefix paper-weaver-cache Redis key prefix

Neo4j Options

Option Default Description
--datadst-neo4j-uri bolt://localhost:7687 Neo4j connection URI
--datadst-neo4j-user neo4j Neo4j username
--datadst-neo4j-password neo4j Neo4j password
--datadst-neo4j-database neo4j Neo4j database name

Using with Redis Cache

For large-scale crawling, use Redis for persistent caching:

paper-weaver \
  --init-mode authors \
  --init-dblp-pids h/KaimingHe \
  --cache-mode redis \
  --cache-redis-url redis://localhost:6379 \
  --datasrc-cache-mode redis \
  --datasrc-redis-url redis://localhost:6379 \
  --datadst-neo4j-uri bolt://localhost:7687 \
  -v

Graph Schema

PaperWeaver creates the following nodes and relationships in Neo4j:

Nodes

  • Paper: title, year, venue, doi, etc.
  • Author: name, pid, orcid, etc.
  • Venue: name, type (journal/proceedings/book)

Relationships

  • (Author)-[:AUTHORED]->(Paper)
  • (Paper)-[:PUBLISHED_IN]->(Venue)
  • (Paper)-[:CITES]->(Paper)
  • (Paper)-[:REFERENCES]->(Paper)

Python API

import asyncio
from paper_weaver import Author2Paper2VenueWeaver
from paper_weaver.datasrc.dblp import DBLPDataSrc
from paper_weaver.datadst.neo4j import Neo4jDataDst
from paper_weaver.cache import HybridCacheBuilder
from paper_weaver.initializer.dblp import DBLPAuthorsInitializer

async def main():
    # Setup components
    datasrc = DBLPDataSrc()
    cache = HybridCacheBuilder().with_all_memory().build_weaver_cache()
    initializer = DBLPAuthorsInitializer(["h/KaimingHe"])
    
    # Setup Neo4j
    from neo4j import AsyncGraphDatabase
    driver = AsyncGraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    session = driver.session(database="neo4j")
    datadst = Neo4jDataDst(session)
    
    # Create weaver and run
    weaver = Author2Paper2VenueWeaver(
        src=datasrc,
        dst=datadst,
        cache=cache,
        initializer=initializer
    )
    
    total = await weaver.bfs(max_iterations=10)
    print(f"Processed {total} items")
    
    await driver.close()

asyncio.run(main())

Requirements

  • Python 3.10+
  • Neo4j 4.0+ (for graph storage)
  • Redis (optional, for distributed caching)

License

MIT License

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

paper_weaver-1.1.4.tar.gz (48.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

paper_weaver-1.1.4-py3-none-any.whl (76.3 kB view details)

Uploaded Python 3

File details

Details for the file paper_weaver-1.1.4.tar.gz.

File metadata

  • Download URL: paper_weaver-1.1.4.tar.gz
  • Upload date:
  • Size: 48.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for paper_weaver-1.1.4.tar.gz
Algorithm Hash digest
SHA256 d98067622cca2c220c6909ebc804d03c45729085fb72328031b30b5e65166c8b
MD5 7c2acf85ddf72d0dce1b436b4e1fff29
BLAKE2b-256 495fa37090303dd08a4bcee68cff4bf2496c82c239b6d89e3b4899ee7ec2af0a

See more details on using hashes here.

File details

Details for the file paper_weaver-1.1.4-py3-none-any.whl.

File metadata

  • Download URL: paper_weaver-1.1.4-py3-none-any.whl
  • Upload date:
  • Size: 76.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for paper_weaver-1.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 d61537055dd85472e975df0225fb23945f64d795d80cdd19747acbee3ca4def0
MD5 fb86719f24c63654285354690aa5a31a
BLAKE2b-256 7b66a29de1bc08da7cac6737332cf15c072c9dac19431e39bc031f569c975ef5

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page