Weave academic paper data from various sources (DBLP, Semantic Scholar) into graph databases.

PaperWeaver

PaperWeaver is a tool for weaving academic paper data from various sources (DBLP, Semantic Scholar) into a graph database (Neo4j). It uses breadth-first search (BFS) to explore and collect papers, authors, venues, citations, and references, building an academic knowledge graph as it goes.

Features

  • Multiple Data Sources

    • DBLP API - bibliographic information
    • Semantic Scholar API - citations and references
  • Graph Database Output

    • Neo4j - store papers, authors, venues and their relationships
  • Flexible Caching

    • In-memory cache for simple use cases
    • Redis cache for distributed and persistent caching
  • BFS Traversal

    • Start from authors, papers, or venues
    • Automatically discover related entities through citations, references, and authorship
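To make the traversal idea concrete, here is a minimal BFS sketch over a toy graph. It is not PaperWeaver's implementation: the entity names, the `bfs` function, and the reading of one "iteration" as one BFS level are assumptions made for illustration. The `max_iterations=0` convention mirrors the `-n` option (0 = until no new data).

```python
from collections import deque

# Toy graph: each entity maps to the entities reachable from it.
# All entity names here are made up for illustration.
TOY_GRAPH = {
    "author:A": ["paper:P1", "paper:P2"],
    "paper:P1": ["venue:V1", "paper:P3"],
    "paper:P2": ["venue:V1"],
    "paper:P3": ["venue:V2"],
    "venue:V1": [],
    "venue:V2": [],
}


def toy_neighbors(entity):
    """Look up the entities directly linked to `entity` in the toy graph."""
    return TOY_GRAPH.get(entity, [])


def bfs(seeds, neighbors, max_iterations=0):
    """Visit entities level by level; 0 means run until nothing new appears."""
    visited = set(seeds)
    frontier = deque(seeds)
    iteration = 0
    while frontier and (max_iterations == 0 or iteration < max_iterations):
        next_frontier = deque()
        for entity in frontier:
            for found in neighbors(entity):
                if found not in visited:
                    visited.add(found)
                    next_frontier.append(found)
        frontier = next_frontier
        iteration += 1
    return visited


if __name__ == "__main__":
    # One level from the author reaches only that author's papers.
    print(sorted(bfs(["author:A"], toy_neighbors, max_iterations=1)))
    # Unbounded traversal reaches every entity connected to the seed.
    print(sorted(bfs(["author:A"], toy_neighbors)))
```

The visited set is what keeps a traversal like this from re-fetching an entity that is reachable through several paths, for example a venue shared by two papers.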

Installation

pip install paper-weaver

Or install from source:

git clone https://github.com/yindaheng98/PaperWeaver.git
cd PaperWeaver
pip install -e .

Quick Start

Basic Usage

Start from an author and explore their papers and related venues:

paper-weaver \
  --init-mode authors \
  --init-dblp-pids h/KaimingHe \
  --datadst-neo4j-uri bolt://localhost:7687 \
  --datadst-neo4j-user neo4j \
  --datadst-neo4j-password your-password \
  -n 10 -v

Start from Papers

paper-weaver \
  --init-mode papers \
  --init-dblp-record-keys conf/cvpr/HeZRS16 journals/pami/HeZRS16 \
  --datadst-neo4j-uri bolt://localhost:7687 \
  -n 5 -v

Start from Venues

paper-weaver \
  --init-mode venues \
  --init-dblp-venue-keys db/conf/cvpr/cvpr2016 \
  --datadst-neo4j-uri bolt://localhost:7687 \
  -n 5 -v
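The three starting modes above differ only in the seed flag that accompanies `--init-mode`. When scripting several runs, that mapping can be captured in a small helper. The sketch below is a hypothetical convenience wrapper, not part of PaperWeaver; `build_command` and `SEED_FLAGS` are names introduced here, and only flags shown in the examples above are used.

```python
import shlex

# Seed flag for each init mode, as used in the examples above.
SEED_FLAGS = {
    "authors": "--init-dblp-pids",
    "papers": "--init-dblp-record-keys",
    "venues": "--init-dblp-venue-keys",
}


def build_command(mode, seeds, neo4j_uri="bolt://localhost:7687", max_iterations=5):
    """Return the paper-weaver argument list for one run (hypothetical helper)."""
    if mode not in SEED_FLAGS:
        raise ValueError(f"unknown init mode: {mode!r}")
    if not seeds:
        raise ValueError("at least one seed is required")
    return [
        "paper-weaver",
        "--init-mode", mode,
        SEED_FLAGS[mode], *seeds,
        "--datadst-neo4j-uri", neo4j_uri,
        "-n", str(max_iterations),
        "-v",
    ]


if __name__ == "__main__":
    command = build_command("papers", ["conf/cvpr/HeZRS16", "journals/pami/HeZRS16"])
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(command))
    # To actually run it: subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if a key ever contains unusual characters.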

Command-Line Options

Weaver Options

| Option | Default | Description |
| --- | --- | --- |
| --weaver-type | a2p2v | Weaver type |
| -n, --max-iterations | 0 | Max BFS iterations (0 = until no new data) |
| -v, --verbose | - | Increase verbosity (-v: INFO, -vv: DEBUG) |

Initialization Options

| Option | Default | Description |
| --- | --- | --- |
| --init-type | dblp | Initializer type |
| --init-mode | authors | Initialization mode: papers, authors, or venues |
| --init-dblp-record-keys | - | DBLP record keys (e.g., conf/cvpr/HeZRS16) |
| --init-dblp-pids | - | DBLP person IDs (e.g., h/KaimingHe) |
| --init-dblp-venue-keys | - | DBLP venue keys (e.g., db/conf/cvpr/cvpr2016) |

Data Source Options

| Option | Default | Description |
| --- | --- | --- |
| --datasrc-type | dblp | Data source: dblp or semanticscholar |
| --datasrc-cache-mode | memory | Cache backend: memory or redis |
| --datasrc-redis-url | redis://localhost:6379 | Redis URL for data source cache |
| --datasrc-max-concurrent | 10 | Maximum concurrent HTTP requests |
| --datasrc-http-proxy | - | HTTP proxy URL |
| --datasrc-http-timeout | 30 | HTTP timeout in seconds |
| --datasrc-ss-api-key | - | Semantic Scholar API key |
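The `--datasrc-max-concurrent` option caps how many HTTP requests are in flight at once. One common way to enforce such a cap in asyncio code is a semaphore. The sketch below shows the general technique and is not taken from PaperWeaver's source; `fetch_all`, `fake_fetch`, and the example.org URLs are placeholders introduced for illustration.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrent=10):
    """Run fetch(url) for every URL with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(url):
        # Each task waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(limited(url) for url in urls))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_fetch(url):
        # Stand-in for an HTTP request; records how many calls overlap.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return f"response for {url}"

    urls = [f"https://example.org/item/{i}" for i in range(25)]
    results = await fetch_all(urls, fake_fetch, max_concurrent=10)
    return len(results), peak


if __name__ == "__main__":
    count, peak = asyncio.run(demo())
    print(f"fetched {count} items, peak concurrency {peak}")
```

A lower limit is gentler on rate-limited public APIs, at the cost of a slower crawl.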

Cache Options

| Option | Default | Description |
| --- | --- | --- |
| --cache-mode | memory | Cache backend: memory or redis |
| --cache-redis-url | redis://localhost:6379 | Default Redis URL |
| --cache-redis-prefix | paper-weaver-cache | Redis key prefix |
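A key prefix lets several applications share one Redis instance without their keys colliding. The sketch below illustrates that idea with a plain dictionary standing in for Redis. `PrefixedMemoryCache` and the `:` separator are assumptions for illustration and do not describe PaperWeaver's actual cache classes or key layout.

```python
class PrefixedMemoryCache:
    """Dictionary-backed cache whose keys are namespaced by a prefix."""

    def __init__(self, prefix="paper-weaver-cache", store=None):
        self.prefix = prefix
        # A shared store stands in for one Redis instance used by several tools.
        self.store = store if store is not None else {}

    def _key(self, key):
        # The ":" separator is an assumption, chosen because it is common in Redis keys.
        return f"{self.prefix}:{key}"

    def get(self, key, default=None):
        return self.store.get(self._key(key), default)

    def set(self, key, value):
        self.store[self._key(key)] = value


if __name__ == "__main__":
    shared = {}
    weaver_cache = PrefixedMemoryCache("paper-weaver-cache", shared)
    other_cache = PrefixedMemoryCache("another-app", shared)

    weaver_cache.set("author:h/KaimingHe", {"visited": True})
    # The second namespace cannot see the first one's entry.
    print(other_cache.get("author:h/KaimingHe"))
    print(sorted(shared))
```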

Neo4j Options

| Option | Default | Description |
| --- | --- | --- |
| --datadst-neo4j-uri | bolt://localhost:7687 | Neo4j connection URI |
| --datadst-neo4j-user | neo4j | Neo4j username |
| --datadst-neo4j-password | neo4j | Neo4j password |
| --datadst-neo4j-database | neo4j | Neo4j database name |

Using with Redis Cache

For large-scale crawling, use Redis for persistent caching:

paper-weaver \
  --init-mode authors \
  --init-dblp-pids h/KaimingHe \
  --cache-mode redis \
  --cache-redis-url redis://localhost:6379 \
  --datasrc-cache-mode redis \
  --datasrc-redis-url redis://localhost:6379 \
  --datadst-neo4j-uri bolt://localhost:7687 \
  -v

Graph Schema

PaperWeaver creates the following nodes and relationships in Neo4j:

Nodes

  • Paper: title, year, venue, doi, etc.
  • Author: name, pid, orcid, etc.
  • Venue: name, type (journal/proceedings/book)

Relationships

  • (Author)-[:AUTHORED]->(Paper)
  • (Paper)-[:PUBLISHED_IN]->(Venue)
  • (Paper)-[:CITES]->(Paper)
  • (Paper)-[:REFERENCES]->(Paper)
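Once a graph with this schema exists, it can be queried with Cypher. The sketch below builds two parameterized queries from the labels and relationship types listed above. The helper names are hypothetical, and the property names (`pid`, `name`, `title`, `year`) are taken from the node list; the properties actually stored may differ between versions, so treat these as starting points.

```python
# List one author's papers together with the venue, if one is linked.
PAPERS_BY_AUTHOR = (
    "MATCH (a:Author {pid: $pid})-[:AUTHORED]->(p:Paper) "
    "OPTIONAL MATCH (p)-[:PUBLISHED_IN]->(v:Venue) "
    "RETURN p.title AS title, p.year AS year, v.name AS venue "
    "ORDER BY p.year DESC"
)

# Find co-authors by walking AUTHORED edges through shared papers.
COAUTHORS = (
    "MATCH (a:Author {pid: $pid})-[:AUTHORED]->(:Paper)<-[:AUTHORED]-(co:Author) "
    "WHERE co <> a "
    "RETURN co.name AS name, count(*) AS shared_papers "
    "ORDER BY shared_papers DESC LIMIT $limit"
)


def papers_by_author(pid):
    """Return (query, parameters) for listing one author's papers."""
    return PAPERS_BY_AUTHOR, {"pid": pid}


def coauthors(pid, limit=10):
    """Return (query, parameters) for an author's most frequent co-authors."""
    return COAUTHORS, {"pid": pid, "limit": int(limit)}


if __name__ == "__main__":
    query, params = papers_by_author("h/KaimingHe")
    print(query)
    print(params)
    # With the official neo4j Python driver, this pair could be passed as
    # session.run(query, params).
```

Passing values as parameters rather than formatting them into the query string avoids injection problems and lets Neo4j reuse the query plan.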

Python API

import asyncio

from neo4j import AsyncGraphDatabase

from paper_weaver import Author2Paper2VenueWeaver
from paper_weaver.datasrc.dblp import DBLPDataSrc
from paper_weaver.datadst.neo4j import Neo4jDataDst
from paper_weaver.cache import HybridCacheBuilder
from paper_weaver.initializer.dblp import DBLPAuthorsInitializer


async def main():
    # Set up the data source, the in-memory cache, and the starting author
    datasrc = DBLPDataSrc()
    cache = HybridCacheBuilder().with_all_memory().build_weaver_cache()
    initializer = DBLPAuthorsInitializer(["h/KaimingHe"])

    # Connect to Neo4j and open a session on the target database
    driver = AsyncGraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    session = driver.session(database="neo4j")
    try:
        datadst = Neo4jDataDst(session)

        # Create the weaver and run the BFS traversal
        weaver = Author2Paper2VenueWeaver(
            src=datasrc,
            dst=datadst,
            cache=cache,
            initializer=initializer,
        )
        total = await weaver.bfs(max_iterations=10)
        print(f"Processed {total} items")
    finally:
        # Release the session and driver even if the traversal fails
        await session.close()
        await driver.close()


asyncio.run(main())

Requirements

  • Python 3.10+
  • Neo4j 4.0+ (for graph storage)
  • Redis (optional, for distributed caching)

License

MIT License
