
PaperWeaver


PaperWeaver is a tool for weaving academic paper data from various sources (DBLP, Semantic Scholar) into graph databases (Neo4j). It uses BFS traversal to explore and collect papers, authors, venues, citations, and references, building a comprehensive academic knowledge graph.

Features

  • Multiple Data Sources

    • DBLP API - bibliographic information
    • Semantic Scholar API - citations and references
  • Graph Database Output

    • Neo4j - store papers, authors, venues and their relationships
  • Flexible Caching

    • In-memory cache for simple use cases
    • Redis cache for distributed and persistent caching
  • BFS Traversal

    • Start from authors, papers, or venues
    • Automatically discover related entities through citations, references, and authorship

Installation

pip install paper-weaver

Or install from source:

git clone https://github.com/yindaheng98/PaperWeaver.git
cd PaperWeaver
pip install -e .

Quick Start

Basic Usage

Start from an author and explore their papers and related venues:

paper-weaver \
  --init-mode authors \
  --init-dblp-pids h/KaimingHe \
  --datadst-neo4j-uri bolt://localhost:7687 \
  --datadst-neo4j-user neo4j \
  --datadst-neo4j-password your-password \
  -n 10 -v

Start from Papers

paper-weaver \
  --init-mode papers \
  --init-dblp-record-keys conf/cvpr/HeZRS16 journals/pami/HeZRS16 \
  --datadst-neo4j-uri bolt://localhost:7687 \
  -n 5 -v

Start from Venues

paper-weaver \
  --init-mode venues \
  --init-dblp-venue-keys db/conf/cvpr/cvpr2016 \
  --datadst-neo4j-uri bolt://localhost:7687 \
  -n 5 -v

Type System and How It Works

PaperWeaver is a typed, composable BFS pipeline. You can treat it as four pluggable layers:

  1. Domain Types (Paper, Author, Venue)

    • Each entity is a dataclass with identifiers: set[str].
    • The identifier set is the core identity model: if two objects share any identifier, cache and storage layers can merge them as one logical entity.
  2. Behavior Contracts

    • DataSrc: how to fetch info and neighbors (papers, authors, venues, references, citations).
    • DataDst: how to persist entity info and links.
    • WeaverCacheIface: how to remember fetched info, pending neighbors, and committed links.
    • WeaverInitializerIface: how to provide BFS seeds.
  3. Traversal Interfaces

    • Relation interfaces such as Author2PapersWeaverIface, Paper2AuthorsWeaverIface, Paper2VenuesWeaverIface, Venue2PapersWeaverIface, etc. all share one cached BFS step implementation.
    • Author2Paper2VenueWeaver composes multiple relation interfaces and runs them in sequence.
  4. Concrete Implementations

    • Data sources: DBLPDataSrc, SemanticScholarDataSrc
    • Destination: Neo4jDataDst
    • Cache: memory/redis/hybrid FullWeaverCache built by HybridCacheBuilder
    • Initializers: DBLP initializers for papers/authors/venues/venue-index
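The identifier-set identity model in layer 1 can be illustrated with a small self-contained sketch (hypothetical names; PaperWeaver's actual dataclasses and merge logic may differ):

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    # Each entity carries a set of identifiers from any source,
    # e.g. a DBLP key plus a Semantic Scholar ID for the same paper.
    identifiers: set[str] = field(default_factory=set)

def same_entity(a: Entity, b: Entity) -> bool:
    # Two records are the same logical entity if they share
    # at least one identifier.
    return not a.identifiers.isdisjoint(b.identifiers)

def merge(a: Entity, b: Entity) -> Entity:
    # Merging unions the identifier sets, so later lookups by
    # either source's ID resolve to the same logical node.
    return Entity(identifiers=a.identifiers | b.identifiers)

p1 = Entity({"dblp:conf/cvpr/HeZRS16"})
p2 = Entity({"dblp:conf/cvpr/HeZRS16", "s2:arXiv:1512.03385"})
merged = merge(p1, p2)
```

This is why the same weaver can consume DBLP and Semantic Scholar records: as long as the two records share one identifier, downstream layers treat them as one node.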

Runtime Logic (from types to behavior)

For each BFS step, the pipeline does:

  1. Resolve entity identity using identifiers (cache registry may merge aliases).
  2. Load parent info from cache; on miss, fetch from DataSrc, save to DataDst, then cache it.
  3. Load pending children from cache; on miss, fetch from DataSrc, register them in pending cache.
  4. For each child: fetch/save/cache child info if needed.
  5. Create link in DataDst only if commit cache says the link is new.
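The five steps above can be sketched with plain dictionaries standing in for the cache, source, and destination (schematic only; the real interfaces are async and split across several cache sub-interfaces):

```python
def bfs_step(parent_id, datasrc, datadst, cache):
    # Steps 1-2: load parent info from cache; on miss, fetch,
    # persist to the destination, then cache it.
    if parent_id not in cache["info"]:
        info = datasrc["fetch_info"](parent_id)
        datadst["saved"][parent_id] = info
        cache["info"][parent_id] = info
    # Step 3: load pending children; on miss, fetch neighbors
    # and register them in the pending cache.
    if parent_id not in cache["pending"]:
        cache["pending"][parent_id] = datasrc["fetch_neighbors"](parent_id)
    for child_id in cache["pending"][parent_id]:
        # Step 4: fetch/save/cache child info if needed.
        if child_id not in cache["info"]:
            child = datasrc["fetch_info"](child_id)
            datadst["saved"][child_id] = child
            cache["info"][child_id] = child
        # Step 5: create the link only if it was not committed before.
        link = (parent_id, child_id)
        if link not in cache["committed"]:
            datadst["links"].append(link)
            cache["committed"].add(link)
    return cache["pending"][parent_id]

datasrc = {
    "fetch_info": lambda i: {"id": i},
    "fetch_neighbors": lambda i: ["paper1", "paper2"],
}
datadst = {"saved": {}, "links": []}
cache = {"info": {}, "pending": {}, "committed": set()}
bfs_step("author1", datasrc, datadst, cache)
bfs_step("author1", datasrc, datadst, cache)  # re-run: no duplicate links
```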

This design gives:

  • low duplicate API calls (info/pending caching),
  • low duplicate writes (committed-link tracking),
  • stable cross-source identity (identifier merging).

How to Use This Model

  • CLI path: choose one implementation per layer (initializer + datasrc + cache + datadst), then run BFS.
  • Python API path: instantiate those same layer objects directly and call await weaver.bfs(...).

Both paths share the same architecture; only the implementations plugged into each layer change.

What You Can Extend

You can extend at any layer independently:

  • New DataSrc: implement the DataSrc abstract methods for your API.
  • New DataDst: implement DataDst to write into another graph/DB.
  • New cache backend: implement IdentifierRegistryIface, InfoStorageIface, PendingListStorageIface, CommittedLinkStorageIface, then compose with cache classes.
  • New initializer: implement one of fetch_papers(), fetch_authors(), or fetch_venues().
  • New traversal recipe: compose existing relation interfaces into a new weaver class.
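As an illustration of the initializer contract, here is a schematic stand-alone initializer. It deliberately does not import paper_weaver; the real WeaverInitializerIface is async and typed against the project's dataclasses:

```python
from abc import ABC, abstractmethod

class WeaverInitializer(ABC):
    # Schematic version of the initializer contract: a provider
    # of seeds for the BFS frontier.
    @abstractmethod
    def fetch_authors(self) -> list[str]: ...

class StaticAuthorsInitializer(WeaverInitializer):
    # Seeds BFS from a fixed list of author IDs, analogous to
    # DBLPAuthorsInitializer seeding from DBLP pids.
    def __init__(self, pids: list[str]):
        self.pids = pids

    def fetch_authors(self) -> list[str]:
        return list(self.pids)

init = StaticAuthorsInitializer(["h/KaimingHe"])
```

A new initializer only changes where BFS starts; traversal, caching, and storage are untouched.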

Built-in Extension Examples

  1. Switch data source without changing traversal logic

    • DBLPDataSrc provides strong bibliographic/venue coverage.
    • SemanticScholarDataSrc provides references/citations APIs.
    • Both satisfy DataSrc, so the same weaver can run with either source.
  2. Switch cache backend without changing source/destination

    • Use HybridCacheBuilder().with_all_memory() for local runs.
    • Use redis components (with_all_redis(...) or mixed with_memory_* + with_redis_*) for persistent/distributed runs.
    • Traversal code remains unchanged because all variants satisfy the same cache interfaces.
  3. Expand initializer granularity

    • DBLPAuthorsInitializer seeds by author pids.
    • DBLPVenueIndexInitializer expands a DBLP venue index into many venues before BFS.
    • Both plug into the same initializer contract and only change the BFS starting frontier.
  4. Create a custom combined weaver

    • Existing Author2Paper2VenueWeaver is a template: it composes relation interfaces and defines init() + bfs_once().
    • You can build another combination (for example including reference/citation traversal) by composing the corresponding interfaces with a compatible cache.
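Point 1 above is ordinary structural polymorphism. A stripped-down sketch (hypothetical names, not the real DataSrc interface) shows why traversal code need not change when the source does:

```python
from typing import Protocol

class PaperSource(Protocol):
    # Minimal stand-in for the DataSrc contract.
    def fetch_references(self, paper_id: str) -> list[str]: ...

class FakeDBLPSource:
    # DBLP-like source: strong bibliographic data, no reference API.
    def fetch_references(self, paper_id: str) -> list[str]:
        return []

class FakeSemanticScholarSource:
    # Semantic-Scholar-like source: exposes references/citations.
    def fetch_references(self, paper_id: str) -> list[str]:
        return ["ref-a", "ref-b"]

def crawl(src: PaperSource, paper_id: str) -> list[str]:
    # Traversal code is written against the protocol, so either
    # source plugs in unchanged.
    return src.fetch_references(paper_id)
```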

Command-Line Options

Weaver Options

| Option | Default | Description |
| --- | --- | --- |
| --weaver-type | a2p2v | Weaver type |
| -n, --max-iterations | 0 | Max BFS iterations (0 = run until no new data) |
| -v, --verbose | - | Increase verbosity (-v: INFO, -vv: DEBUG) |

Initialization Options

| Option | Default | Description |
| --- | --- | --- |
| --init-type | dblp | Initializer type |
| --init-mode | authors | Initialization mode: papers, authors, or venues |
| --init-dblp-record-keys | - | DBLP record keys (e.g., conf/cvpr/HeZRS16) |
| --init-dblp-pids | - | DBLP person IDs (e.g., h/KaimingHe) |
| --init-dblp-venue-keys | - | DBLP venue keys (e.g., db/conf/cvpr/cvpr2016) |

Data Source Options

| Option | Default | Description |
| --- | --- | --- |
| --datasrc-type | dblp | Data source: dblp or semanticscholar |
| --datasrc-cache-mode | memory | Cache backend: memory or redis |
| --datasrc-redis-url | redis://localhost:6379 | Redis URL for the data source cache |
| --datasrc-max-concurrent | 10 | Maximum concurrent HTTP requests |
| --datasrc-http-proxy | - | HTTP proxy URL |
| --datasrc-http-timeout | 30 | HTTP timeout in seconds |
| --datasrc-ss-api-key | - | Semantic Scholar API key |

Cache Options

| Option | Default | Description |
| --- | --- | --- |
| --cache-mode | memory | Cache backend: memory or redis |
| --cache-redis-url | redis://localhost:6379 | Default Redis URL |
| --cache-redis-prefix | paper-weaver-cache | Redis key prefix |

Neo4j Options

| Option | Default | Description |
| --- | --- | --- |
| --datadst-neo4j-uri | bolt://localhost:7687 | Neo4j connection URI |
| --datadst-neo4j-user | neo4j | Neo4j username |
| --datadst-neo4j-password | neo4j | Neo4j password |
| --datadst-neo4j-database | neo4j | Neo4j database name |

Using with Redis Cache

For large-scale crawling, use Redis for persistent caching:

paper-weaver \
  --init-mode authors \
  --init-dblp-pids h/KaimingHe \
  --cache-mode redis \
  --cache-redis-url redis://localhost:6379 \
  --datasrc-cache-mode redis \
  --datasrc-redis-url redis://localhost:6379 \
  --datadst-neo4j-uri bolt://localhost:7687 \
  -v

Graph Schema

PaperWeaver creates the following nodes and relationships in Neo4j:

Nodes

  • Paper: title, year, venue, doi, etc.
  • Author: name, pid, orcid, etc.
  • Venue: name, type (journal/proceedings/book)

Relationships

  • (Author)-[:AUTHORED]->(Paper)
  • (Paper)-[:PUBLISHED_IN]->(Venue)
  • (Paper)-[:CITES]->(Paper)
  • (Paper)-[:REFERENCES]->(Paper)
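The relationships above correspond to Cypher MERGE patterns along these lines (illustrative templates only; the exact statements Neo4jDataDst issues, and the node key property, are assumptions):

```python
# Illustrative Cypher templates for the schema above. Matching nodes
# by a single `key` property is an assumption for this sketch.
AUTHORED = (
    "MATCH (a:Author {key: $author}), (p:Paper {key: $paper}) "
    "MERGE (a)-[:AUTHORED]->(p)"
)
PUBLISHED_IN = (
    "MATCH (p:Paper {key: $paper}), (v:Venue {key: $venue}) "
    "MERGE (p)-[:PUBLISHED_IN]->(v)"
)
CITES = (
    "MATCH (a:Paper {key: $citing}), (b:Paper {key: $cited}) "
    "MERGE (a)-[:CITES]->(b)"
)
```

Using MERGE rather than CREATE keeps relationship writes idempotent, which matches the committed-link tracking in the cache layer.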

Python API

import asyncio
from paper_weaver import Author2Paper2VenueWeaver
from paper_weaver.datasrc.dblp import DBLPDataSrc
from paper_weaver.datadst.neo4j import Neo4jDataDst
from paper_weaver.cache import HybridCacheBuilder
from paper_weaver.initializer.dblp import DBLPAuthorsInitializer

async def main():
    # Setup components
    datasrc = DBLPDataSrc()
    cache = HybridCacheBuilder().with_all_memory().build_weaver_cache()
    initializer = DBLPAuthorsInitializer(["h/KaimingHe"])
    
    # Setup Neo4j
    from neo4j import AsyncGraphDatabase
    driver = AsyncGraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    session = driver.session(database="neo4j")
    datadst = Neo4jDataDst(session)
    
    # Create weaver and run
    weaver = Author2Paper2VenueWeaver(
        src=datasrc,
        dst=datadst,
        cache=cache,
        initializer=initializer
    )
    
    total = await weaver.bfs(max_iterations=10)
    print(f"Processed {total} items")
    
    await session.close()
    await driver.close()

asyncio.run(main())

Requirements

  • Python 3.10+
  • Neo4j 4.0+ (for graph storage)
  • Redis (optional, for distributed caching)

License

MIT License
