# PaperWeaver

PaperWeaver is a tool for weaving academic paper data from various sources (DBLP, Semantic Scholar) into graph databases (Neo4j). It uses BFS traversal to explore and collect papers, authors, venues, citations, and references, building a comprehensive academic knowledge graph.
## Features

- **Multiple Data Sources**
  - DBLP API - bibliographic information
  - Semantic Scholar API - citations and references
- **Graph Database Output**
  - Neo4j - store papers, authors, venues, and their relationships
- **Flexible Caching**
  - In-memory cache for simple use cases
  - Redis cache for distributed and persistent caching
- **BFS Traversal**
  - Start from authors, papers, or venues
  - Automatically discover related entities through citations, references, and authorship
## Installation

```shell
pip install paper-weaver
```

Or install from source:

```shell
git clone https://github.com/yindaheng98/PaperWeaver.git
cd PaperWeaver
pip install -e .
```
## Quick Start

### Basic Usage

Start from an author and explore their papers and related venues:

```shell
paper-weaver \
  --init-mode authors \
  --init-dblp-pids h/KaimingHe \
  --datadst-neo4j-uri bolt://localhost:7687 \
  --datadst-neo4j-user neo4j \
  --datadst-neo4j-password your-password \
  -n 10 -v
```

### Start from Papers

```shell
paper-weaver \
  --init-mode papers \
  --init-dblp-record-keys conf/cvpr/HeZRS16 journals/pami/HeZRS16 \
  --datadst-neo4j-uri bolt://localhost:7687 \
  -n 5 -v
```

### Start from Venues

```shell
paper-weaver \
  --init-mode venues \
  --init-dblp-venue-keys db/conf/cvpr/cvpr2016 \
  --datadst-neo4j-uri bolt://localhost:7687 \
  -n 5 -v
```
## Type System and How It Works

PaperWeaver is a typed, composable BFS pipeline. You can treat it as four pluggable layers:

- **Domain Types** (`Paper`, `Author`, `Venue`)
  - Each entity is a dataclass with `identifiers: set[str]`.
  - The identifier set is the core identity model: if two objects share any identifier, cache and storage layers can merge them as one logical entity.
- **Behavior Contracts**
  - `DataSrc`: how to fetch info and neighbors (papers, authors, venues, references, citations).
  - `DataDst`: how to persist entity info and links.
  - `WeaverCacheIface`: how to remember fetched info, pending neighbors, and committed links.
  - `WeaverInitializerIface`: how to provide BFS seeds.
- **Traversal Interfaces**
  - Relation interfaces such as `Author2PapersWeaverIface`, `Paper2AuthorsWeaverIface`, `Paper2VenuesWeaverIface`, `Venue2PapersWeaverIface`, etc. all share one cached BFS step implementation.
  - `Author2Paper2VenueWeaver` composes multiple relation interfaces and runs them in sequence.
- **Concrete Implementations**
  - Data sources: `DBLPDataSrc`, `SemanticScholarDataSrc`
  - Destination: `Neo4jDataDst`
  - Cache: memory/redis/hybrid `FullWeaverCache` built by `HybridCacheBuilder`
  - Initializers: DBLP initializers for papers/authors/venues/venue-index
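The identifier-set identity model can be sketched with a minimal stand-in dataclass (the real domain types carry more fields; `same_entity` and `merge` here are illustrative helpers, not the library's API):

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    # Two records that share ANY identifier are the same logical entity.
    identifiers: set[str] = field(default_factory=set)


def same_entity(a: Entity, b: Entity) -> bool:
    """Records refer to one entity if their identifier sets intersect."""
    return bool(a.identifiers & b.identifiers)


def merge(a: Entity, b: Entity) -> Entity:
    """Merging unions the identifier sets into one record."""
    return Entity(identifiers=a.identifiers | b.identifiers)


# The same paper seen through DBLP and Semantic Scholar, joined by its DOI
# (the Semantic Scholar ID below is a made-up placeholder):
dblp_rec = Entity(identifiers={"dblp:conf/cvpr/HeZRS16", "doi:10.1109/CVPR.2016.90"})
ss_rec = Entity(identifiers={"ss:9cf78bbce8", "doi:10.1109/CVPR.2016.90"})
assert same_entity(dblp_rec, ss_rec)
merged = merge(dblp_rec, ss_rec)  # now carries all three identifiers
```

Because identity is a set intersection rather than a single key, records from different sources can be unified as soon as any one identifier (DOI, DBLP key, etc.) overlaps.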
## Runtime Logic (from Types to Behavior)

For each BFS step, the pipeline does the following:

1. Resolve entity identity using identifiers (the cache registry may merge aliases).
2. Load parent info from cache; on miss, fetch from `DataSrc`, save to `DataDst`, then cache it.
3. Load pending children from cache; on miss, fetch from `DataSrc` and register them in the pending cache.
4. For each child: fetch/save/cache child info if needed.
5. Create the link in `DataDst` only if the commit cache says the link is new.

This design gives:

- low duplicate API calls (info/pending caching),
- low duplicate writes (committed-link tracking),
- stable cross-source identity (identifier merging).
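The cached step can be sketched with plain dicts and sets standing in for the cache layers (all names here are illustrative, not the library's API):

```python
import asyncio


async def bfs_step(parent, fetch_info, fetch_children, persist,
                   info_cache, pending_cache, committed_links):
    """One cached BFS step; returns children newly linked to `parent`."""
    # Parent info: a cache hit skips both the API call and the DB write.
    if parent not in info_cache:
        info_cache[parent] = await fetch_info(parent)
        await persist("info", info_cache[parent])
    # Pending children: fetched once, then replayed from cache.
    if parent not in pending_cache:
        pending_cache[parent] = await fetch_children(parent)
    new_children = []
    for child in pending_cache[parent]:
        if child not in info_cache:
            info_cache[child] = await fetch_info(child)
            await persist("info", info_cache[child])
        # The link is written only if the commit cache has not seen it.
        if (parent, child) not in committed_links:
            committed_links.add((parent, child))
            await persist("link", (parent, child))
            new_children.append(child)
    return new_children


async def demo():
    calls = []

    async def fetch_info(key):
        calls.append(key)  # record every simulated API call
        return {"id": key}

    async def fetch_children(key):
        return ["p1", "p2"]

    async def persist(kind, value):
        pass  # a real DataDst would write to the graph database here

    info, pending, committed = {}, {}, set()
    first = await bfs_step("author", fetch_info, fetch_children, persist,
                           info, pending, committed)
    second = await bfs_step("author", fetch_info, fetch_children, persist,
                            info, pending, committed)
    return calls, first, second


calls, first, second = asyncio.run(demo())
# The first pass links both papers; repeating the step fetches and links nothing.
```

Running the same step twice shows the dedup behavior: the second pass makes zero API calls and commits zero links.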
## How to Use This Model

- CLI path: choose one implementation per layer (initializer + datasrc + cache + datadst), then run BFS.
- Python API path: instantiate those same layer objects directly and call `await weaver.bfs(...)`.

Both paths share the same architecture; only the implementations change.
## What You Can Extend

You can extend at any layer independently:

- New `DataSrc`: implement the `DataSrc` abstract methods for your API.
- New `DataDst`: implement `DataDst` to write into another graph/DB.
- New cache backend: implement `IdentifierRegistryIface`, `InfoStorageIface`, `PendingListStorageIface`, and `CommittedLinkStorageIface`, then compose them with the cache classes.
- New initializer: implement one of `fetch_papers()`, `fetch_authors()`, or `fetch_venues()`.
- New traversal recipe: compose existing relation interfaces into a new weaver class.
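As a sketch of the first bullet, a new source only has to fill in the fetch methods. The abstract base below is a local stand-in with invented method names; consult `paper_weaver`'s actual `DataSrc` for the real signatures:

```python
import asyncio
from abc import ABC, abstractmethod


class DataSrcLike(ABC):
    """Stand-in for paper_weaver's DataSrc contract (method names illustrative)."""

    @abstractmethod
    async def fetch_paper(self, identifier: str) -> dict: ...

    @abstractmethod
    async def fetch_references(self, identifier: str) -> list[str]: ...


class MyAPIDataSrc(DataSrcLike):
    """A hypothetical source backed by some in-house paper API."""

    def __init__(self, records: dict[str, dict]):
        # A real implementation would hold an HTTP client; we hold a dict.
        self.records = records

    async def fetch_paper(self, identifier: str) -> dict:
        return self.records[identifier]

    async def fetch_references(self, identifier: str) -> list[str]:
        return self.records[identifier].get("references", [])


src = MyAPIDataSrc({"p1": {"title": "A Paper", "references": ["p2"]}})
paper = asyncio.run(src.fetch_paper("p1"))
refs = asyncio.run(src.fetch_references("p1"))
```

Because traversal code depends only on the contract, a weaver wired with such a source needs no changes anywhere else.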
## Built-in Extension Examples

- **Switch data source without changing traversal logic**
  - `DBLPDataSrc` provides strong bibliographic/venue coverage.
  - `SemanticScholarDataSrc` provides references/citations APIs.
  - Both satisfy `DataSrc`, so the same weaver can run with either source.
- **Switch cache backend without changing source/destination**
  - Use `HybridCacheBuilder().with_all_memory()` for local runs.
  - Use Redis components (`with_all_redis(...)` or mixed `with_memory_*` + `with_redis_*`) for persistent/distributed runs.
  - Traversal code remains unchanged because all variants satisfy the same cache interfaces.
- **Expand initializer granularity**
  - `DBLPAuthorsInitializer` seeds by author pids.
  - `DBLPVenueIndexInitializer` expands a DBLP venue index into many venues before BFS.
  - Both plug into the same initializer contract and only change the BFS starting frontier.
- **Create a custom combined weaver**
  - The existing `Author2Paper2VenueWeaver` is a template: it composes relation interfaces and defines `init()` + `bfs_once()`.
  - You can build another combination (for example, one that includes reference/citation traversal) by composing the corresponding interfaces with a compatible cache.
## Command-Line Options

### Weaver Options

| Option | Default | Description |
|---|---|---|
| `--weaver-type` | `a2p2v` | Weaver type |
| `-n, --max-iterations` | `0` | Max BFS iterations (0 = until no new data) |
| `-v, --verbose` | - | Increase verbosity (`-v`: INFO, `-vv`: DEBUG) |

### Initialization Options

| Option | Default | Description |
|---|---|---|
| `--init-type` | `dblp` | Initializer type |
| `--init-mode` | `authors` | Initialization mode: papers, authors, or venues |
| `--init-dblp-record-keys` | - | DBLP record keys (e.g., `conf/cvpr/HeZRS16`) |
| `--init-dblp-pids` | - | DBLP person IDs (e.g., `h/KaimingHe`) |
| `--init-dblp-venue-keys` | - | DBLP venue keys (e.g., `db/conf/cvpr/cvpr2016`) |

### Data Source Options

| Option | Default | Description |
|---|---|---|
| `--datasrc-type` | `dblp` | Data source: dblp or semanticscholar |
| `--datasrc-cache-mode` | `memory` | Cache backend: memory or redis |
| `--datasrc-redis-url` | `redis://localhost:6379` | Redis URL for data source cache |
| `--datasrc-max-concurrent` | `10` | Maximum concurrent HTTP requests |
| `--datasrc-http-proxy` | - | HTTP proxy URL |
| `--datasrc-http-timeout` | `30` | HTTP timeout in seconds |
| `--datasrc-ss-api-key` | - | Semantic Scholar API key |

### Cache Options

| Option | Default | Description |
|---|---|---|
| `--cache-mode` | `memory` | Cache backend: memory or redis |
| `--cache-redis-url` | `redis://localhost:6379` | Default Redis URL |
| `--cache-redis-prefix` | `paper-weaver-cache` | Redis key prefix |

### Neo4j Options

| Option | Default | Description |
|---|---|---|
| `--datadst-neo4j-uri` | `bolt://localhost:7687` | Neo4j connection URI |
| `--datadst-neo4j-user` | `neo4j` | Neo4j username |
| `--datadst-neo4j-password` | `neo4j` | Neo4j password |
| `--datadst-neo4j-database` | `neo4j` | Neo4j database name |
## Using with Redis Cache

For large-scale crawling, use Redis for persistent caching:

```shell
paper-weaver \
  --init-mode authors \
  --init-dblp-pids h/KaimingHe \
  --cache-mode redis \
  --cache-redis-url redis://localhost:6379 \
  --datasrc-cache-mode redis \
  --datasrc-redis-url redis://localhost:6379 \
  --datadst-neo4j-uri bolt://localhost:7687 \
  -v
```
## Graph Schema

PaperWeaver creates the following nodes and relationships in Neo4j:

### Nodes

- **Paper**: `title`, `year`, `venue`, `doi`, etc.
- **Author**: `name`, `pid`, `orcid`, etc.
- **Venue**: `name`, `type` (journal/proceedings/book)

### Relationships

- `(Author)-[:AUTHORED]->(Paper)`
- `(Paper)-[:PUBLISHED_IN]->(Venue)`
- `(Paper)-[:CITES]->(Paper)`
- `(Paper)-[:REFERENCES]->(Paper)`
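With that schema in place, the resulting graph can be queried with ordinary Cypher. For example, an illustrative query (assuming the labels, relationship types, and property names listed above) that lists an author's papers together with their venues:

```cypher
// Papers by a given author and the venues they appeared in
MATCH (a:Author)-[:AUTHORED]->(p:Paper)-[:PUBLISHED_IN]->(v:Venue)
WHERE a.name = "Kaiming He"
RETURN p.title, p.year, v.name
ORDER BY p.year DESC
```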
## Python API

```python
import asyncio

from neo4j import AsyncGraphDatabase

from paper_weaver import Author2Paper2VenueWeaver
from paper_weaver.datasrc.dblp import DBLPDataSrc
from paper_weaver.datadst.neo4j import Neo4jDataDst
from paper_weaver.cache import HybridCacheBuilder
from paper_weaver.initializer.dblp import DBLPAuthorsInitializer


async def main():
    # Set up components
    datasrc = DBLPDataSrc()
    cache = HybridCacheBuilder().with_all_memory().build_weaver_cache()
    initializer = DBLPAuthorsInitializer(["h/KaimingHe"])

    # Set up Neo4j
    driver = AsyncGraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    session = driver.session(database="neo4j")
    datadst = Neo4jDataDst(session)

    # Create the weaver and run BFS
    weaver = Author2Paper2VenueWeaver(
        src=datasrc,
        dst=datadst,
        cache=cache,
        initializer=initializer,
    )
    total = await weaver.bfs(max_iterations=10)
    print(f"Processed {total} items")

    await driver.close()


asyncio.run(main())
```
## Requirements

- Python 3.10+
- Neo4j 4.0+ (for graph storage)
- Redis (optional, for distributed caching)

## License

MIT License