Weave academic paper data from various sources (DBLP, Semantic Scholar) into graph databases.
PaperWeaver
PaperWeaver weaves academic paper data from multiple sources (DBLP, Semantic Scholar) into a graph database (Neo4j). It uses breadth-first search (BFS) to explore and collect papers, authors, venues, citations, and references, building an academic knowledge graph from one or more starting entities.
Features
- Multiple Data Sources
  - DBLP API - bibliographic information
  - Semantic Scholar API - citations and references
- Graph Database Output
  - Neo4j - store papers, authors, venues and their relationships
- Flexible Caching
  - In-memory cache for simple use cases
  - Redis cache for distributed and persistent caching
- BFS Traversal
  - Start from authors, papers, or venues
  - Automatically discover related entities through citations, references, and authorship
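As a rough illustration of the traversal idea (not PaperWeaver's actual implementation), the sketch below runs a level-by-level BFS over a small in-memory graph. The `bfs` function, the `GRAPH` mapping, and the toy node names are all hypothetical; only the convention that `max_iterations=0` means "run until no new data" is taken from the `-n` option described in this README.

```python
from typing import Dict, Iterable, List, Set


def bfs(seeds: Iterable[str], neighbors: Dict[str, List[str]], max_iterations: int = 0) -> Set[str]:
    """Level-by-level BFS; max_iterations=0 means run until nothing new is found."""
    visited: Set[str] = set(seeds)
    frontier: List[str] = sorted(visited)
    iteration = 0
    while frontier and (max_iterations == 0 or iteration < max_iterations):
        next_frontier: List[str] = []
        for node in frontier:
            for nxt in neighbors.get(node, []):
                if nxt not in visited:
                    visited.add(nxt)
                    next_frontier.append(nxt)
        frontier = next_frontier
        iteration += 1
    return visited


# Toy graph: an author links to papers; papers link to venues and cited papers.
GRAPH = {
    "author:A": ["paper:P1", "paper:P2"],
    "paper:P1": ["venue:V1", "paper:P3"],
    "paper:P2": ["venue:V1"],
    "paper:P3": ["venue:V2"],
}

one_hop = bfs(["author:A"], GRAPH, max_iterations=1)  # seed plus its direct neighbors
everything = bfs(["author:A"], GRAPH)                 # runs until the frontier is empty
```

Each iteration expands the whole current frontier, so a small `-n` value keeps a crawl close to its seeds while `0` follows every reachable link.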
Installation
pip install paper-weaver
Or install from source:
git clone https://github.com/yindaheng98/PaperWeaver.git
cd PaperWeaver
pip install -e .
Quick Start
Basic Usage
Start from an author and explore their papers and related venues:
paper-weaver \
--init-mode authors \
--init-dblp-pids h/KaimingHe \
--datadst-neo4j-uri bolt://localhost:7687 \
--datadst-neo4j-user neo4j \
--datadst-neo4j-password your-password \
-n 10 -v
Start from Papers
paper-weaver \
--init-mode papers \
--init-dblp-record-keys conf/cvpr/HeZRS16 journals/pami/HeZRS16 \
--datadst-neo4j-uri bolt://localhost:7687 \
-n 5 -v
Start from Venues
paper-weaver \
--init-mode venues \
--init-dblp-venue-keys db/conf/cvpr/cvpr2016 \
--datadst-neo4j-uri bolt://localhost:7687 \
-n 5 -v
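The three commands above differ mainly in the init mode and the seed flag that goes with it. The following sketch assembles such a command line from Python. `build_command` and `SEED_FLAGS` are hypothetical helpers, and the pairing of each mode with one seed flag is inferred from the examples above rather than from a documented rule.

```python
import shlex
from typing import List, Sequence

# Each init mode is paired with the seed flag used in the examples above.
SEED_FLAGS = {
    "authors": "--init-dblp-pids",
    "papers": "--init-dblp-record-keys",
    "venues": "--init-dblp-venue-keys",
}


def build_command(mode: str, seeds: Sequence[str],
                  neo4j_uri: str = "bolt://localhost:7687",
                  max_iterations: int = 5) -> List[str]:
    """Assemble a paper-weaver argument list for the given init mode."""
    if mode not in SEED_FLAGS:
        raise ValueError(f"unknown init mode: {mode!r}")
    if not seeds:
        raise ValueError("at least one seed key is required")
    return [
        "paper-weaver",
        "--init-mode", mode,
        SEED_FLAGS[mode], *seeds,
        "--datadst-neo4j-uri", neo4j_uri,
        "-n", str(max_iterations),
        "-v",
    ]


cmd = build_command("venues", ["db/conf/cvpr/cvpr2016"])
print(shlex.join(cmd))
# To actually launch the crawl: subprocess.run(cmd, check=True)
```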
Command-Line Options
Weaver Options
| Option | Default | Description |
|---|---|---|
| --weaver-type | a2p2v | Weaver type |
| -n, --max-iterations | 0 | Max BFS iterations (0 = until no new data) |
| -v, --verbose | - | Increase verbosity (-v: INFO, -vv: DEBUG) |
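A repeatable `-v` flag like the one in the table is commonly implemented with a counting argument. The sketch below shows one way this could work with `argparse`. It is not taken from PaperWeaver's source: the `verbosity_to_level` helper and the WARNING default for zero flags are assumptions.

```python
import argparse
import logging


def verbosity_to_level(count: int) -> int:
    """Map the number of -v flags to a logging level (WARNING default is assumed)."""
    if count >= 2:
        return logging.DEBUG
    if count == 1:
        return logging.INFO
    return logging.WARNING


parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("-v", "--verbose", action="count", default=0)
parser.add_argument("-n", "--max-iterations", type=int, default=0)

# Simulate running the tool with "-vv -n 10".
args = parser.parse_args(["-vv", "-n", "10"])
level = verbosity_to_level(args.verbose)
logging.basicConfig(level=level)
```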
Initialization Options
| Option | Default | Description |
|---|---|---|
| --init-type | dblp | Initializer type |
| --init-mode | authors | Initialization mode: papers, authors, or venues |
| --init-dblp-record-keys | - | DBLP record keys (e.g., conf/cvpr/HeZRS16) |
| --init-dblp-pids | - | DBLP person IDs (e.g., h/KaimingHe) |
| --init-dblp-venue-keys | - | DBLP venue keys (e.g., db/conf/cvpr/cvpr2016) |
Data Source Options
| Option | Default | Description |
|---|---|---|
| --datasrc-type | dblp | Data source: dblp or semanticscholar |
| --datasrc-cache-mode | memory | Cache backend: memory or redis |
| --datasrc-redis-url | redis://localhost:6379 | Redis URL for data source cache |
| --datasrc-max-concurrent | 10 | Maximum concurrent HTTP requests |
| --datasrc-http-proxy | - | HTTP proxy URL |
| --datasrc-http-timeout | 30 | HTTP timeout in seconds |
| --datasrc-ss-api-key | - | Semantic Scholar API key |
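To make the effect of a concurrency cap and a per-request timeout concrete, here is a minimal sketch using `asyncio.Semaphore` and `asyncio.wait_for`. It simulates requests with `asyncio.sleep` instead of real HTTP calls. `fetch_all` and `fake_fetch` are hypothetical names, and nothing here is claimed to match how PaperWeaver enforces `--datasrc-max-concurrent` or `--datasrc-http-timeout` internally.

```python
import asyncio
from typing import Any, Dict, List


async def fetch_all(keys: List[str], max_concurrent: int = 10, timeout: float = 30.0) -> Dict[str, Any]:
    """Run simulated fetches with at most max_concurrent in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_fetch(key: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # waits here once max_concurrent fetches are running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                # Stand-in for an HTTP request; a real client would call the API here.
                await asyncio.wait_for(asyncio.sleep(0.01), timeout=timeout)
                return f"record for {key}"
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(fake_fetch(k) for k in keys))
    return {"results": results, "peak": peak}


outcome = asyncio.run(fetch_all([f"key{i}" for i in range(25)], max_concurrent=10))
```

Lowering the cap trades crawl speed for a lighter load on the upstream API, which may matter when no Semantic Scholar API key is supplied.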
Cache Options
| Option | Default | Description |
|---|---|---|
| --cache-mode | memory | Cache backend: memory or redis |
| --cache-redis-url | redis://localhost:6379 | Default Redis URL |
| --cache-redis-prefix | paper-weaver-cache | Redis key prefix |
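A key prefix lets several crawls share one cache store without colliding. The sketch below illustrates the idea with a plain dict standing in for Redis. The `PrefixedCache` class and the `prefix:kind:id` key layout are assumptions made for illustration, not PaperWeaver's documented key format.

```python
from typing import Dict, Optional


class PrefixedCache:
    """Cache view over a shared store; every key is namespaced with a prefix."""

    def __init__(self, store: Dict[str, str], prefix: str = "paper-weaver-cache") -> None:
        self._store = store
        self._prefix = prefix

    def _key(self, kind: str, ident: str) -> str:
        # Assumed layout: "<prefix>:<kind>:<id>".
        return f"{self._prefix}:{kind}:{ident}"

    def get(self, kind: str, ident: str) -> Optional[str]:
        return self._store.get(self._key(kind, ident))

    def set(self, kind: str, ident: str, value: str) -> None:
        self._store[self._key(kind, ident)] = value


shared: Dict[str, str] = {}  # stands in for a single Redis instance
cache_a = PrefixedCache(shared, prefix="paper-weaver-cache")
cache_b = PrefixedCache(shared, prefix="another-crawl")
cache_a.set("paper", "conf/cvpr/HeZRS16", "cached-record")
```

Here `cache_b` cannot see the entry written through `cache_a`, even though both write to the same underlying store.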
Neo4j Options
| Option | Default | Description |
|---|---|---|
| --datadst-neo4j-uri | bolt://localhost:7687 | Neo4j connection URI |
| --datadst-neo4j-user | neo4j | Neo4j username |
| --datadst-neo4j-password | neo4j | Neo4j password |
| --datadst-neo4j-database | neo4j | Neo4j database name |
Using with Redis Cache
For large-scale crawling, use Redis for persistent caching:
paper-weaver \
--init-mode authors \
--init-dblp-pids h/KaimingHe \
--cache-mode redis \
--cache-redis-url redis://localhost:6379 \
--datasrc-cache-mode redis \
--datasrc-redis-url redis://localhost:6379 \
--datadst-neo4j-uri bolt://localhost:7687 \
-v
Graph Schema
PaperWeaver creates the following nodes and relationships in Neo4j:
Nodes
- Paper: title, year, venue, doi, etc.
- Author: name, pid, orcid, etc.
- Venue: name, type (journal/proceedings/book)
Relationships
- (Author)-[:AUTHORED]->(Paper)
- (Paper)-[:PUBLISHED_IN]->(Venue)
- (Paper)-[:CITES]->(Paper)
- (Paper)-[:REFERENCES]->(Paper)
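Once a crawl has populated the database, the schema above can be queried with ordinary Cypher. The sketch below lists one author's papers together with their venues, using the synchronous `neo4j` Python driver. The labels and relationship types come from the schema above. The exact property names (`pid`, `title`, `year`, `name`) and the stored form of the author ID are assumptions that may need adjusting against a real database, and `papers_by_author` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Papers by one author, with venues. Property names are assumed from the node list above.
PAPERS_BY_AUTHOR = (
    "MATCH (a:Author {pid: $pid})-[:AUTHORED]->(p:Paper) "
    "OPTIONAL MATCH (p)-[:PUBLISHED_IN]->(v:Venue) "
    "RETURN p.title AS title, p.year AS year, v.name AS venue "
    "ORDER BY p.year DESC"
)


def papers_by_author(pid: str, uri: str = "bolt://localhost:7687",
                     user: str = "neo4j", password: str = "neo4j",
                     database: str = "neo4j") -> List[Dict[str, Any]]:
    """Run the query against a live Neo4j instance and return plain dicts."""
    # Imported lazily so this module loads even where the driver is not installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session(database=database) as session:
            result = session.run(PAPERS_BY_AUTHOR, pid=pid)
            return [record.data() for record in result]
```

With a running Neo4j instance, `papers_by_author("h/KaimingHe")` would return one dict per paper, assuming the author's DBLP person ID is stored in `pid` in that form.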
Python API
import asyncio

from neo4j import AsyncGraphDatabase

from paper_weaver import Author2Paper2VenueWeaver
from paper_weaver.datasrc.dblp import DBLPDataSrc
from paper_weaver.datadst.neo4j import Neo4jDataDst
from paper_weaver.cache import HybridCacheBuilder
from paper_weaver.initializer.dblp import DBLPAuthorsInitializer


async def main():
    # Set up the data source, cache, and initial BFS seeds
    datasrc = DBLPDataSrc()
    cache = HybridCacheBuilder().with_all_memory().build_weaver_cache()
    initializer = DBLPAuthorsInitializer(["h/KaimingHe"])

    # Set up Neo4j
    driver = AsyncGraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    session = driver.session(database="neo4j")
    datadst = Neo4jDataDst(session)

    try:
        # Create the weaver and run up to 10 BFS iterations
        weaver = Author2Paper2VenueWeaver(
            src=datasrc,
            dst=datadst,
            cache=cache,
            initializer=initializer,
        )
        total = await weaver.bfs(max_iterations=10)
        print(f"Processed {total} items")
    finally:
        # Release the session and driver even if the crawl fails
        await session.close()
        await driver.close()


asyncio.run(main())
Requirements
- Python 3.10+
- Neo4j 4.0+ (for graph storage)
- Redis (optional, for distributed caching)
License
MIT License