A graph interface for academic tasks

🎡 ResearchArcade: Graph Interface for Academic Tasks

🌐 Project Page | 📜 arXiv | 🔗 Dataset

🗞️ News

  • Nov 27, 2025 — Our paper is out on arXiv (2511.22036)

🧭 Motivation

Academic data is distributed across multiple platforms (e.g., ArXiv, OpenReview) and modalities (text, figures, tables, reviews). ResearchArcade unifies these heterogeneous data sources into a single graph-based interface to enable large-scale, structured, and temporal analysis of academic data.

Core Features

  • Multi-Source: ArXiv (Academic Corpora) & OpenReview (Peer Reviews and Manuscript Revisions)
  • Multi-Modal: Figures and Tables in Academic Corpora
  • Highly Structured and Heterogeneous: Data can be viewed naturally as heterogeneous graphs stored in a multi-table format
  • Dynamically Evolving: Manuscript (Intra-paper) Level (e.g., Paper Revision) & Community (Inter-paper) Level (e.g., Paper Citation with Timestamp)
  • Highly Scalable: Graph is readily extensible as new items can be added by simply appending a row to the table

Data Illustration

[Figure: data_description]

Tables are either node tables (colored) or edge tables (black and white). Columns in blue (the OpenReview part) or red (the ArXiv part) hold the unique identifier of each node or edge, and the remaining columns hold its features. Converting the tables into a heterogeneous graph is straightforward: each row of a node table becomes a node, and each row of an edge table becomes an edge.
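The table-to-graph conversion can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the `build_graph` helper and the toy rows are hypothetical, and only the column names are borrowed from the examples in this README.

```python
from collections import defaultdict

# Toy node tables: one dict per row. Column names follow the examples in
# this README; the rows themselves are made up for illustration.
papers = [
    {"paper_openreview_id": "paper123", "title": "A Paper"},
    {"paper_openreview_id": "paper456", "title": "Another Paper"},
]
authors = [
    {"author_openreview_id": "~john_doe1", "author_full_name": "John Doe"},
]
# Toy edge table: each row holds the identifiers of its two endpoints.
papers_authors = [
    {"paper_openreview_id": "paper123", "author_openreview_id": "~john_doe1"},
    {"paper_openreview_id": "paper456", "author_openreview_id": "~john_doe1"},
]


def build_graph(node_tables, edge_tables):
    """Turn node/edge tables into (nodes, adjacency).

    node_tables: {node_type: (id_column, rows)}
    edge_tables: [(src_type, src_column, dst_type, dst_column, rows)]
    """
    nodes = {}
    for node_type, (id_column, rows) in node_tables.items():
        for row in rows:
            # A node is keyed by (type, id); the other columns are features.
            features = {k: v for k, v in row.items() if k != id_column}
            nodes[(node_type, row[id_column])] = features

    adjacency = defaultdict(set)
    for src_type, src_col, dst_type, dst_col, rows in edge_tables:
        for row in rows:
            src = (src_type, row[src_col])
            dst = (dst_type, row[dst_col])
            # Store both directions so either endpoint can be traversed.
            adjacency[src].add(dst)
            adjacency[dst].add(src)
    return nodes, adjacency


nodes, adjacency = build_graph(
    {
        "paper": ("paper_openreview_id", papers),
        "author": ("author_openreview_id", authors),
    },
    [("paper", "paper_openreview_id", "author", "author_openreview_id", papers_authors)],
)
print(sorted(adjacency[("author", "~john_doe1")]))
```

Under this view, adding a new paper or a new authorship link is just one more row in the corresponding list.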

🚀 Get started

Supported Features

  • Dual Backend Support: CSV backend & PostgreSQL backend
  • Comprehensive Data
    • OpenReview: Support for papers, authors, reviews, revisions, paragraphs, and their interconnections
    • ArXiv: Support for papers, authors, paragraphs, sections, figures, tables and their interconnections
  • Flexible Data Import: Load data from the OpenReview API, ArXiv API, CSV files, or JSON files
  • Flexible Data Output: Results are returned as pd.DataFrame objects, which can be conveniently converted into CSV or JSON files.
  • Graph-like Operations: Navigate relationships between entities
    • OpenReview: authorship (paper-author), comment-under-paper (paper-review), revision-of-paper (paper-revision), revision-caused-by-review (revision-review), etc.
    • ArXiv: citationship (paper-paper), authorship (paper-author), paragraph-of-paper (paper-paragraph), figure-of-paper (paper-figure), table-of-paper (paper-table), etc.
  • CRUD Operations: Full support for Create, Read, Update, and Delete operations on all entities
  • Continuous Crawling: Automatically crawls newly updated arXiv data and integrates it into the graph
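Because query results are pandas DataFrames, exporting them relies on standard pandas methods. The sketch below assumes pandas is installed and uses an invented stand-in DataFrame rather than a real query result; the file names are arbitrary.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for a DataFrame returned by a ResearchArcade query; the rows
# are invented for illustration.
papers_df = pd.DataFrame(
    [
        {"paper_openreview_id": "paper123", "title": "A Paper"},
        {"paper_openreview_id": "paper456", "title": "Another Paper"},
    ]
)

out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "papers.csv"
json_path = out_dir / "papers.json"

# index=False leaves the pandas row index out of the CSV file.
papers_df.to_csv(csv_path, index=False)
# orient="records" writes a JSON list with one object per row.
papers_df.to_json(json_path, orient="records")

records = json.loads(json_path.read_text())
print(len(records), "rows written to", json_path.name)
```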

Setup

1. Environment Setup

  • Python ≥ 3.9 (tested on 3.10)
  • PostgreSQL ≥ 14 (for SQL backend)
  • Conda ≥ 22.0 (recommended)
  • API keys:
    • Semantic Scholar API

Python Setup

# create a new environment
conda create -n research_arcade python=3.10
conda activate research_arcade

# install related libraries
pip install -r requirements.txt

PostgreSQL Setup

# Download Source File
wget https://ftp.postgresql.org/pub/source/v16.2/postgresql-16.2.tar.gz
tar -xvzf postgresql-16.2.tar.gz
cd postgresql-16.2

# Set Installation Path
export INSTALL_DIR=/YOUR/INSTALL/DIR
mkdir -p $INSTALL_DIR

# Compile and Install
./configure --prefix=$INSTALL_DIR --without-icu --without-readline
make
make install

# Add PostgreSQL to PATH
export PATH=$INSTALL_DIR/bin:$PATH

# Set the Data Directory
export PGDATA=/YOUR/DATA/DIR
mkdir -p $PGDATA

# Initialize Database
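# WARNING: run initdb only once. It requires an empty data directory, so
# re-initializing means emptying $PGDATA first, which deletes all existing databases.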
initdb -D $PGDATA

# Launch Database
pg_ctl -D $PGDATA -l logfile start

# Create Database
createdb iclr_openreview_database
psql iclr_openreview_database

# Configure PostgreSQL for Python Access (Enable TCP Listening)
# Append the two settings below to the end of postgresql.conf
cat >> "$PGDATA/postgresql.conf" <<'EOF'
listen_addresses = 'localhost'
port = 5432
EOF

# Allow TCP Connection Authentication
# Append the rule below to the end of pg_hba.conf
cat >> "$PGDATA/pg_hba.conf" <<'EOF'
# Allow local TCP connections to use md5 password authentication
host    all             all             127.0.0.1/32            md5
EOF

# Restart the Database (after configuration changes, or if the connection is lost)
pg_ctl -D $PGDATA restart
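
After the restart, a quick way to confirm from Python that the server accepts TCP connections is a plain socket check. This is a minimal sketch using only the standard library; the `postgres_port_open` helper is hypothetical, and it verifies only that the port is reachable, not that your credentials or database name are valid.

```python
import socket


def postgres_port_open(host="localhost", port=5432, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


print("TCP port reachable:", postgres_port_open())
```

If this prints False, check that `listen_addresses` and `port` in postgresql.conf match the values used here and that the server was restarted.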

2. Configure Environment Variables

To run the code, you’ll need to set environment variables such as your Semantic Scholar API key and the database configuration.

Copy the template file to .env in the project root directory, then fill in your own values:

cp .env.template .env
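
For readers unfamiliar with .env files, the sketch below shows how such a file of KEY=VALUE lines can be read with the standard library. It is an illustration of the file format only: the package may load .env by its own means, the `load_env_file` helper is hypothetical, and the variable names in the demo are invented (the real names are those listed in .env.template).

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines from a .env file into a dict.

    Blank lines and lines starting with '#' are skipped; quotes
    surrounding a value are removed.
    """
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


# Demo with a throwaway file; these variable names are hypothetical.
demo = Path(tempfile.mkdtemp()) / ".env"
demo.write_text('# example\nSEMANTIC_SCHOLAR_API_KEY="abc123"\nDB_PORT=5432\n')

settings = load_env_file(demo)
os.environ.update(settings)  # make the values visible to os.getenv
print(sorted(settings))
```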

3. Backend Selection

Initialize with CSV Backend

from research_arcade import ResearchArcade

research_arcade = ResearchArcade(
    db_type="csv",
    config={"csv_dir": "/path/to/csv/data/"}
)

Initialize with SQL Backend

from research_arcade import ResearchArcade

research_arcade = ResearchArcade(
    db_type="sql",
    config={
        "host": "localhost",
        "dbname": "conference_db",
        "user": "username",
        "password": "password",
        "port": "5432"
    }
)
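
If the same script should run against either backend, the choice can be driven by environment variables. The sketch below only builds the `db_type` and `config` values in the shape used by the two snippets above; the `backend_settings` helper and the `RA_*` variable names are hypothetical and not part of the package.

```python
import os


def backend_settings(env=os.environ):
    """Return (db_type, config) shaped like the two initialization snippets.

    The RA_* variable names are hypothetical, chosen for this sketch.
    """
    if env.get("RA_BACKEND", "csv") == "sql":
        return "sql", {
            "host": env.get("RA_DB_HOST", "localhost"),
            "dbname": env["RA_DB_NAME"],
            "user": env["RA_DB_USER"],
            "password": env["RA_DB_PASSWORD"],
            "port": env.get("RA_DB_PORT", "5432"),
        }
    return "csv", {"csv_dir": env.get("RA_CSV_DIR", "./csv_data/")}


db_type, config = backend_settings({"RA_BACKEND": "csv", "RA_CSV_DIR": "/tmp/ra/"})
print(db_type, config)

# With the package installed, the result would be passed on as:
# research_arcade = ResearchArcade(db_type=db_type, config=config)
```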

Core Operations

The following examples demonstrate the core operations available in ResearchArcade. For comprehensive examples covering all supported tables and operations, please refer to the examples/tutorials.ipynb file in the repository.

Table Construction

# From API
config = {"venue": "ICLR.cc/2025/Conference"}
research_arcade.construct_table_from_api("openreview_papers", config)

# From CSV file
config = {"csv_file": "/path/to/papers.csv"}
research_arcade.construct_table_from_csv("openreview_papers", config)

# From JSON file
config = {"json_file": "/path/to/papers.json"}
research_arcade.construct_table_from_json("openreview_papers", config)
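
To try the CSV path without real data, a small input file can be written with the standard library. This is a sketch under assumptions: the column names are copied from the author record shown under Node Manipulation, and whether the CSV loader expects exactly these columns, or accepts the `openreview_authors` table name, is not confirmed by this README.

```python
import csv
import tempfile
from pathlib import Path

# Column names copied from the author record in this README; the exact
# schema expected by the loader is an assumption.
FIELDS = ["venue", "author_openreview_id", "author_full_name", "email", "affiliation"]

rows = [
    {
        "venue": "ICLR.cc/2025/Conference",
        "author_openreview_id": "~john_doe1",
        "author_full_name": "John Doe",
        "email": "john@university.edu",
        "affiliation": "University Name",
    }
]

csv_path = Path(tempfile.mkdtemp()) / "authors.csv"
with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the header and the row count.
with csv_path.open(newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))
print(len(loaded), "row(s) with columns", list(loaded[0]))

# With a configured instance, the file could then be loaded with:
# research_arcade.construct_table_from_csv("openreview_authors", {"csv_file": str(csv_path)})
```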

Query Operations

# Get all entities
papers_df = research_arcade.get_all_node_features("openreview_papers")

# Get specific entity by ID
paper_id = {"paper_openreview_id": "zGej22CBnS"}
paper = research_arcade.get_node_features_by_id("openreview_papers", paper_id)

# Get relationships
paper_authors = research_arcade.get_neighborhood("openreview_papers_authors", paper_id)
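
Since results come back as DataFrames, a relationship query can be combined with node features through an ordinary pandas join. The sketch below uses invented stand-in DataFrames; the columns that the real queries return are an assumption here, and pandas must be installed.

```python
import pandas as pd

# Stand-ins for DataFrames that queries like the ones above might return;
# the rows are invented and the returned columns are an assumption.
paper_authors = pd.DataFrame(
    [
        {"paper_openreview_id": "paper123", "author_openreview_id": "~john_doe1"},
        {"paper_openreview_id": "paper123", "author_openreview_id": "~jane_roe1"},
    ]
)
authors_df = pd.DataFrame(
    [
        {"author_openreview_id": "~john_doe1", "author_full_name": "John Doe"},
        {"author_openreview_id": "~jane_roe1", "author_full_name": "Jane Roe"},
        {"author_openreview_id": "~alex_poe1", "author_full_name": "Alex Poe"},
    ]
)

# Inner join on the shared identifier column: keeps only the authors that
# appear in the edge rows and attaches their node features.
named = paper_authors.merge(authors_df, on="author_openreview_id", how="inner")
print(named[["paper_openreview_id", "author_full_name"]])
```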

Node Manipulation

# Insert new node
new_author = {
    'venue': 'ICLR.cc/2025/Conference',
    'author_openreview_id': '~john_doe1',
    'author_full_name': 'John Doe',
    'email': 'john@university.edu',
    'affiliation': 'University Name'
}
research_arcade.insert_node("openreview_authors", node_features=new_author)

# Update existing node
updated_paper = {
    'paper_openreview_id': 'paper123',
    'title': 'Updated Title',
    # ... other fields
}
research_arcade.update_node("openreview_papers", node_features=updated_paper)

# Delete a node
review_id = {"review_openreview_id": "review456"}
research_arcade.delete_node_by_id("openreview_reviews", review_id)

Edge Manipulation

# Create an edge
paper_author_edge = {
    'venue': 'ICLR.cc/2025/Conference',
    'paper_openreview_id': 'paper123',
    'author_openreview_id': '~john_doe1'
}
research_arcade.insert_edge("openreview_papers_authors", paper_author_edge)

# Delete an edge
research_arcade.delete_edge_by_id("openreview_papers_authors", paper_author_edge)
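
When a paper has several authors, the edge records can be generated rather than written by hand. This is a minimal sketch: the `paper_author_edges` helper is hypothetical, the second author ID is invented, and the record layout simply mirrors the edge example above.

```python
def paper_author_edges(venue, paper_id, author_ids):
    """Build one edge record per (paper, author) pair."""
    return [
        {
            "venue": venue,
            "paper_openreview_id": paper_id,
            "author_openreview_id": author_id,
        }
        for author_id in author_ids
    ]


edges = paper_author_edges(
    "ICLR.cc/2025/Conference", "paper123", ["~john_doe1", "~jane_roe1"]
)
print(len(edges), "edge records")

# With a configured instance, each record could then be inserted:
# for edge in edges:
#     research_arcade.insert_edge("openreview_papers_authors", edge)
```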

Continuous Crawling

research_arcade.continuous_crawling(
    interval_days=2,
    delay_days=2,
    paper_category='All',
    dest_dir="./download",
    arxiv_id_dest="./data",
)
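
Before starting a long-running crawl, it may help to make sure the destination paths exist. The sketch below assumes that both `dest_dir` and `arxiv_id_dest` are directories, which this README does not state explicitly; the crawler may also create them itself, in which case this step is harmless.

```python
from pathlib import Path

# Same paths as in the call above; treating both as directories is an
# assumption of this sketch.
dest_dir = Path("./download")
arxiv_id_dest = Path("./data")

for directory in (dest_dir, arxiv_id_dest):
    # parents=True creates missing parent folders; exist_ok=True makes
    # the call safe to repeat.
    directory.mkdir(parents=True, exist_ok=True)

print(all(d.is_dir() for d in (dest_dir, arxiv_id_dest)))
```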

Contribution

We’re working on extending support for data and operations. Contributions welcome!

Acknowledgements

This project builds on open academic infrastructures such as ArXiv and OpenReview.

License

This project is licensed under the MIT License – see the LICENSE file for details.

Citation

@misc{tinyscientist,
author       = {Jingjun Xu and Chongshan Lin and Haofei Yu and Tao Feng and Jiaxuan You},
title        = {ResearchArcade: Graph Interface for Academic Tasks},
howpublished = {https://github.com/ulab-uiuc/research-arcade},
note         = {Accessed: 2025-12-25},
year         = {2025}
}
