
Automate arXiv paper tracking with LLM-powered metadata extraction and Google Sheets sync.

arXivFlow

arXivFlow is a Python automation tool that fetches research paper metadata from arXiv, extracts keywords and contact information with a local LLM served by Ollama, and synchronizes the results with Google Sheets.

Project Overview

  • Purpose: Automates the tracking and processing of new research papers. It fetches data for specified arXiv categories, uses Ollama (Llama 3.2) to summarize and extract keywords/contact info from PDFs, and uploads the compiled data to a Google Sheet.
  • Main Technologies:
    • Python 3.13: Core language.
    • arxiv: Library for querying the arXiv API.
    • Ollama (Llama 3.2): Local LLM for intelligent extraction.
    • PyMuPDF: PDF text extraction for contact information retrieval.
    • pandas: Data manipulation and export to CSV, Excel, JSON, and SQLite.
    • gspread: Google Sheets API interaction.
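
The fetch step described under Purpose can be sketched as follows. This is a minimal illustration, not arXivFlow's own code: build_query and fetch_metadata are hypothetical helper names, and the sketch assumes the arxiv package's Client/Search interface and the arXiv API's submittedDate range filter.

```python
from datetime import datetime, timedelta, timezone


def build_query(categories, start, end):
    """Build an arXiv API query for several categories and a date window."""
    cats = " OR ".join(f"cat:{c}" for c in categories)
    # The submittedDate filter takes timestamps formatted as YYYYMMDDHHMM (GMT).
    window = f"submittedDate:[{start:%Y%m%d%H%M} TO {end:%Y%m%d%H%M}]"
    return f"({cats}) AND {window}"


def fetch_metadata(categories, days=3, max_results=50):
    """Return basic metadata for recent papers (hypothetical helper)."""
    import arxiv  # third-party: pip install arxiv

    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    search = arxiv.Search(
        query=build_query(categories, start, end),
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate,
    )
    client = arxiv.Client()
    return [
        {
            "id": r.entry_id,
            "title": r.title,
            "authors": [a.name for a in r.authors],
            "published": r.published.isoformat(),
            "summary": r.summary,
        }
        for r in client.results(search)
    ]
```

Calling fetch_metadata(["cs.AI", "cs.LG"]) needs network access; build_query can be checked offline.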

Architecture & Key Files

The code is organized as a package under src/arxivflow/.

Core Modules

  • src/arxivflow/arxivflow.py: Contains the arXivFlow class, which orchestrates the entire workflow:
    • Querying arXiv for specific categories and date ranges.
    • Downloading PDFs to the pdfs/ directory.
    • Processing results and extracting information.
    • Saving data to CSV, JSON, Excel, SQLite, or Google Sheets.
  • src/arxivflow/ollama_functions.py: Contains the OllamaFunctions class for interacting with the local Ollama API to extract keywords and contact details.
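
To make the extraction step concrete, here is a rough sketch of reading a PDF's first page with PyMuPDF and asking a local Ollama server for keywords and emails. It is not the OllamaFunctions implementation: every function name and the prompt wording are assumptions, and it talks to Ollama's default local HTTP endpoint directly rather than through whatever client the project uses.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def first_page_text(pdf_path):
    """Read the first page of a PDF, where author contacts usually appear."""
    import fitz  # PyMuPDF, third-party: pip install pymupdf

    with fitz.open(pdf_path) as doc:
        return doc[0].get_text()


def build_prompt(paper_text, max_chars=4000):
    """Ask for keywords and emails as JSON (hypothetical prompt wording)."""
    return (
        "Return a JSON object with two keys: 'keywords' (a list of up to 5 "
        "strings) and 'emails' (a list of strings) for this paper:\n\n"
        + paper_text[:max_chars]
    )


def _as_str_list(value):
    return [str(v) for v in value] if isinstance(value, list) else []


def parse_extraction(raw):
    """Parse the model reply, falling back to empty lists on malformed output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {
        "keywords": _as_str_list(data.get("keywords")),
        "emails": _as_str_list(data.get("emails")),
    }


def extract_with_ollama(paper_text, model="llama3.2"):
    """Send one non-streaming generate request and parse the JSON reply."""
    payload = json.dumps(
        {"model": model, "prompt": build_prompt(paper_text),
         "format": "json", "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.loads(response.read().decode("utf-8"))
    return parse_extraction(body.get("response", ""))
```

Keeping the parsing in its own function means a malformed model reply degrades to empty lists instead of stopping a batch run.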

Configuration & Data

  • user_input.json: Configures the target Google Sheet ID, CSV filename, and credentials path.
  • credentials.json: (User-provided) Google Service Account credentials.
  • requirements.txt: Project dependencies.
  • pdfs/: Local directory where downloaded research papers are stored.
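
As an illustration of how user_input.json might be read, the sketch below loads and validates the three settings listed above. The key names sheet_id, csv_filename, and credentials_path are assumptions made for this example; the project's actual key names are not documented here.

```python
import json
from dataclasses import dataclass
from pathlib import Path

# Hypothetical key names; check the project's own user_input.json for the real ones.
REQUIRED_KEYS = ("sheet_id", "csv_filename", "credentials_path")


@dataclass
class UserInput:
    sheet_id: str
    csv_filename: str
    credentials_path: str


def load_user_input(path="user_input.json"):
    """Load the config file and fail early if a setting is missing."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise KeyError(f"{path} is missing keys: {', '.join(missing)}")
    return UserInput(*(raw[key] for key in REQUIRED_KEYS))
```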

Building and Running

Prerequisites

  1. Python 3.13+: Ensure Python 3.13 or later is installed.
  2. Ollama: Install Ollama and pull the required model:
    ollama pull llama3.2
    
  3. Google Cloud Setup:
    • Enable Google Sheets and Google Drive APIs.
    • Create a Service Account and save the JSON key as credentials.json.
    • Share the target Google Sheet with the Service Account email.
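
Once the Service Account has access, a quick way to confirm the setup is to write a few rows with gspread. The sketch below is an assumption-laden example rather than arXivFlow's upload code: records_to_rows and upload_rows are hypothetical helpers, and the upload overwrites the first worksheet.

```python
def records_to_rows(records, columns):
    """Turn a list of dicts into a header row plus one row per record."""
    rows = [list(columns)]
    for record in records:
        row = []
        for column in columns:
            value = record.get(column, "")
            # Sheet cells hold scalars, so join list values into one string.
            if isinstance(value, (list, tuple)):
                value = "; ".join(str(v) for v in value)
            row.append("" if value is None else str(value))
        rows.append(row)
    return rows


def upload_rows(sheet_id, credentials_file, rows):
    """Replace the first worksheet's contents with the given rows."""
    import gspread  # third-party: pip install gspread

    client = gspread.service_account(filename=credentials_file)
    worksheet = client.open_by_key(sheet_id).sheet1
    worksheet.clear()
    worksheet.append_rows(rows, value_input_option="RAW")
```

If the sheet has not been shared with the Service Account email, open_by_key is where the call will fail, which makes it a useful first check.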

Setup

  1. Create and activate a virtual environment:
    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
  2. Install dependencies:
    pip install -r requirements.txt
    

Usage

The arXivFlow class can be used as follows:

from arxivflow import arXivFlow
import datetime

# Initialize with categories and optional Ollama model
flow = arXivFlow(
    categories=["cs.AI", "cs.LG"], 
    ollama_model="llama3.2",
    max_results=50,
    start_date=datetime.datetime.now() - datetime.timedelta(days=3)
)

# Optional: Set a custom path for PDF downloads
flow.set_pdfs_path("my_papers")

# Fetch data and optionally download PDFs for contact extraction
df = flow.get_arxiv_data(download_pdfs=True)

# Save to multiple formats
flow.save_to_csv("results.csv")
flow.save_to_json("results.json")
flow.save_to_excel("results.xlsx")
flow.save_to_sqlite("results.db")

# Sync with Google Sheets
flow.save_to_google_sheet(
    sheet_id="YOUR_SHEET_ID", 
    credentials_file="credentials.json"
)

Development Conventions

  • Modular Logic: All core functionality resides in src/arxivflow/.
  • Local AI: Keyword and contact extraction run locally through Ollama, so paper text is not sent to a hosted API and there are no per-request API costs. The tool checks that the configured model is available and pulls it if it is missing.
  • Data Persistence: Supports multiple export formats (CSV, JSON, Excel, SQLite) for flexibility.
  • Type Hinting: The codebase uses Python type hints for better maintainability and clarity.
  • Configurable PDF Handling: PDFs can be optionally downloaded and stored in custom directories.
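
The model verification mentioned under Local AI could look roughly like this. It is a sketch under assumptions, not the project's code: the helper names are invented, and it uses Ollama's local HTTP endpoints /api/tags and /api/pull on the default port.

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # default address of a local Ollama server


def is_installed(wanted, installed_names):
    """True if `wanted` is installed, treating a missing tag as ':latest'."""
    if ":" not in wanted:
        wanted = wanted + ":latest"
    return wanted in installed_names


def installed_models():
    """List the model names the local Ollama server reports."""
    with urllib.request.urlopen(f"{OLLAMA_HOST}/api/tags", timeout=10) as response:
        data = json.loads(response.read().decode("utf-8"))
    return [model["name"] for model in data.get("models", [])]


def ensure_model(model="llama3.2"):
    """Pull the model if it is missing; return True when a pull was needed."""
    if is_installed(model, installed_models()):
        return False
    # Assumption: recent Ollama releases accept "model" here; older ones used "name".
    payload = json.dumps({"model": model, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        f"{OLLAMA_HOST}/api/pull",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=3600) as response:
        response.read()  # blocks until the download finishes
    return True
```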

Download files

  • Source distribution: arxivflow-0.1.0.tar.gz (13.3 kB)
  • Built distribution: arxivflow-0.1.0-py3-none-any.whl (9.8 kB, Python 3)

File details

Details for the file arxivflow-0.1.0.tar.gz.

File metadata

  • Download URL: arxivflow-0.1.0.tar.gz
  • Size: 13.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arxivflow-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       83d57ff38552eeb95dc6935cbb598b0d59969eca495fa58596377d5bb546f27a
MD5          d0d8d056e5078a91a6e61d74d62fadb1
BLAKE2b-256  b5d409b3c889a6134cb05336d128b6c2776901af6ed7866da319ef9cdb3a4eba

Provenance

The following attestation bundles were made for arxivflow-0.1.0.tar.gz:

Publisher: python-publish.yml on zjzhao1002/arXivFlow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file arxivflow-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: arxivflow-0.1.0-py3-none-any.whl
  • Size: 9.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arxivflow-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       89cf5b0ad3e631a9a9d03ccb03fc5cf05a7e95e877fcb20c38cf7a7de5de6717
MD5          4f5fe96c52956e04ac28981952070463
BLAKE2b-256  0dcf8668c1b9b35f5227cfb13f4b391536d10a927cf16405e42423ec90c7c361

Provenance

The following attestation bundles were made for arxivflow-0.1.0-py3-none-any.whl:

Publisher: python-publish.yml on zjzhao1002/arXivFlow

