Automate arXiv paper tracking with LLM-powered metadata extraction and Google Sheets sync.
arXivFlow
arXivFlow is a Python-based automation tool designed to fetch research paper metadata from arXiv, extract keywords and contact information using local LLMs (Ollama), and synchronize the results with Google Sheets.
Project Overview
- Purpose: Automates the tracking and processing of new research papers. It fetches data for specified arXiv categories, uses Ollama (Llama 3.2) to summarize and extract keywords/contact info from PDFs, and uploads the compiled data to a Google Sheet.
- Main Technologies:
- Python 3.13: Core language.
- arxiv: Library for querying the arXiv API.
- Ollama (Llama 3.2): Local LLM for intelligent extraction.
- PyMuPDF: PDF text extraction for contact information retrieval.
- pandas: Data manipulation and export to CSV, Excel, JSON, and SQLite.
- gspread: Google Sheets API interaction.
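To illustrate how PyMuPDF text extraction can feed contact retrieval, here is a minimal sketch. The helper names `first_pages_text` and `find_emails` are hypothetical and not part of arXivFlow. The regex pass is only a cheap baseline to show the idea; per the description above, the project itself hands the extracted text to Ollama.

```python
import re

# Deliberately simple pattern; real-world addresses can be more exotic.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def first_pages_text(pdf_path: str, max_pages: int = 2) -> str:
    """Return the text of the first pages, where author contacts usually appear."""
    import fitz  # PyMuPDF; imported lazily so find_emails works without it

    with fitz.open(pdf_path) as doc:
        page_count = min(max_pages, doc.page_count)
        return "\n".join(doc[i].get_text() for i in range(page_count))


def find_emails(text: str) -> list[str]:
    """Return unique, lowercased email addresses in order of first appearance."""
    seen: dict[str, None] = {}
    for match in EMAIL_RE.findall(text):
        seen.setdefault(match.lower(), None)
    return list(seen)
```

Limiting extraction to the first pages keeps the text passed downstream short, which matters when a local model has a small context window.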
Architecture & Key Files
The project follows a modular structure located in src/arxivflow/.
Core Modules
- src/arxivflow/arxivflow.py: Contains the arXivFlow class, which orchestrates the entire workflow:
  - Querying arXiv for specific categories and date ranges.
  - Downloading PDFs to the pdfs/ directory.
  - Processing results and extracting information.
  - Saving data to CSV, JSON, Excel, SQLite, or Google Sheets.
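The first step of that workflow, querying by category and date range, might look roughly like the sketch below. The function names `build_query` and `fetch_recent` are assumptions for illustration and may not match what arXivFlow does internally. The query syntax (`cat:` prefixes and a `submittedDate:[... TO ...]` range) follows the arXiv API, and the fetch uses the `arxiv` library's `Client` and `Search` classes.

```python
from datetime import datetime


def build_query(categories: list[str], start: datetime, end: datetime) -> str:
    """Combine category filters with a submission-date window (YYYYMMDDHHMM)."""
    cats = " OR ".join(f"cat:{c}" for c in categories)
    fmt = "%Y%m%d%H%M"
    window = f"submittedDate:[{start.strftime(fmt)} TO {end.strftime(fmt)}]"
    return f"({cats}) AND {window}"


def fetch_recent(categories: list[str], start: datetime, end: datetime,
                 max_results: int = 50) -> list:
    """Run the query against arXiv, newest submissions first (needs network)."""
    import arxiv  # imported lazily so build_query can be used on its own

    search = arxiv.Search(
        query=build_query(categories, start, end),
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate,
    )
    return list(arxiv.Client().results(search))
```

Keeping the query construction in a pure function makes it easy to inspect or log the exact string before any network call is made.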
- src/arxivflow/ollama_functions.py: Contains the OllamaFunctions class for interacting with the local Ollama API to extract keywords and contact details.
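As a rough picture of what talking to the local Ollama API involves, the sketch below posts a prompt to Ollama's `/api/generate` endpoint and parses a JSON reply. The prompt wording, the `keywords` field, and the function names are assumptions made for this example; the actual OllamaFunctions class may be structured differently.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(model: str, abstract: str) -> dict:
    """Build a non-streaming request that asks the model for JSON output."""
    prompt = (
        "Extract up to five keywords from the abstract below. "
        'Reply with JSON of the form {"keywords": ["..."]}.\n\n' + abstract
    )
    return {"model": model, "prompt": prompt, "format": "json", "stream": False}


def parse_keywords(response_body: str) -> list[str]:
    """Ollama returns the model output as a string in the 'response' field."""
    outer = json.loads(response_body)
    try:
        inner = json.loads(outer.get("response", ""))
    except json.JSONDecodeError:
        return []  # the model did not produce valid JSON
    keywords = inner.get("keywords") if isinstance(inner, dict) else None
    if not isinstance(keywords, list):
        return []
    return [k.strip() for k in keywords if isinstance(k, str) and k.strip()]


def extract_keywords(model: str, abstract: str, timeout: float = 120.0) -> list[str]:
    """Send the request to a running Ollama server and return the keywords."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, abstract)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return parse_keywords(resp.read().decode("utf-8"))
```

Parsing defensively matters here because a small local model will occasionally return malformed or unexpected JSON; returning an empty list lets the rest of the pipeline continue.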
Configuration & Data
- user_input.json: Configures the target Google Sheet ID, CSV filename, and credentials path.
- credentials.json: (User-provided) Google Service Account credentials.
- requirements.txt: Project dependencies.
- pdfs/: Local directory where downloaded research papers are stored.
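The exact schema of user_input.json is not shown here, so the following is only a sketch of how such a file could be loaded and validated. The key names `sheet_id`, `csv_filename`, and `credentials_file` are assumptions chosen to mirror the three settings listed above; check the repository for the real names.

```python
import json
from dataclasses import dataclass
from pathlib import Path

# Assumed key names; the project's actual user_input.json may differ.
REQUIRED_KEYS = ("sheet_id", "csv_filename", "credentials_file")


@dataclass(frozen=True)
class UserInput:
    sheet_id: str
    csv_filename: str
    credentials_file: str


def load_user_input(path: str = "user_input.json") -> UserInput:
    """Read the config file and fail early if a setting is missing or empty."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise ValueError(f"{path}: missing or empty keys: {', '.join(missing)}")
    return UserInput(**{key: str(data[key]) for key in REQUIRED_KEYS})
```

Failing at load time gives a clearer error than a failed Google Sheets call later in the run.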
Building and Running
Prerequisites
- Python 3.13+: Ensure Python is installed.
- Ollama: Install Ollama and pull the required model:
ollama pull llama3.2
- Google Cloud Setup:
- Enable Google Sheets and Google Drive APIs.
- Create a Service Account and save the JSON key as credentials.json.
- Share the target Google Sheet with the Service Account email.
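Once the Service Account has access to the sheet, a quick way to confirm the setup is a small gspread script along these lines. The helper names `records_to_rows` and `upload_rows` are hypothetical, and the choice to clear the first worksheet before writing is an assumption of this sketch, not necessarily how arXivFlow syncs data.

```python
def records_to_rows(records: list[dict], columns: list[str]) -> list[list[str]]:
    """Turn a list of dicts into a header row followed by string-valued rows."""
    header = list(columns)
    body = [[str(record.get(col, "")) for col in columns] for record in records]
    return [header] + body


def upload_rows(sheet_id: str, credentials_file: str, rows: list[list[str]]) -> None:
    """Replace the contents of the first worksheet (needs network and gspread)."""
    import gspread  # imported lazily so records_to_rows works without it

    client = gspread.service_account(filename=credentials_file)
    worksheet = client.open_by_key(sheet_id).sheet1
    worksheet.clear()
    worksheet.append_rows(rows)
```

If the sheet has not been shared with the Service Account email, `open_by_key` will fail with a permission error, which makes this a useful smoke test for the steps above.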
Setup
- Create and activate a virtual environment:
python -m venv .
source bin/activate  # On Windows: Scripts\activate
- Install dependencies:
pip install -r requirements.txt
Usage
The arXivFlow class can be used as follows:
from arxivflow import arXivFlow
import datetime
# Initialize with categories and optional Ollama model
flow = arXivFlow(
    categories=["cs.AI", "cs.LG"],
    ollama_model="llama3.2",
    max_results=50,
    start_date=datetime.datetime.now() - datetime.timedelta(days=3)
)
# Optional: Set a custom path for PDF downloads
flow.set_pdfs_path("my_papers")
# Fetch data and optionally download PDFs for contact extraction
df = flow.get_arxiv_data(download_pdfs=True)
# Save to multiple formats
flow.save_to_csv("results.csv")
flow.save_to_json("results.json")
flow.save_to_excel("results.xlsx")
flow.save_to_sqlite("results.db")
# Sync with Google Sheets
flow.save_to_google_sheet(
    sheet_id="YOUR_SHEET_ID",
    credentials_file="credentials.json"
)
Development Conventions
- Modular Logic: All core functionality resides in src/arxivflow/.
- Local AI: Keyword and contact extraction are performed locally using Ollama to ensure privacy and eliminate API costs. The tool automatically handles model verification and pulling.
- Data Persistence: Supports multiple export formats (CSV, JSON, Excel, SQLite) for flexibility.
- Type Hinting: The codebase uses Python type hints for better maintainability and clarity.
- Configurable PDF Handling: PDFs can be optionally downloaded and stored in custom directories.
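The "model verification and pulling" behaviour mentioned under Local AI could be implemented in several ways. The sketch below is one possibility, assuming a local Ollama server on its default port: it lists installed models via `/api/tags` and requests a pull via `/api/pull` if the model is absent. The function names are hypothetical, and arXivFlow may instead use the `ollama` Python client or a different check.

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # Ollama's default local address


def model_is_available(tags_body: str, model: str) -> bool:
    """Check an /api/tags response body, treating a bare name as ':latest'."""
    wanted = model if ":" in model else f"{model}:latest"
    names = {entry.get("name", "") for entry in json.loads(tags_body).get("models", [])}
    return wanted in names


def ensure_model(model: str, pull_timeout: float = 600.0) -> None:
    """Pull the model from the Ollama registry if it is not installed yet."""
    with urllib.request.urlopen(f"{OLLAMA_HOST}/api/tags", timeout=10) as resp:
        if model_is_available(resp.read().decode("utf-8"), model):
            return
    request = urllib.request.Request(
        f"{OLLAMA_HOST}/api/pull",
        # Assumption: the request body uses the "model" field, as in current Ollama docs.
        data=json.dumps({"model": model, "stream": False}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=pull_timeout) as resp:
        resp.read()  # block until the pull has finished
```

Checking before pulling avoids a slow download on every run and lets the tool fail fast when the Ollama server is not running.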
File details
Details for the file arxivflow-0.1.0.tar.gz.
File metadata
- Download URL: arxivflow-0.1.0.tar.gz
- Upload date:
- Size: 13.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 83d57ff38552eeb95dc6935cbb598b0d59969eca495fa58596377d5bb546f27a |
| MD5 | d0d8d056e5078a91a6e61d74d62fadb1 |
| BLAKE2b-256 | b5d409b3c889a6134cb05336d128b6c2776901af6ed7866da319ef9cdb3a4eba |
Provenance
The following attestation bundles were made for arxivflow-0.1.0.tar.gz:
Publisher: python-publish.yml on zjzhao1002/arXivFlow
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: arxivflow-0.1.0.tar.gz
  - Subject digest: 83d57ff38552eeb95dc6935cbb598b0d59969eca495fa58596377d5bb546f27a
  - Sigstore transparency entry: 1417091796
  - Sigstore integration time:
- Permalink: zjzhao1002/arXivFlow@d672f1d218ab3778b9e43c34c9c2876c6720a1b2
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/zjzhao1002
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@d672f1d218ab3778b9e43c34c9c2876c6720a1b2
- Trigger Event: release
File details
Details for the file arxivflow-0.1.0-py3-none-any.whl.
File metadata
- Download URL: arxivflow-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 89cf5b0ad3e631a9a9d03ccb03fc5cf05a7e95e877fcb20c38cf7a7de5de6717 |
| MD5 | 4f5fe96c52956e04ac28981952070463 |
| BLAKE2b-256 | 0dcf8668c1b9b35f5227cfb13f4b391536d10a927cf16405e42423ec90c7c361 |
Provenance
The following attestation bundles were made for arxivflow-0.1.0-py3-none-any.whl:
Publisher: python-publish.yml on zjzhao1002/arXivFlow
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: arxivflow-0.1.0-py3-none-any.whl
  - Subject digest: 89cf5b0ad3e631a9a9d03ccb03fc5cf05a7e95e877fcb20c38cf7a7de5de6717
  - Sigstore transparency entry: 1417091798
  - Sigstore integration time:
- Permalink: zjzhao1002/arXivFlow@d672f1d218ab3778b9e43c34c9c2876c6720a1b2
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/zjzhao1002
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@d672f1d218ab3778b9e43c34c9c2876c6720a1b2
- Trigger Event: release