Automate arXiv paper tracking with LLM-powered metadata extraction and Google Sheets sync.
arXivFlow 🚀
arXivFlow is a Python automation tool for discovering and tracking research papers. It fetches metadata from arXiv, runs local AI-driven analysis with Ollama (Llama 3.2), and synchronizes the results with Google Sheets and local files or databases (CSV, JSON, Excel, SQLite).
✨ Features
- Automated Retrieval: Fetch the latest papers from specific arXiv categories (e.g., cs.AI, cs.LG, hep-ph) within any date range.
- Local AI Analysis: Uses Ollama (Llama 3.2) to extract keywords and contact information (emails/affiliations) directly from PDF text. No cloud API costs or data privacy concerns.
- Intelligent PDF Handling: Automatically downloads PDFs and extracts text for deep analysis. Supports custom storage paths.
- Multi-Format Export: Save your research data to CSV, JSON, Excel, or SQLite for flexible offline analysis.
- Google Sheets Sync: Seamlessly push compiled research data to a shared Google Sheet for team collaboration.
- Type-Safe & Modular: Clean, documented Python code with full type hinting and a class-based architecture.
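To make the retrieval feature concrete, here is a minimal sketch of how a category and date-range query could be expressed against the public arXiv API. It is not taken from the arXivFlow source; the function name `build_arxiv_query_url` and its parameters are assumptions made for illustration, and the package may build its queries differently.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_arxiv_query_url(categories, start, end, max_results=20):
    """Build an arXiv API URL for papers in `categories` submitted in [start, end].

    Hypothetical helper for illustration; not part of arXivFlow.
    """
    # OR the categories together, e.g. "cat:cs.AI OR cat:hep-ph".
    cat_clause = " OR ".join(f"cat:{c}" for c in categories)
    # The arXiv API filters on submittedDate with GMT timestamps (YYYYMMDDHHMM).
    fmt = "%Y%m%d%H%M"
    date_clause = f"submittedDate:[{start.strftime(fmt)} TO {end.strftime(fmt)}]"
    params = {
        "search_query": f"({cat_clause}) AND {date_clause}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    # urlencode percent-escapes the query and turns spaces into "+".
    return f"{ARXIV_API}?{urlencode(params)}"


# Example: one week of cs.AI and hep-ph submissions.
end = datetime(2024, 1, 8, tzinfo=timezone.utc)
start = end - timedelta(days=7)
url = build_arxiv_query_url(["cs.AI", "hep-ph"], start, end)
print(url)
```

The response to such a request is an Atom feed, so a real client would still need to parse the entries and respect arXiv's rate limits.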
🛠️ Prerequisites
- Python 3.13+: Ensure you have a modern Python environment.
- Ollama: Install Ollama and download the required model:
ollama pull llama3.2
- Google Cloud Credentials:
  - Enable the Google Sheets and Google Drive APIs.
  - Create a Service Account and download the JSON key as credentials.json.
  - Ensure the service account has 'Editor' permissions on the sheet.
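Granting 'Editor' permissions usually means sharing the sheet with the service account's email address, which is stored in the key file itself. The sketch below reads that address from credentials.json; the helper name `service_account_email` is an assumption for illustration and is not part of arXivFlow.

```python
import json
from pathlib import Path


def service_account_email(credentials_path):
    """Return the client_email stored in a Google service account key file.

    Hypothetical helper for illustration; not part of arXivFlow.
    """
    data = json.loads(Path(credentials_path).read_text(encoding="utf-8"))
    # Service account keys carry "type": "service_account"; other credential
    # types (e.g. OAuth client secrets) are rejected here.
    if data.get("type") != "service_account":
        raise ValueError(f"{credentials_path} is not a service account key file")
    return data["client_email"]


# Only runs if the key file is present in the working directory.
if Path("credentials.json").exists():
    print("Share your sheet with:", service_account_email("credentials.json"))
```

Share the target sheet with the printed address as an Editor before running a sync.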
🚀 Installation
From PyPI (Recommended)
pip install arxivflow
From Source (For Development)
- Clone the repository:
git clone https://github.com/zjzhao/arXivFlow.git
cd arXivFlow
- Set up virtual environment:
python -m venv .
source bin/activate  # On Windows: Scripts\activate
- Install dependencies:
pip install -e .
📖 Usage
Quick Start
import datetime

from arxivflow import arXivFlow

# 1. Initialize the flow
flow = arXivFlow(
    categories=["cs.AI", "cs.CV"],
    ollama_model="llama3.2",
    max_results=20,
    start_date=datetime.datetime.now() - datetime.timedelta(days=7),
)

# 2. Fetch data & Extract info (Keywords/Contacts)
df = flow.get_arxiv_data(download_pdfs=True)

# 3. Save to your preferred formats
flow.save_to_csv("my_research.csv")
flow.save_to_sqlite("research.db")

# 4. Sync with Google Sheets
flow.save_to_google_sheet(
    sheet_id="YOUR_SHEET_ID",
    credentials_file="credentials.json",
)
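For readers curious what the export step in the Quick Start amounts to, the following is a rough standard-library sketch that writes the same records to CSV, JSON, and SQLite. It assumes each paper is a flat dictionary with identical keys; `export_records` and the `papers` table name are hypothetical and do not reflect how arXivFlow's `save_to_*` methods are actually implemented.

```python
import csv
import json
import sqlite3


def export_records(records, csv_path, json_path, db_path, table="papers"):
    """Write a list of flat dicts to CSV, JSON, and a SQLite table.

    Hypothetical helper for illustration; not part of arXivFlow.
    """
    columns = list(records[0].keys())

    # CSV: one header row, then one row per paper.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(records)

    # JSON: the record list as-is.
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    # SQLite: every column is stored as TEXT to keep the sketch simple.
    conn = sqlite3.connect(db_path)
    try:
        col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        with conn:  # commits on success, rolls back on error
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
            conn.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})',
                [tuple(str(r[c]) for c in columns) for r in records],
            )
    finally:
        conn.close()
```

Calling `export_records(rows, "my_research.csv", "my_research.json", "research.db")` with a list of dictionaries produces all three files in one pass.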
🏗️ Architecture
The project follows a modular structure for easy extension:
- src/arxivflow/arxivflow.py: The main orchestrator class (arXivFlow).
- src/arxivflow/ollama_functions.py: Local LLM interface using the Ollama API.
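As an illustration of what a local LLM interface over the Ollama API can look like, the sketch below builds a request for Ollama's `/api/generate` endpoint and parses a keyword list from the reply. The function names, the prompt wording, and the `{"keywords": [...]}` reply shape are assumptions for this example; the real ollama_functions.py may be structured quite differently.

```python
import json
import urllib.request

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model, pdf_text, max_chars=4000):
    """Build a non-streaming /api/generate request body (hypothetical prompt)."""
    prompt = (
        "Extract up to five keywords from the paper text below. "
        'Reply with JSON of the form {"keywords": ["..."]}.\n\n'
        + pdf_text[:max_chars]  # truncate long papers to keep the prompt small
    )
    # "format": "json" asks Ollama to constrain the reply to valid JSON.
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_keywords(api_response):
    """Pull the keyword list out of a non-streaming /api/generate response."""
    try:
        parsed = json.loads(api_response.get("response", ""))
    except json.JSONDecodeError:
        return []
    keywords = parsed.get("keywords", []) if isinstance(parsed, dict) else []
    if not isinstance(keywords, list):
        return []
    return [k.strip() for k in keywords if isinstance(k, str) and k.strip()]


def extract_keywords(pdf_text, model="llama3.2"):
    """Send the prompt to a running Ollama server and return the keywords."""
    body = json.dumps(build_payload(model, pdf_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return parse_keywords(json.load(response))
```

Keeping the payload builder and the parser separate from the network call makes both easy to test without a running Ollama server.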
📜 License
This project is licensed under the MIT License - see the LICENSE file for details.
🤝 Contributing
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
File details
Details for the file arxivflow-0.1.1.tar.gz.
File metadata
- Download URL: arxivflow-0.1.1.tar.gz
- Upload date:
- Size: 14.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 96bae5a4aeb05430edf50d846927392ae15e408be2f52e65448d164c03f3d00c |
| MD5 | 1db4cec268c385a416d015f3c0e3d657 |
| BLAKE2b-256 | d08b4edfa2452f9db8a535e9c842fc8859da6905a022c7cadd23e18b10eefc2f |
Provenance
The following attestation bundles were made for arxivflow-0.1.1.tar.gz:
Publisher: python-publish.yml on zjzhao1002/arXivFlow
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: arxivflow-0.1.1.tar.gz
  - Subject digest: 96bae5a4aeb05430edf50d846927392ae15e408be2f52e65448d164c03f3d00c
- Sigstore transparency entry: 1418193053
- Sigstore integration time:
- Permalink: zjzhao1002/arXivFlow@003dd18b47d815838e49e811161dd35a94bb6598
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/zjzhao1002
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@003dd18b47d815838e49e811161dd35a94bb6598
- Trigger Event: release
File details
Details for the file arxivflow-0.1.1-py3-none-any.whl.
File metadata
- Download URL: arxivflow-0.1.1-py3-none-any.whl
- Upload date:
- Size: 11.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 99a72afb81b8be0cd045b10186bfdb43fd9f699416713de5a3db39e241d43836 |
| MD5 | 27d903f70fc8cdd92b03fb5013f86e21 |
| BLAKE2b-256 | b4a7f2b7fe5b092f9d34c4099da81b2d12d6d78b40d25808c0df9e674fc4dd31 |
Provenance
The following attestation bundles were made for arxivflow-0.1.1-py3-none-any.whl:
Publisher: python-publish.yml on zjzhao1002/arXivFlow
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: arxivflow-0.1.1-py3-none-any.whl
  - Subject digest: 99a72afb81b8be0cd045b10186bfdb43fd9f699416713de5a3db39e241d43836
- Sigstore transparency entry: 1418193082
- Sigstore integration time:
- Permalink: zjzhao1002/arXivFlow@003dd18b47d815838e49e811161dd35a94bb6598
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/zjzhao1002
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@003dd18b47d815838e49e811161dd35a94bb6598
- Trigger Event: release