A professional crawler for American Chemical Society papers with a modern web dashboard

ACS Paper Crawler

A professional web-based crawler for American Chemical Society (ACS) papers with a modern dashboard and analytics.


English

Features

  • 43 Built-in Journals: Pre-configured ACS journal list
  • Real-time Crawling: Extract papers from ACS Publications
  • Complete Metadata: Title, DOI, authors, abstract, keywords, citation info
  • Modern Dashboard: Interactive charts and statistics
  • Advanced Filtering: Search by title, author, journal, year
  • Background Jobs: Async crawling with progress tracking
  • RESTful API: Full API documentation at /docs
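
FastAPI applications normally publish a machine-readable OpenAPI schema at /openapi.json alongside the interactive /docs page, unless the app disables it. As a minimal sketch, assuming the server is running on http://localhost:8000, the following lists the routes the API exposes without guessing any endpoint names. The helpers `list_routes` and `fetch_spec` are illustrative names and not part of acs_crawler.

```python
import json
from urllib.request import urlopen

# HTTP verbs that can appear as operation keys in an OpenAPI "paths" entry.
HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def list_routes(spec: dict) -> list[tuple[str, str]]:
    """Return (METHOD, path) pairs from an OpenAPI schema, sorted by path."""
    routes = []
    for path, operations in spec.get("paths", {}).items():
        for method in operations:
            # Skip non-operation keys such as "parameters" or "summary".
            if method.lower() in HTTP_METHODS:
                routes.append((method.upper(), path))
    return sorted(routes, key=lambda route: (route[1], route[0]))


def fetch_spec(base_url: str = "http://localhost:8000") -> dict:
    """Download the OpenAPI schema served by a running FastAPI app."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as resp:
        return json.load(resp)


# Usage, with the server running:
#   for method, path in list_routes(fetch_spec()):
#       print(f"{method:6} {path}")
```

Reading the schema first avoids hard-coding endpoint paths that may change between releases.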

Quick Start

Option 1: Docker (Recommended)

# Start with Docker Compose
docker compose up -d

# Access at http://localhost:8000

# Stop
docker compose down
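
After `docker compose up -d` returns, the container may still need a few seconds before the application accepts connections. The following is a minimal readiness check, assuming the default address http://localhost:8000 shown above. The names `http_ok` and `wait_until_up`, the attempt count, and the delay are illustrative choices, not project settings.

```python
import time
from urllib.request import urlopen


def http_ok(url: str) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urlopen(url, timeout=3) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False


def wait_until_up(url="http://localhost:8000", attempts=20, delay=1.5,
                  probe=http_ok, sleep=time.sleep) -> bool:
    """Poll the URL until it responds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # pause before the next try
    return False


# Usage:
#   if not wait_until_up():
#       raise SystemExit("Server did not come up; check `docker compose logs`.")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.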

Option 2: Local Installation

# Install dependencies
pip install -r requirements.txt

# (Optional) Configure ChromeDriver path
# Copy .env.example to .env and set CHROMEDRIVER_PATH if needed
cp .env.example .env
# Edit .env to set your ChromeDriver path (Windows users especially)

# Run the application
python run.py

# Then open http://localhost:8000 in your browser

Option 3: Install from PyPI

pip install acs_crawler

# Run the web interface
python -m uvicorn acs_crawler.api.main:app --host 0.0.0.0 --port 8000

Requirements

  • Docker: 20.10+ (for Docker installation), OR
  • Python: 3.9+ (for local installation)
  • Chrome browser: Latest stable version
  • ChromeDriver: Auto-downloaded by webdriver-manager (or configure manually)

Configuration

ChromeDriver Path (Optional)

By default, ChromeDriver is automatically downloaded. If you want to use your own ChromeDriver:

  1. Copy .env.example to .env
  2. Set CHROMEDRIVER_PATH to your ChromeDriver executable path

Examples:

# Windows
CHROMEDRIVER_PATH=C:\Program Files\Google\Chrome\Application\chromedriver-win64\chromedriver.exe

# Linux/Mac
CHROMEDRIVER_PATH=/usr/local/bin/chromedriver

# WSL (Windows path from WSL)
CHROMEDRIVER_PATH=/mnt/c/Program Files/Google/Chrome/Application/chromedriver-win64/chromedriver.exe
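
The sketch below shows one way an optional path with an automatic fallback could be resolved: use CHROMEDRIVER_PATH when it points at an existing file, otherwise let webdriver-manager download a driver. It is not the project's actual code. `resolve_chromedriver` and `build_driver` are hypothetical names, the headless flag is an assumption, and `build_driver` requires the selenium and webdriver-manager packages.

```python
import os
from typing import Mapping, Optional


def resolve_chromedriver(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the configured driver path, or None to request auto-download."""
    # Tolerate surrounding whitespace and quotes, which are common in .env files.
    raw = env.get("CHROMEDRIVER_PATH", "").strip().strip('"').strip("'")
    if not raw:
        return None
    if not os.path.isfile(raw):
        raise FileNotFoundError(
            f"CHROMEDRIVER_PATH does not point to a file: {raw}"
        )
    return raw


def build_driver(env: Mapping[str, str] = os.environ):
    """Create a Chrome WebDriver using the resolved driver path."""
    # Imported lazily so resolve_chromedriver works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    path = resolve_chromedriver(env)
    if path is None:
        # Fallback: webdriver-manager downloads a driver and returns its path.
        from webdriver_manager.chrome import ChromeDriverManager
        path = ChromeDriverManager().install()

    options = Options()
    options.add_argument("--headless=new")  # assumption: run without a window
    return webdriver.Chrome(service=Service(path), options=options)
```

Failing early on a path that does not exist makes a mistyped Windows or WSL path easier to diagnose than a later Selenium start-up error.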

Known Limitations

  • No Search URL Crawling: ACS search pages are protected by Cloudflare Turnstile CAPTCHA
    • Automated tools (Selenium, curl, etc.) are blocked
    • Workaround: Use journal issue URLs, which can be crawled normally
    • Local filtering is available in the Papers UI after crawling
  • Performance: Selenium-based, so slower than HTTP-only crawlers (~3-5s startup per job)
  • Rate Limiting: No automatic limits; space out jobs manually (1-2 concurrent max)
  • Data Extraction: Only public metadata (no paywalled content, no author affiliations)
  • Scalability: Sequential job processing and SQLite storage (not suited to production use)
  • ACS Only: Designed for ACS journals and relies on the current page structure
  • Legal: Users are responsible for complying with the ACS Terms of Service
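
The search-page and rate-limiting items above suggest two precautions when queuing work: skip search URLs, and leave a pause between submissions. A minimal sketch follows. The URL test assumes ACS search pages live under /action/doSearch, which should be checked against the current site. `submit` is a placeholder for however you start a job (the web UI or the API), because the job endpoint is not described here. The 30-second default is an arbitrary illustration.

```python
import time
from typing import Callable, Iterable
from urllib.parse import urlparse


def is_search_url(url: str) -> bool:
    """Heuristic: treat /action/doSearch paths as search pages (assumption)."""
    return "/action/dosearch" in urlparse(url).path.lower()


def run_spaced(urls: Iterable[str], submit: Callable[[str], object],
               delay_seconds: float = 30.0,
               sleep: Callable[[float], object] = time.sleep) -> list[str]:
    """Submit non-search URLs one at a time, pausing between submissions."""
    submitted = []
    for url in urls:
        if is_search_url(url):
            continue  # search pages are CAPTCHA-protected; skip them
        if submitted:
            sleep(delay_seconds)  # pause only between submissions
        submit(url)
        submitted.append(url)
    return submitted
```

This spaces out submissions only; it does not wait for a job to finish, so choose a delay longer than a typical job if jobs should not overlap.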

See full documentation for workarounds and best practices.

Documentation

Full documentation available in the docs/ directory:

cd docs
make html
# Open docs/_build/html/index.html

Or read online: Documentation

Screenshots

  • Dashboard: statistics and charts
  • Papers: advanced paper filtering
  • Paper Detail: detailed paper view
  • Jobs: job management with cancellation

License & Copyright

Copyright (c) 2025 ACS Paper Crawler Contributors

This software is for educational and research purposes only.

  • ✅ Academic & Educational Use
  • ✅ Research & Study
  • ❌ Commercial Use (requires permission)
  • ⚠️ Respect ACS Terms of Service

See LICENSE and full documentation for details.


Project Structure

ACS_crawler/
├── src/acs_crawler/      # Source code
├── docs/                 # Documentation
├── data/                 # Database
├── logs/                 # Logs
├── run.py                # Entry point
└── README.md             # This file

Technology Stack

  • Backend: FastAPI, SQLite, Selenium, BeautifulSoup4
  • Frontend: Bootstrap 5, Chart.js, Vanilla JavaScript


Contributing

Contributions welcome! Please see CONTRIBUTING.md.

Happy Crawling! 🚀
