A professional crawler for American Chemical Society papers with a modern web dashboard

ACS Paper Crawler

A professional web-based crawler for American Chemical Society (ACS) papers with a modern dashboard and analytics.

English

Features

43 Built-in Journals: Pre-configured ACS journal list
Real-time Crawling: Extract papers from ACS Publications
Complete Metadata: Title, DOI, authors, abstract, keywords, citation info
Modern Dashboard: Interactive charts and statistics
Advanced Filtering: Search by title, author, journal, year
Background Jobs: Async crawling with progress tracking
RESTful API: Full API documentation at /docs
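
The API feature above can be explored programmatically. The following is a minimal sketch, not part of the project: it assumes the app keeps FastAPI's default schema location (`/openapi.json`), which this README does not confirm (it only mentions `/docs`). The helper names `list_routes` and `fetch_schema` are illustrative.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # address used in the Quick Start section

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_routes(schema: dict) -> list:
    """Return sorted (METHOD, path) pairs found in an OpenAPI schema dict."""
    routes = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Path items may also hold non-method keys such as "parameters".
            if method in HTTP_METHODS:
                routes.append((method.upper(), path))
    return sorted(routes)


def fetch_schema(base_url: str = BASE_URL) -> dict:
    """Download the schema from a running instance.

    Assumption: the schema is served at /openapi.json (FastAPI's default).
    """
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)
```

With the application running, `list_routes(fetch_schema())` should return the method and path pairs that the server exposes, which is a quick way to see the available endpoints without opening the browser.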

Quick Start

Option 1: Docker (Recommended)

# Start with Docker Compose
docker compose up -d

# Access at http://localhost:8000

# Stop
docker compose down

Option 2: Local Installation

# Install dependencies
pip install -r requirements.txt

# (Optional) Use your own ChromeDriver: copy the template file
cp .env.example .env
# Then edit .env and set CHROMEDRIVER_PATH (Windows users especially)

# Run the application
python run.py

# Then open http://localhost:8000 in your browser

Option 3: Install from PyPI

pip install acs-crawler

# Run the web interface
python -m uvicorn acs_crawler.api.main:app --host 0.0.0.0 --port 8000

Requirements

Docker: 20.10+ (for Docker installation), OR
Python: 3.9+ (for local installation)
Chrome browser: Latest stable version
ChromeDriver: Auto-downloaded by webdriver-manager (or configure manually)

Configuration

ChromeDriver Path (Optional)

By default, ChromeDriver is automatically downloaded. If you want to use your own ChromeDriver:

Copy .env.example to .env
Set CHROMEDRIVER_PATH to your ChromeDriver executable path

Examples:

# Windows
CHROMEDRIVER_PATH=C:\Program Files\Google\Chrome\Application\chromedriver-win64\chromedriver.exe

# Linux/Mac
CHROMEDRIVER_PATH=/usr/local/bin/chromedriver

# WSL (Windows path from WSL)
CHROMEDRIVER_PATH=/mnt/c/Program Files/Google/Chrome/Application/chromedriver-win64/chromedriver.exe
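
To illustrate how a setting like this might be resolved, here is a minimal sketch using only the standard library. It is not the project's actual loader: the function names are hypothetical, and the precedence (process environment first, then the `.env` file) is an assumption. Returning `None` stands for "fall back to the automatic download".

```python
import os
from pathlib import Path
from typing import Optional


def read_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines; blank lines and '#' comments are skipped."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Spaces inside the value are kept (e.g. "Program Files");
        # optional surrounding quotes are dropped.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_chromedriver_path(env_file: str = ".env") -> Optional[str]:
    """Return a usable driver path, or None to signal 'use auto-download'."""
    candidate = os.environ.get("CHROMEDRIVER_PATH") or read_env_file(env_file).get(
        "CHROMEDRIVER_PATH"
    )
    if candidate and Path(candidate).is_file():
        return candidate
    return None
```

Checking that the file exists before using it catches a common mistake with the Windows and WSL examples above, where a mistyped directory would otherwise only surface when the browser fails to start.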

Known Limitations

No Search URL Crawling: ACS search pages are protected by a Cloudflare Turnstile CAPTCHA
- Automated tools (Selenium, curl, etc.) are blocked
- Workaround: Use journal issue URLs, which crawl normally
- Local filtering is available in the Papers UI after crawling
Performance: Selenium-based, so slower than HTTP-only crawlers (~3-5 s startup per job)
Rate Limiting: No automatic limits; space out jobs manually (1-2 concurrent jobs at most)
Data Extraction: Public metadata only (no paywalled content, no author affiliations)
Scalability: Sequential job processing and SQLite storage (not intended for production use)
ACS Only: Designed for ACS journals and dependent on the current page structure
Legal: Users are responsible for complying with the ACS Terms of Service
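
Because the tool applies no automatic rate limits, spacing has to come from the caller. The sketch below shows one way to do that under stated assumptions: `run_spaced` is a hypothetical helper, each job is any zero-argument callable (how a crawl job is actually submitted is not specified here), and the 5-second default gap is an arbitrary illustrative value.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_spaced(jobs, max_concurrent=2, min_gap_seconds=5.0):
    """Run callables with bounded concurrency and a pause between submissions.

    At most `max_concurrent` jobs are in flight at once, and consecutive
    submissions are separated by at least `min_gap_seconds`.
    Results are returned in submission order.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = []
        for index, job in enumerate(jobs):
            if index:
                time.sleep(min_gap_seconds)  # wait before handing over the next job
            futures.append(pool.submit(job))
        # result() re-raises any exception that occurred inside a job.
        return [future.result() for future in futures]
```

Keeping `max_concurrent` at 1 or 2 matches the guidance in the list above; a longer gap is the safer choice when crawling many issues of the same journal.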

See the full documentation for workarounds and best practices.

Documentation

Full documentation is available in the docs/ directory:

cd docs
make html
# Open docs/_build/html/index.html

Or read online: Documentation

Screenshots

Dashboard: statistics and charts

Papers: advanced paper filtering

Paper Detail: detailed paper view

Jobs: job management with cancellation

License & Copyright

This software is for educational and research purposes only.

✅ Academic & Educational Use
✅ Research & Study
❌ Commercial Use (requires permission)
⚠️ Respect ACS Terms of Service

See LICENSE and full documentation for details.

Project Structure

ACS_crawler/
├── src/acs_crawler/      # Source code
├── docs/                 # Documentation
├── data/                 # Database
├── logs/                 # Logs
├── run.py                # Entry point
└── README.md             # This file

Technology Stack

Backend: FastAPI, SQLite, Selenium, BeautifulSoup4
Frontend: Bootstrap 5, Chart.js, Vanilla JavaScript

Contributing

Contributions are welcome! Please see CONTRIBUTING.md.

Support

Happy Crawling! 🚀

Release history

0.1.3: Oct 21, 2025
0.1.2: Oct 21, 2025
0.1.1: Oct 21, 2025 (this version)

Download files

Source Distribution: acs-crawler-0.1.1.tar.gz (58.7 kB), uploaded Oct 21, 2025
Built Distribution: acs_crawler-0.1.1-py3-none-any.whl (64.3 kB, Python 3), uploaded Oct 21, 2025

File details

Details for the file acs-crawler-0.1.1.tar.gz.

File metadata

Download URL: acs-crawler-0.1.1.tar.gz
Upload date: Oct 21, 2025
Size: 58.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.13

File hashes

Hashes for acs-crawler-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`80850badef68877409c21a663becbc28d817a49bcca3f1b5e23cd467541a0044`
MD5	`a7745f4a892d67dfc4015c2e29a917d5`
BLAKE2b-256	`784b996370c187327d51a0c66717413bfc4c0e273cd6df19b5ba74a5d21f7a97`

File details

Details for the file acs_crawler-0.1.1-py3-none-any.whl.

File metadata

Download URL: acs_crawler-0.1.1-py3-none-any.whl
Upload date: Oct 21, 2025
Size: 64.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.13

File hashes

Hashes for acs_crawler-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fd73c0cd9bbbfb140508ed6e1650d81a2be7310e4ae0c02c9a65e326a375c00c`
MD5	`1c89a0e24bdc0cf62ccae3ea343f862e`
BLAKE2b-256	`761ea12341f5534d43363089258556587d6feee9a053cb8a9809d6bddb97fe08`
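
The published digests can be used to check a downloaded file before installing it. This is a minimal sketch using the standard library's `hashlib`; the SHA256 values are copied from the tables above, while the helper names `sha256_of` and `verify` are illustrative.

```python
import hashlib
import os

# SHA256 digests as listed in the "File hashes" tables for release 0.1.1.
EXPECTED_SHA256 = {
    "acs-crawler-0.1.1.tar.gz":
        "80850badef68877409c21a663becbc28d817a49bcca3f1b5e23cd467541a0044",
    "acs_crawler-0.1.1-py3-none-any.whl":
        "fd73c0cd9bbbfb140508ed6e1650d81a2be7310e4ae0c02c9a65e326a375c00c",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, filename=None):
    """Compare a local file against the digest published for `filename`.

    `filename` defaults to the base name of `path`; a KeyError is raised
    for names that are not in EXPECTED_SHA256.
    """
    name = filename or os.path.basename(path)
    return sha256_of(path) == EXPECTED_SHA256[name]
```

For example, `verify("acs-crawler-0.1.1.tar.gz")` returns `True` only when the local file matches the digest listed for the source distribution.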
