A robust crawler library based on a proxy pool and a token bucket, with support for both browser and requests
RobustCrawl
Install
pip install robust_crawl
playwright install chrome
export OPENAI_API_KEY="yourkey"
export OPENAI_API_BASE="your base" # optional
brew install go
brew install mihomo
Place your imported proxy file (.yml / .yaml) in ./config
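The install steps above leave several things that are easy to miss: an environment variable, two command-line tools, and a proxy file. The following is a minimal preflight sketch, not part of robust_crawl. The function name `preflight` is hypothetical, and it assumes the executables are named `playwright` and `mihomo` on your PATH.

```python
import os
import shutil
from pathlib import Path


def preflight(env=None, which=shutil.which, config_dir="./config"):
    """Return a list of problems; an empty list means the setup looks complete.

    Hypothetical helper: it only checks the prerequisites named in the
    install steps and does not call into robust_crawl itself.
    """
    env = os.environ if env is None else env
    problems = []

    # OPENAI_API_KEY is required; OPENAI_API_BASE is optional, so it is not checked.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    # Assumed executable names for the Playwright CLI and the mihomo proxy core.
    for tool in ("playwright", "mihomo"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    # At least one .yml / .yaml proxy file is expected in the config directory.
    cfg = Path(config_dir)
    proxy_files = []
    if cfg.is_dir():
        proxy_files = [p for p in cfg.iterdir() if p.suffix in (".yml", ".yaml")]
    if not proxy_files:
        problems.append(f"no .yml/.yaml proxy file in {config_dir}")

    return problems
```

Calling `preflight()` with no arguments checks the current environment; printing each returned string gives a quick list of what is still missing.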
Config
Save the following as ./config/robust_crawl_config.json:
{
    "wait_sec": 10,
    "max_concurrent_requests": 500,
    "GPT": {
        "model_type": "gpt-3.5-turbo"
    },
    "TokenBucket": {
        "tokens_per_minute": 20,
        "bucket_capacity": 5,
        "url_specific_tokens": {
            "export.arxiv": {
                "tokens_per_minute": 19,
                "bucket_capacity": 1
            }
        }
    },
    "Proxy": {
        "is_enabled": true,
        "start_port": 33333,
        "proxies_dir": "./proxies"
    },
    "ContextPool": {
        "num_contexts": 2,
        "work_contexts": 15,
        "have_proxy": true,
        "duplicate_proxies": false,
        "ensure_none_proxies": true,
        "download_pdf": false,
        "downloads_path": "./output/browser_downloads",
        "preference_path": "./output/broswer_config",
        "context_lifetime": 60,
        "context_cooling_time": 1
    }
}
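To give a feel for what the TokenBucket settings mean, here is a generic token bucket sketch. It is not the library's implementation. The class `TokenBucket`, the helper `limits_for`, and the rule that keys in `url_specific_tokens` are matched as substrings of the URL are all assumptions made for illustration.

```python
import json
import time


class TokenBucket:
    """Generic token bucket: refills continuously, holds at most `capacity` tokens."""

    def __init__(self, tokens_per_minute, bucket_capacity, clock=time.monotonic):
        self.rate = tokens_per_minute / 60.0  # refill rate in tokens per second
        self.capacity = bucket_capacity       # maximum burst size
        self.tokens = float(bucket_capacity)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return True on success, False otherwise."""
        now = self.clock()
        # Add the tokens earned since the last call, capped at the capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def limits_for(url, bucket_cfg):
    """Pick (tokens_per_minute, bucket_capacity) for a URL.

    Assumption: url_specific_tokens keys are matched as substrings of the URL.
    """
    for pattern, limits in bucket_cfg.get("url_specific_tokens", {}).items():
        if pattern in url:
            return limits["tokens_per_minute"], limits["bucket_capacity"]
    return bucket_cfg["tokens_per_minute"], bucket_cfg["bucket_capacity"]


# The TokenBucket section of the config, inlined so the sketch is self-contained.
bucket_cfg = json.loads("""
{
  "tokens_per_minute": 20,
  "bucket_capacity": 5,
  "url_specific_tokens": {
    "export.arxiv": {"tokens_per_minute": 19, "bucket_capacity": 1}
  }
}
""")
```

Under these assumptions, `limits_for("http://export.arxiv.org/api/query", bucket_cfg)` returns `(19, 1)`, so requests to that host could never burst beyond a single request, while any other URL falls back to the default of 20 tokens per minute with bursts of up to 5.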