A library for robust crawling based on a proxy pool and a token bucket, with support for both browser-based and requests-based fetching.
RobustCrawl
Install
pip install robust_crawl
playwright install chrome
export OPENAI_API_KEY="yourkey"
export OPENAI_API_BASE="your base" # optional
brew install go
brew install mihomo
Place your imported proxy file (.yml / .yaml) in ./config
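The install steps above leave three things that can be checked before a first run: the OPENAI_API_KEY variable, the mihomo executable, and a proxy file in ./config. The following is a minimal sketch of such a check. The helper name find_setup_problems is hypothetical and not part of robust_crawl, and it assumes the Homebrew package installs a binary named mihomo on PATH.

```python
import os
import shutil
from pathlib import Path


def find_setup_problems(env=None, config_dir="./config", which=shutil.which):
    """Return a list of setup problems; an empty list means all checks passed.

    Hypothetical helper for illustration, not part of robust_crawl.
    """
    env = os.environ if env is None else env
    problems = []

    # OPENAI_API_KEY is required; OPENAI_API_BASE is optional, so it is not checked.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    # Assumption: the mihomo binary is on PATH under the name "mihomo".
    if which("mihomo") is None:
        problems.append("mihomo executable not found on PATH")

    # At least one imported proxy file (.yml / .yaml) should sit in the config directory.
    config_path = Path(config_dir)
    proxy_files = [p for p in config_path.glob("*") if p.suffix in {".yml", ".yaml"}]
    if not proxy_files:
        problems.append(f"no .yml/.yaml proxy file found in {config_path}")

    return problems


if __name__ == "__main__":
    for problem in find_setup_problems():
        print("setup problem:", problem)
```

Running the script from the project root prints one line per missing prerequisite and prints nothing when everything is in place.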
Config
Save the following as ./config/robust_crawl_config.json:
{
    "wait_sec": 10,
    "max_concurrent_requests": 500,
    "GPT": {
        "model_type": "gpt-3.5-turbo"
    },
    "TokenBucket": {
        "tokens_per_minute": 20,
        "bucket_capacity": 5,
        "url_specific_tokens": {
            "export.arxiv": {
                "tokens_per_minute": 19,
                "bucket_capacity": 1
            }
        }
    },
    "Proxy": {
        "is_enabled": true,
        "start_port": 33333,
        "proxies_dir": "./proxies"
    },
    "ContextPool": {
        "num_contexts": 2,
        "work_contexts": 15,
        "have_proxy": true,
        "duplicate_proxies": false,
        "ensure_none_proxies": true,
        "download_pdf": false,
        "downloads_path": "./output/browser_downloads",
        "preference_path": "./output/broswer_config",
        "context_lifetime": 60,
        "context_cooling_time": 1
    }
}
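To illustrate what the TokenBucket settings mean, here is a minimal sketch of a generic token bucket driven by tokens_per_minute and bucket_capacity. It is not the library's implementation. The class TokenBucket and the helper bucket_settings_for are hypothetical names, and two behaviours are assumptions: the bucket starts full, and keys under url_specific_tokens (such as "export.arxiv") are matched as substrings of the request URL.

```python
import time

# The "TokenBucket" section of the config above, as a Python dict.
TOKEN_BUCKET_CFG = {
    "tokens_per_minute": 20,
    "bucket_capacity": 5,
    "url_specific_tokens": {
        "export.arxiv": {"tokens_per_minute": 19, "bucket_capacity": 1},
    },
}


class TokenBucket:
    """Generic token bucket (illustrative sketch, not robust_crawl's own class)."""

    def __init__(self, tokens_per_minute, bucket_capacity, clock=time.monotonic):
        self.rate = tokens_per_minute / 60.0   # tokens refilled per second
        self.capacity = bucket_capacity        # maximum burst size
        self.tokens = float(bucket_capacity)   # assumption: the bucket starts full
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return True on success, False otherwise."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def bucket_settings_for(url, cfg):
    """Return (tokens_per_minute, bucket_capacity) for a URL.

    Assumption: url_specific_tokens keys are matched as substrings of the URL.
    """
    for pattern, override in cfg.get("url_specific_tokens", {}).items():
        if pattern in url:
            return override["tokens_per_minute"], override["bucket_capacity"]
    return cfg["tokens_per_minute"], cfg["bucket_capacity"]


if __name__ == "__main__":
    rate, capacity = bucket_settings_for("http://export.arxiv.org/api/query", TOKEN_BUCKET_CFG)
    bucket = TokenBucket(rate, capacity)
    print("first request allowed:", bucket.try_acquire())
    print("immediate second request allowed:", bucket.try_acquire())
```

Under these assumptions, the default settings allow a burst of up to 5 requests followed by a sustained rate of 20 per minute, while the export.arxiv override allows no burst beyond a single request and refills at 19 per minute.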