A library for robust crawling based on a proxy pool and a token bucket, with support for both browser-based and requests-based fetching
Project description
RobustCrawl
Install
pip install robust_crawl
playwright install chrome
export OPENAI_API_KEY="yourkey"
export OPENAI_API_BASE="your base" # optional
brew install go
brew install mihomo
Place your imported proxy file (.yml / .yaml) in ./config
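The install steps above can be sanity-checked before the first run. The following is only a sketch and is not part of robust_crawl: the function name `preflight` and its parameters are hypothetical, and it assumes that the `mihomo` binary should be on PATH and that at least one .yml / .yaml file should sit directly in ./config.

```python
import os
import shutil
from pathlib import Path


def preflight(config_dir="./config", env=None, which=shutil.which):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper, not a robust_crawl API. `env` and `which` are
    injectable so the checks can be exercised without touching the real system.
    """
    env = os.environ if env is None else env
    problems = []

    # OPENAI_API_KEY is required by the install steps; OPENAI_API_BASE is optional.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    # Assumption: `brew install mihomo` puts a `mihomo` executable on PATH.
    if which("mihomo") is None:
        problems.append("mihomo executable not found on PATH")

    # Assumption: the imported proxy file lives directly inside config_dir.
    cfg = Path(config_dir)
    proxy_files = []
    if cfg.is_dir():
        proxy_files = [p for p in cfg.iterdir() if p.suffix in {".yml", ".yaml"}]
    if not proxy_files:
        problems.append(f"no .yml / .yaml proxy file found in {config_dir}")

    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("missing:", problem)
```

Running the file directly prints one line per missing prerequisite and prints nothing when all three checks pass.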
Config
Save the following as ./config/robust_crawl_config.json:
{
    "wait_sec": 10,
    "max_concurrent_requests": 500,
    "GPT": {
        "model_type": "gpt-3.5-turbo"
    },
    "TokenBucket": {
        "tokens_per_minute": 20,
        "bucket_capacity": 5,
        "url_specific_tokens": {
            "export.arxiv": {
                "tokens_per_minute": 19,
                "bucket_capacity": 1
            }
        }
    },
    "Proxy": {
        "is_enabled": true,
        "start_port": 33333,
        "proxies_dir": "./proxies"
    },
    "ContextPool": {
        "num_contexts": 2,
        "work_contexts": 15,
        "have_proxy": true,
        "duplicate_proxies": false,
        "ensure_none_proxies": true,
        "download_pdf": false,
        "downloads_path": "./output/browser_downloads",
        "preference_path": "./output/broswer_config",
        "context_lifetime": 60,
        "context_cooling_time": 1
    }
}
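To make the TokenBucket settings concrete, here is a minimal sketch of a generic token bucket driven by `tokens_per_minute` and `bucket_capacity`. It is not the library's implementation: the class `TokenBucket`, the helper `bucket_settings`, and the rule that a `url_specific_tokens` key applies when it appears as a substring of the URL are all assumptions made for illustration.

```python
import time

# Mirrors the "TokenBucket" section of the config above.
token_cfg = {
    "tokens_per_minute": 20,
    "bucket_capacity": 5,
    "url_specific_tokens": {
        "export.arxiv": {"tokens_per_minute": 19, "bucket_capacity": 1}
    },
}


class TokenBucket:
    """Generic token bucket: refills continuously, never holds more than `capacity`."""

    def __init__(self, tokens_per_minute, bucket_capacity, clock=time.monotonic):
        self.rate = tokens_per_minute / 60.0  # tokens added per second
        self.capacity = bucket_capacity
        self.tokens = float(bucket_capacity)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return True on success, False otherwise."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def bucket_settings(cfg, url):
    """Return (tokens_per_minute, bucket_capacity) for a URL.

    Assumption: an override applies when its key is a substring of the URL.
    """
    for key, override in cfg.get("url_specific_tokens", {}).items():
        if key in url:
            return override["tokens_per_minute"], override["bucket_capacity"]
    return cfg["tokens_per_minute"], cfg["bucket_capacity"]


if __name__ == "__main__":
    rate, capacity = bucket_settings(token_cfg, "http://export.arxiv.org/api/query")
    bucket = TokenBucket(rate, capacity)
    print(rate, capacity, bucket.try_acquire(), bucket.try_acquire())
```

Under this reading, the default bucket allows a burst of up to 5 requests and then about one request every 3 seconds, while the `export.arxiv` override allows no burst beyond a single request and refills slightly more slowly.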
Download files
Source Distribution
robust_crawl-0.1.4.tar.gz (14.1 kB)
Built Distribution
robust_crawl-0.1.4-py3-none-any.whl (16.4 kB)
File details
Details for the file robust_crawl-0.1.4.tar.gz.
File metadata
- Download URL: robust_crawl-0.1.4.tar.gz
- Upload date:
- Size: 14.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | a4d9669186a38dfbc29773a782b57fcafe0bd6de8a494f2e400367b9cd4fa1e8
MD5 | f8f52be11cb922808b2eb42e38b73567
BLAKE2b-256 | 81c4ef56c20a93573eea9a447bf84d22f13a01e434d458b9eac7f6a0d083edc7
File details
Details for the file robust_crawl-0.1.4-py3-none-any.whl.
File metadata
- Download URL: robust_crawl-0.1.4-py3-none-any.whl
- Upload date:
- Size: 16.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | e209cb1ac2c10eb662acf1d2b87ec544f82f3cae4a4a709e273f89f0e4f9dcf3
MD5 | db8dd2597952af7b760b9c476da96f7d
BLAKE2b-256 | 1c70590af9021d539c5cb36fe0e6f2a6168d52257a8f64f3c30f8f43f6afd2ce