RobustCrawl
A library for building robust crawlers on top of a proxy pool and a token bucket rate limiter. It supports both browser-based and requests-based fetching.
Install
pip install robust_crawl
playwright install chrome
export OPENAI_API_KEY="yourkey"
export OPENAI_API_BASE="your base" # optional
brew install go
brew install mihomo
Place the imported proxy file (.yml / .yaml) in ./config
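The install steps above can be sanity-checked before the first run. The following is a minimal sketch, not part of robust_crawl: the function name `preflight` is hypothetical, and it assumes that `mihomo` is reachable on PATH after the brew install and that the proxy file sits directly inside ./config.

```python
import os
import shutil
from pathlib import Path


def preflight(config_dir="./config", env=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []

    # The install steps export OPENAI_API_KEY; OPENAI_API_BASE is optional, so it is not checked.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    # Assumption: the mihomo binary installed above is available on PATH.
    if shutil.which("mihomo") is None:
        problems.append("mihomo executable not found on PATH")

    # At least one imported proxy file (.yml / .yaml) should be present in the config directory.
    config_path = Path(config_dir)
    proxy_files = []
    if config_path.is_dir():
        proxy_files = [p for p in config_path.iterdir() if p.suffix in (".yml", ".yaml")]
    if not proxy_files:
        problems.append(f"no .yml/.yaml proxy file found in {config_dir}")

    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

Running the script prints one warning per missing prerequisite and prints nothing when everything is in place.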
Config
Save the following as ./config/robust_crawl_config.json:
{
    "max_concurrent_requests": 500,
    "GPT": {
        "model_type": "gpt-3.5-turbo"
    },
    "TokenBucket": {
        "tokens_per_minute": 20,
        "bucket_capacity": 5,
        "url_specific_tokens": {
            "export.arxiv": {
                "tokens_per_minute": 19,
                "bucket_capacity": 1
            }
        }
    },
    "Proxy": {
        "is_enabled": true,
        "core_type": "mihomo",
        "start_port": 33333,
        "config_paths": [
            "the relative path to the proxy file, imported by clash-verge core",
            "./config/proxy.yaml"
        ]
    },
    "ContextPool": {
        "num_contexts": 5,
        "work_contexts": 15,
        "have_proxy": true,
        "duplicate_proxies": false,
        "ensure_none_proxies": true,
        "download_pdf": false,
        "downloads_path": "./output/browser_downloads",
        "preference_path": "./output/broswer_config",
        "context_lifetime": 60,
        "context_cooling_time": 1
    }
}
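To make the "TokenBucket" settings concrete, here is a minimal sketch of how a token bucket could interpret `tokens_per_minute`, `bucket_capacity`, and `url_specific_tokens`. It is an illustration of the general technique, not robust_crawl's actual implementation: the class and function names are hypothetical, the bucket is assumed to start full, and the rule that a key such as "export.arxiv" matches any URL containing it is also an assumption.

```python
import time


class TokenBucket:
    """Minimal token bucket: refills continuously, holds at most `bucket_capacity` tokens."""

    def __init__(self, tokens_per_minute, bucket_capacity, clock=time.monotonic):
        self.rate = tokens_per_minute / 60.0  # refill rate in tokens per second
        self.capacity = bucket_capacity
        self.tokens = float(bucket_capacity)  # assumption: the bucket starts full
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return True on success, False otherwise."""
        now = self.clock()
        # Add the tokens earned since the last call, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def build_buckets(token_config):
    """Build a default bucket plus one bucket per key in `url_specific_tokens`."""
    default = TokenBucket(token_config["tokens_per_minute"], token_config["bucket_capacity"])
    specific = {
        key: TokenBucket(cfg["tokens_per_minute"], cfg["bucket_capacity"])
        for key, cfg in token_config.get("url_specific_tokens", {}).items()
    }
    return default, specific


def bucket_for(url, default, specific):
    """Pick the bucket whose key is a substring of the URL (assumed matching rule)."""
    for key, bucket in specific.items():
        if key in url:
            return bucket
    return default


# The "TokenBucket" section of the config above, as a Python dict.
TOKEN_CONFIG = {
    "tokens_per_minute": 20,
    "bucket_capacity": 5,
    "url_specific_tokens": {
        "export.arxiv": {"tokens_per_minute": 19, "bucket_capacity": 1}
    },
}
```

Under these assumptions, the default bucket allows a burst of up to 5 requests and then about one request every 3 seconds, while URLs containing "export.arxiv" get no burst beyond a single request and slightly under 20 requests per minute. A caller would pick a bucket with `bucket_for(url, default, specific)` and wait or retry whenever `try_acquire()` returns False.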