RobustCrawl
A library for robust crawling, built on a proxy pool and a token bucket. It supports both browser-based and requests-based fetching.
Install
pip install robust_crawl
playwright install chrome
export OPENAI_API_KEY="yourkey"
export OPENAI_API_BASE="your base" # optional
brew install go
brew install mihomo
Place your imported proxy file (.yml / .yaml) in ./config
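The install steps above can be sanity-checked before the first crawl. The following is a minimal sketch, not part of robust_crawl: the `preflight` helper is hypothetical, and it assumes the Homebrew package exposes a `mihomo` executable on PATH.

```python
import os
import shutil
from pathlib import Path


def preflight(config_dir="./config", env=None, which=shutil.which):
    """Return a list of setup problems; an empty list means nothing was found.

    Hypothetical helper, not a robust_crawl API. `env` and `which` are
    injectable so the check can be exercised without touching the real system.
    """
    env = os.environ if env is None else env
    problems = []

    # The install steps export OPENAI_API_KEY; OPENAI_API_BASE is optional.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    # Assumption: `brew install mihomo` puts a `mihomo` binary on PATH.
    if which("mihomo") is None:
        problems.append("mihomo not found on PATH")

    # The imported proxy file is expected as .yml / .yaml under ./config.
    cfg = Path(config_dir)
    proxy_files = (
        [p for p in cfg.iterdir() if p.suffix in {".yml", ".yaml"}]
        if cfg.is_dir()
        else []
    )
    if not proxy_files:
        problems.append(f"no .yml/.yaml proxy file in {config_dir}")

    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("setup problem:", problem)
```

Running the file directly prints one line per missing prerequisite and prints nothing when all three checks pass.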
Config
Save the following as ./config/robust_crawl_config.json:
{
  "wait_sec": 10,
  "max_concurrent_requests": 500,
  "GPT": {
    "model_type": "gpt-3.5-turbo"
  },
  "TokenBucket": {
    "tokens_per_minute": 20,
    "bucket_capacity": 5,
    "url_specific_tokens": {
      "export.arxiv": {
        "tokens_per_minute": 19,
        "bucket_capacity": 1
      }
    }
  },
  "Proxy": {
    "is_enabled": true,
    "start_port": 33333,
    "proxies_dir": "./proxies"
  },
  "ContextPool": {
    "num_contexts": 2,
    "work_contexts": 15,
    "have_proxy": true,
    "duplicate_proxies": false,
    "ensure_none_proxies": true,
    "download_pdf": false,
    "downloads_path": "./output/browser_downloads",
    "preference_path": "./output/broswer_config",
    "context_lifetime": 60,
    "context_cooling_time": 1
  }
}
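To illustrate how the TokenBucket settings above could be interpreted, here is a generic token-bucket sketch. It is not robust_crawl's implementation. It assumes that `tokens_per_minute` is the refill rate, that `bucket_capacity` is the maximum burst size, and that each key under `url_specific_tokens` is matched as a substring of the request URL. The names `TokenBucket`, `buckets_from_config`, `bucket_for`, and `load_token_bucket_config` are hypothetical.

```python
import json
import time


class TokenBucket:
    """Generic token bucket: refills continuously, capped at `capacity`."""

    def __init__(self, tokens_per_minute, bucket_capacity, clock=time.monotonic):
        self.rate = tokens_per_minute / 60.0  # tokens added per second
        self.capacity = bucket_capacity
        self.tokens = float(bucket_capacity)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return True on success."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def load_token_bucket_config(path="./config/robust_crawl_config.json"):
    """Read only the "TokenBucket" section of the config file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["TokenBucket"]


def buckets_from_config(tb_cfg, clock=time.monotonic):
    """Build the default bucket plus one bucket per url_specific_tokens key."""
    default = TokenBucket(tb_cfg["tokens_per_minute"], tb_cfg["bucket_capacity"], clock)
    specific = {
        key: TokenBucket(v["tokens_per_minute"], v["bucket_capacity"], clock)
        for key, v in tb_cfg.get("url_specific_tokens", {}).items()
    }
    return default, specific


def bucket_for(url, default, specific):
    """Assumption: a key applies when it appears anywhere in the URL."""
    for key, bucket in specific.items():
        if key in url:
            return bucket
    return default
```

Under these assumptions, a URL containing `export.arxiv` gets a bucket that allows no bursts (capacity 1) and refills at 19 tokens per minute, while every other URL shares the default bucket with bursts of up to 5 requests and a refill rate of 20 per minute.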