A robust crawler library based on a proxy pool and a token bucket, with support for both browser and requests.

RobustCrawl

Install

pip install robust_crawl
playwright install chrome
export OPENAI_API_KEY="yourkey"  
export OPENAI_API_BASE="your base" # optional

brew install go

brew install mihomo

Place your imported proxy file (.yml / .yaml) in ./config.
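
The install steps above can be checked with a short preflight script. This is a minimal sketch and not part of RobustCrawl: `check_setup` is a hypothetical helper, and it assumes that the Homebrew package exposes a binary named `mihomo` on PATH and that the proxy file sits directly in ./config.

```python
import os
import shutil
from pathlib import Path


def check_setup(config_dir="./config", env=None, which=shutil.which):
    """Return a list of problems found; an empty list means the setup looks complete.

    `env` and `which` are injectable so the check can be tested without
    touching the real environment.
    """
    env = os.environ if env is None else env
    problems = []

    # OPENAI_API_KEY is required; OPENAI_API_BASE is optional, so it is not checked.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    # Assumption: the proxy client is reachable as `mihomo` on PATH.
    if which("mihomo") is None:
        problems.append("mihomo was not found on PATH")

    # Look for at least one imported proxy file (.yml / .yaml) in the config directory.
    config_path = Path(config_dir)
    proxy_files = []
    if config_path.is_dir():
        proxy_files = [p for p in config_path.iterdir() if p.suffix in {".yml", ".yaml"}]
    if not proxy_files:
        problems.append(f"no .yml/.yaml proxy file found in {config_dir}")

    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print("missing:", problem)
```

Running the script from the project root prints one line per missing prerequisite and prints nothing when all three checks pass.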

Config

Save the following as ./config/robust_crawl_config.json:

{
    "wait_sec": 10,
    "max_concurrent_requests": 500,
    "GPT": {
        "model_type": "gpt-3.5-turbo"
    },
    "TokenBucket": {
        "tokens_per_minute": 20,
        "bucket_capacity": 5,
        "url_specific_tokens": {
            "export.arxiv": {
                "tokens_per_minute": 19,
                "bucket_capacity": 1
            }
        }
    },
    "Proxy": {
        "is_enabled": true,
        "start_port": 33333,
        "proxies_dir": "./proxies"
    },
    "ContextPool": {
        "num_contexts": 2,
        "work_contexts": 15,
        "have_proxy": true,
        "duplicate_proxies": false,
        "ensure_none_proxies": true,
        "download_pdf": false,
        "downloads_path": "./output/browser_downloads",
        "preference_path": "./output/broswer_config",
        "context_lifetime": 60,
        "context_cooling_time": 1
    }
}
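
To give a feel for what the TokenBucket section controls, here is a generic token bucket sketch. It is not RobustCrawl's own implementation: the `TokenBucket` class and `limits_for` function are hypothetical names, and matching `url_specific_tokens` keys as substrings of the URL is an assumption. In this model, `tokens_per_minute` is the refill rate and `bucket_capacity` is the largest burst allowed.

```python
import time


class TokenBucket:
    """Generic token bucket: refills continuously, allows bursts up to the capacity."""

    def __init__(self, tokens_per_minute, bucket_capacity, clock=time.monotonic):
        self.tokens_per_minute = tokens_per_minute
        self.capacity = bucket_capacity
        self.tokens = float(bucket_capacity)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return True when the request may proceed."""
        now = self.clock()
        refill = (now - self.last) * self.tokens_per_minute / 60.0
        self.tokens = min(self.capacity, self.tokens + refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def limits_for(url, bucket_config):
    """Return (tokens_per_minute, bucket_capacity) for a URL.

    Assumption: a key of url_specific_tokens applies when it occurs in the URL.
    """
    for pattern, limits in bucket_config.get("url_specific_tokens", {}).items():
        if pattern in url:
            return limits["tokens_per_minute"], limits["bucket_capacity"]
    return bucket_config["tokens_per_minute"], bucket_config["bucket_capacity"]


# The TokenBucket section of the config above, as a Python dict.
bucket_config = {
    "tokens_per_minute": 20,
    "bucket_capacity": 5,
    "url_specific_tokens": {
        "export.arxiv": {"tokens_per_minute": 19, "bucket_capacity": 1}
    },
}

rate, capacity = limits_for("http://export.arxiv.org/api/query", bucket_config)
arxiv_bucket = TokenBucket(rate, capacity)
```

Under these assumptions, requests to export.arxiv get a capacity of 1, so they cannot burst and are spaced at roughly 19 per minute, while other URLs may burst up to 5 requests before settling at 20 per minute.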
