RobustCrawl

A library for robust crawling, built on a proxy pool and a token bucket rate limiter, with support for both browser-based and requests-based fetching.

Install

pip install robust_crawl
playwright install chrome
export OPENAI_API_KEY="yourkey"  
export OPENAI_API_BASE="your base" # optional

brew install go

brew install mihomo

Place your imported proxy file (.yml / .yaml) in ./config
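
The install steps above leave two things that are easy to forget: the OPENAI_API_KEY variable and the proxy file in ./config. The sketch below is a small preflight check for both. It is not part of robust_crawl; `check_setup` is a hypothetical helper name, and it assumes you intend to run with the proxy enabled.

```python
import os
from pathlib import Path
from typing import List, Mapping


def check_setup(config_dir: str = "./config",
                env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of setup problems; an empty list means both checks passed.

    Hypothetical helper, not a robust_crawl API.
    """
    problems: List[str] = []

    # The install steps export OPENAI_API_KEY; OPENAI_API_BASE is optional,
    # so it is not checked here.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    # The install steps ask for a .yml / .yaml proxy file in ./config.
    cfg = Path(config_dir)
    proxy_files = []
    if cfg.is_dir():
        proxy_files = [p for p in cfg.iterdir() if p.suffix in {".yml", ".yaml"}]
    if not proxy_files:
        problems.append(f"no .yml/.yaml proxy file found in {config_dir}")

    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print("setup problem:", problem)
```

Running the file directly prints one line per problem and nothing when both checks pass.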

Config

Save the following as ./config/robust_crawl_config.json:

{
    "wait_sec": 10,
    "max_concurrent_requests": 500,
    "GPT": {
        "model_type": "gpt-3.5-turbo"
    },
    "TokenBucket": {
        "tokens_per_minute": 20,
        "bucket_capacity": 5,
        "url_specific_tokens": {
            "export.arxiv": {
                "tokens_per_minute": 19,
                "bucket_capacity": 1
            }
        }
    },
    "Proxy": {
        "is_enabled": true,
        "start_port": 33333,
        "proxies_dir": "./proxies"
    },
    "ContextPool": {
        "num_contexts": 2,
        "work_contexts": 15,
        "have_proxy": true,
        "duplicate_proxies": false,
        "ensure_none_proxies": true,
        "download_pdf": false,
        "downloads_path": "./output/browser_downloads",
        "preference_path": "./output/broswer_config",
        "context_lifetime": 60,
        "context_cooling_time": 1
    }
}
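
To make the TokenBucket settings concrete, here is a minimal token bucket sketch driven by the same keys (tokens_per_minute, bucket_capacity, url_specific_tokens). It is an illustration of the general technique, not robust_crawl's actual implementation. The class names are hypothetical, and the rule that a url_specific_tokens key applies when it appears as a substring of the URL is an assumption.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Refills at tokens_per_minute and holds at most bucket_capacity tokens."""

    def __init__(self, tokens_per_minute: float, bucket_capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = tokens_per_minute / 60.0   # tokens added per second
        self.capacity = float(bucket_capacity)
        self.tokens = float(bucket_capacity)   # assumption: bucket starts full
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False instead of blocking."""
        now = self.clock()
        # Credit the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class BucketRegistry:
    """Picks a per-URL bucket, falling back to the default bucket."""

    def __init__(self, cfg: dict,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.default = TokenBucket(cfg["tokens_per_minute"],
                                   cfg["bucket_capacity"], clock)
        self.specific: Dict[str, TokenBucket] = {
            pattern: TokenBucket(o["tokens_per_minute"],
                                 o["bucket_capacity"], clock)
            for pattern, o in cfg.get("url_specific_tokens", {}).items()
        }

    def bucket_for(self, url: str) -> TokenBucket:
        # Assumption: a key matches when it is a substring of the URL.
        for pattern, bucket in self.specific.items():
            if pattern in url:
                return bucket
        return self.default


if __name__ == "__main__":
    # Same values as the "TokenBucket" section of the config above.
    cfg = {
        "tokens_per_minute": 20,
        "bucket_capacity": 5,
        "url_specific_tokens": {
            "export.arxiv": {"tokens_per_minute": 19, "bucket_capacity": 1}
        },
    }
    registry = BucketRegistry(cfg)
    bucket = registry.bucket_for("http://export.arxiv.org/api/query")
    print("first request allowed:", bucket.try_acquire())
    print("immediate second request allowed:", bucket.try_acquire())
```

Under this reading, bucket_capacity bounds the burst size and tokens_per_minute bounds the sustained rate. With the values above, URLs containing export.arxiv get no burst (capacity 1) and roughly one request every 60/19 seconds, while all other URLs may burst up to 5 requests and then settle at 20 per minute.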
