A library for robust crawling, built on a proxy pool and a token bucket. It supports both browser-based and requests-based crawling.

Project description

RobustCrawl

Install

pip install robust_crawl
playwright install chrome
export OPENAI_API_KEY="yourkey"  
export OPENAI_API_BASE="your base" # optional

brew install go

brew install mihomo

Place the imported proxy file (.yml / .yaml) in ./config
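
The install steps above leave two things that are easy to get wrong: the OPENAI_API_KEY environment variable and the proxy file in ./config. The following is a minimal preflight sketch using only the Python standard library. The function name check_setup is hypothetical and is not part of robust_crawl; it only checks what the steps above describe.

```python
import os
from pathlib import Path


def check_setup(config_dir="./config", env=None):
    """Return a list of readable problems; an empty list means the basics look fine."""
    env = os.environ if env is None else env
    problems = []

    # OPENAI_API_KEY is exported in the install steps; OPENAI_API_BASE is optional,
    # so it is not checked here.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    # The proxy file is expected as a .yml or .yaml file inside the config directory.
    config_path = Path(config_dir)
    proxy_files = []
    if config_path.is_dir():
        proxy_files = sorted(
            p.name for p in config_path.iterdir() if p.suffix in {".yml", ".yaml"}
        )
    if not proxy_files:
        problems.append(f"no .yml/.yaml proxy file found in {config_dir}")

    return problems
```

Calling check_setup() before starting a crawl and printing each returned string gives a quick view of what is still missing.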

Config

Save the following configuration as ./config/robust_crawl_config.json

{
    "wait_sec": 10,
    "max_concurrent_requests": 500,
    "GPT": {
        "model_type": "gpt-3.5-turbo"
    },
    "TokenBucket": {
        "tokens_per_minute": 20,
        "bucket_capacity": 5,
        "url_specific_tokens": {
            "export.arxiv": {
                "tokens_per_minute": 19,
                "bucket_capacity": 1
            }
        }
    },
    "Proxy": {
        "is_enabled": true,
        "start_port": 33333,
        "proxies_dir": "./proxies"
    },
    "ContextPool": {
        "num_contexts": 2,
        "work_contexts": 15,
        "have_proxy": true,
        "duplicate_proxies": false,
        "ensure_none_proxies": true,
        "download_pdf": false,
        "downloads_path": "./output/browser_downloads",
        "preference_path": "./output/broswer_config",
        "context_lifetime": 60,
        "context_cooling_time": 1
    }
}
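
The TokenBucket settings above read like a classic token bucket: tokens refill at tokens_per_minute, at most bucket_capacity tokens can accumulate, and url_specific_tokens overrides the defaults for some URLs. The sketch below illustrates that general algorithm and is not RobustCrawl's own implementation. The class name, try_acquire, pick_bucket_settings, and the substring matching of keys such as "export.arxiv" are all assumptions made for illustration.

```python
import time


class TokenBucket:
    """Classic token bucket: refills continuously, holds at most `capacity` tokens."""

    def __init__(self, tokens_per_minute, bucket_capacity, clock=time.monotonic):
        self.rate = tokens_per_minute / 60.0  # tokens added per second
        self.capacity = bucket_capacity
        self.tokens = float(bucket_capacity)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return True on success, False otherwise."""
        now = self.clock()
        # Add the tokens earned since the last call, capped at the capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def pick_bucket_settings(token_bucket_config, url):
    """Return (tokens_per_minute, bucket_capacity) for a URL.

    Assumption: a key in "url_specific_tokens" applies when it is a substring of the URL.
    """
    for key, override in token_bucket_config.get("url_specific_tokens", {}).items():
        if key in url:
            return override["tokens_per_minute"], override["bucket_capacity"]
    return token_bucket_config["tokens_per_minute"], token_bucket_config["bucket_capacity"]


# Values copied from the "TokenBucket" section of the config above.
token_bucket_config = {
    "tokens_per_minute": 20,
    "bucket_capacity": 5,
    "url_specific_tokens": {
        "export.arxiv": {"tokens_per_minute": 19, "bucket_capacity": 1}
    },
}
```

Under these assumptions, a request to an export.arxiv URL would get a bucket that allows no bursts (capacity 1) and refills slightly slower than the default, while other URLs could burst up to 5 requests before being limited to 20 per minute.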

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

robust_crawl-0.1.2.tar.gz (14.1 kB)

Uploaded Source

Built Distribution

robust_crawl-0.1.2-py3-none-any.whl (16.3 kB)

Uploaded Python 3

File details

Details for the file robust_crawl-0.1.2.tar.gz.

File metadata

  • Download URL: robust_crawl-0.1.2.tar.gz
  • Size: 14.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.10

File hashes

Hashes for robust_crawl-0.1.2.tar.gz

Algorithm    Hash digest
SHA256       fbb75b9c2e035dbea1b55fa6c8904ba77e66eb6a77a45b50d5c3bdffdc3839e4
MD5          32d12fbf5da68c2d7180016f1b7ed82d
BLAKE2b-256  efaa9292a9238353bc76829bed524eb22a7e63870cb2c1f5bb2d5d04ae8364ba
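
A downloaded file can be checked against the published SHA256 digest with Python's standard hashlib module. The sketch below is one way to do that; the helper names sha256_of and matches_published_hash are hypothetical, and the local file path in the usage comment assumes the archive was saved to the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 against a published digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the sdist was downloaded next to this script:
# matches_published_hash(
#     "robust_crawl-0.1.2.tar.gz",
#     "fbb75b9c2e035dbea1b55fa6c8904ba77e66eb6a77a45b50d5c3bdffdc3839e4",
# )
```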

File details

Details for the file robust_crawl-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: robust_crawl-0.1.2-py3-none-any.whl
  • Size: 16.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.10

File hashes

Hashes for robust_crawl-0.1.2-py3-none-any.whl

Algorithm    Hash digest
SHA256       979938eb11aecbb874cd6e3d5520d6225d9cec10b85768d8e8a4c8e42c4b94bc
MD5          90b9a3e6c0195176c0a0899f510770de
BLAKE2b-256  32c0a1f105327a941730b581c8bc875d743aa361088360f72590817c50737115
