A library for robust crawling, built on a proxy pool and a token bucket, with support for both browser-based and requests-based fetching.

Project description

RobustCrawl

A library for robust crawling, built on a proxy pool and a token bucket, with support for both browser-based and requests-based fetching.

Install

pip install robust_crawl
playwright install chrome
export OPENAI_API_KEY="yourkey"  
export OPENAI_API_BASE="your base" # optional

brew install go

brew install mihomo

Place your imported proxy file (.yml / .yaml) in ./config
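
Before the first crawl it can help to confirm that the steps above actually took effect. The following is a minimal sketch, not part of robust_crawl: the helper names `find_proxy_files` and `preflight` are hypothetical, and the check assumes the proxy file sits directly in ./config as described above.

```python
import os
from pathlib import Path


def find_proxy_files(config_dir):
    """Return the .yml/.yaml files found directly inside config_dir, sorted by name."""
    directory = Path(config_dir)
    if not directory.is_dir():
        return []
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in {".yml", ".yaml"}
    )


def preflight(config_dir="./config", env=None):
    """Collect human-readable setup problems; an empty list means nothing was found."""
    env = os.environ if env is None else env
    problems = []
    # OPENAI_API_BASE is optional in the install steps, so only the key is checked.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    if not find_proxy_files(config_dir):
        problems.append(f"no .yml/.yaml proxy file found in {config_dir}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("setup problem:", problem)
```

Running the script prints one line per missing piece and nothing when both the API key and a proxy file are present.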

Config

Save the following as ./config/robust_crawl_config.json:

{
    "wait_sec": 10,
    "max_concurrent_requests": 500,
    "GPT": {
        "model_type": "gpt-3.5-turbo"
    },
    "TokenBucket": {
        "tokens_per_minute": 20,
        "bucket_capacity": 5,
        "url_specific_tokens": {
            "export.arxiv": {
                "tokens_per_minute": 19,
                "bucket_capacity": 1
            }
        }
    },
    "Proxy": {
        "is_enabled": true,
        "start_port": 33333,
        "proxies_dir": "./proxies"
    },
    "ContextPool": {
        "num_contexts": 2,
        "work_contexts": 15,
        "have_proxy": true,
        "duplicate_proxies": false,
        "ensure_none_proxies": true,
        "download_pdf": false,
        "downloads_path": "./output/browser_downloads",
        "preference_path": "./output/broswer_config",
        "context_lifetime": 60,
        "context_cooling_time": 1
    }
}
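
To illustrate what the TokenBucket settings mean, here is a generic token bucket sketch. It is not robust_crawl's implementation: the `TokenBucket` class and `settings_for` helper below are hypothetical, and matching `url_specific_tokens` keys as substrings of the URL is an assumption made for this example.

```python
import time


class TokenBucket:
    """Generic token bucket: refills continuously, holds at most `bucket_capacity` tokens."""

    def __init__(self, tokens_per_minute, bucket_capacity, clock=time.monotonic):
        self.rate = tokens_per_minute / 60.0  # tokens added per second
        self.capacity = bucket_capacity
        self.tokens = float(bucket_capacity)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def try_acquire(self):
        """Take one token if available; return whether the request may proceed."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def settings_for(url, bucket_config):
    """Pick per-URL settings; substring matching on the key is an assumption."""
    for key, override in bucket_config.get("url_specific_tokens", {}).items():
        if key in url:
            return override
    return {
        "tokens_per_minute": bucket_config["tokens_per_minute"],
        "bucket_capacity": bucket_config["bucket_capacity"],
    }


if __name__ == "__main__":
    config = {
        "tokens_per_minute": 20,
        "bucket_capacity": 5,
        "url_specific_tokens": {
            "export.arxiv": {"tokens_per_minute": 19, "bucket_capacity": 1}
        },
    }
    bucket = TokenBucket(**settings_for("https://example.com/page", config))
    print([bucket.try_acquire() for _ in range(6)])
```

Under this model, `bucket_capacity` bounds the size of a burst and `tokens_per_minute` sets the sustained rate, so the `export.arxiv` override above would allow no bursts at all and slightly fewer than 20 requests per minute.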

Download files

Source Distribution

robust_crawl-0.1.3.tar.gz (14.1 kB)

Uploaded Source

Built Distribution

robust_crawl-0.1.3-py3-none-any.whl (16.4 kB)

Uploaded Python 3

File details

Details for the file robust_crawl-0.1.3.tar.gz.

File metadata

  • Download URL: robust_crawl-0.1.3.tar.gz
  • Upload date:
  • Size: 14.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.10

File hashes

Hashes for robust_crawl-0.1.3.tar.gz
Algorithm    Hash digest
SHA256       16405ca6de682a0fe261345c6afe4701660978377844854a1e0ec249b9ace945
MD5          bb971262585a875a49bc5b4a04dabba0
BLAKE2b-256  02809605493a761bcc9858f7f178838adb85e7b1219563e5211585695ec38149

File details

Details for the file robust_crawl-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: robust_crawl-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 16.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.10

File hashes

Hashes for robust_crawl-0.1.3-py3-none-any.whl
Algorithm    Hash digest
SHA256       760ed1af649c30b34eb796c6a35d0f14fa81f70feef090031260de70d7d1a7f0
MD5          b2d1f2b3336603e57d0a6b8350bf5a3d
BLAKE2b-256  be0524bdea678d2c58221a520a88aaaaecb7fbfa325b2c03575bba539af4df5a
