A library for building robust crawlers, based on a proxy pool and a token bucket, with support for both browser-based and requests-based crawling.

RobustCrawl

Install

pip install robust_crawl
playwright install chrome
export OPENAI_API_KEY="yourkey"  
export OPENAI_API_BASE="your base" # optional

brew install go

brew install mihomo

Place the imported proxy file (.yml / .yaml) in ./config.
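
The install steps above leave three things that are easy to miss: the OPENAI_API_KEY variable, the mihomo binary, and a proxy file in ./config. The following is a minimal sketch of a pre-flight check, not part of robust_crawl itself. The function name check_setup and its parameters are hypothetical and exist only for illustration.

```python
import os
import shutil
from pathlib import Path


def check_setup(config_dir="./config", env=None, which=shutil.which):
    """Return a list of problems; an empty list means the basics look fine.

    Hypothetical helper, not a robust_crawl API. `env` and `which` are
    injectable so the check can be exercised without touching the real system.
    """
    env = os.environ if env is None else env
    problems = []

    # The install steps export OPENAI_API_KEY; OPENAI_API_BASE is optional.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    # mihomo is installed via Homebrew above; here we only look for it on PATH.
    if which("mihomo") is None:
        problems.append("mihomo binary not found on PATH")

    # The proxy file is expected as a .yml or .yaml file inside the config directory.
    cfg = Path(config_dir)
    proxy_files = []
    if cfg.is_dir():
        proxy_files = [p for p in cfg.iterdir() if p.suffix in (".yml", ".yaml")]
    if not proxy_files:
        problems.append(f"no .yml/.yaml proxy file found in {config_dir}")

    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print("missing:", problem)
```

Running the script prints one line per missing prerequisite and prints nothing when all three checks pass.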

Config

Save the following as ./config/robust_crawl_config.json:

{
        "max_concurrent_requests": 500,
        "GPT": {
            "model_type": "gpt-3.5-turbo"
        },
        "TokenBucket": {
            "tokens_per_minute": 20,
            "bucket_capacity": 5,
            "url_specific_tokens": {
                "export.arxiv": {
                    "tokens_per_minute": 19,
                    "bucket_capacity": 1
                }
            }
        },
        "Proxy": {
            "is_enabled": true,
            "core_type": "mihomo",
            "start_port": 33333,
            "config_paths": [
                "the relative path to the proxy file, imported by clash-verge core",
                "./config/proxy.yaml"
            ]
        },
        "ContextPool": {
            "num_contexts": 5,
            "work_contexts": 15,
            "have_proxy": true,
            "duplicate_proxies": false,
            "ensure_none_proxies": true,
            "download_pdf": false,
            "downloads_path": "./output/browser_downloads",
            "preference_path": "./output/browser_config",
            "context_lifetime": 60,
            "context_cooling_time": 1
        }
}
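
To make the TokenBucket settings more concrete, here is a minimal sketch of a token bucket with per-URL overrides. It is not robust_crawl's implementation. The class names, the substring matching of keys such as "export.arxiv" against the URL, and the choice to start each bucket full are all assumptions made for illustration.

```python
import time


class TokenBucket:
    """Illustrative token bucket: refills continuously, holds at most `bucket_capacity` tokens."""

    def __init__(self, tokens_per_minute, bucket_capacity, clock=time.monotonic):
        self.rate = tokens_per_minute / 60.0   # tokens added per second
        self.capacity = bucket_capacity
        self.tokens = float(bucket_capacity)   # assumption: the bucket starts full
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return False instead of blocking."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class BucketRegistry:
    """Picks a URL-specific bucket when a configured key occurs in the URL (assumed matching rule)."""

    def __init__(self, cfg, clock=time.monotonic):
        self.default = TokenBucket(cfg["tokens_per_minute"], cfg["bucket_capacity"], clock)
        self.specific = {
            key: TokenBucket(v["tokens_per_minute"], v["bucket_capacity"], clock)
            for key, v in cfg.get("url_specific_tokens", {}).items()
        }

    def bucket_for(self, url):
        for key, bucket in self.specific.items():
            if key in url:
                return bucket
        return self.default


if __name__ == "__main__":
    # Same values as the "TokenBucket" section of the config above.
    registry = BucketRegistry({
        "tokens_per_minute": 20,
        "bucket_capacity": 5,
        "url_specific_tokens": {
            "export.arxiv": {"tokens_per_minute": 19, "bucket_capacity": 1}
        },
    })
    bucket = registry.bucket_for("http://export.arxiv.org/api/query")
    print(bucket.try_acquire(), bucket.try_acquire())
```

Under these assumptions, the "export.arxiv" entry allows no bursts (capacity 1) and roughly one request every 60/19 ≈ 3.2 seconds, while all other URLs share a bucket that allows bursts of up to 5 requests and refills at 20 tokens per minute.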
